A dataset is a file, or group of files, containing JSON documents, one per line.
Note: The content of each line of the file is a JSON document. However, the file itself is a plain text file read line by line; it is NOT a JSON array.
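Because the file is not a JSON array, a consumer parses each line independently. A minimal sketch of reading a dataset file this way (the function name `read_dataset` is illustrative, not part of any API):

```python
import json

# Each line of a dataset file is its own JSON document.
# Calling json.load() on the whole file would fail, since the
# file as a whole is not valid JSON -- it must be read line by line.
def read_dataset(path):
    documents = []
    with open(path, encoding='utf8') as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines
                documents.append(json.loads(line))
    return documents
```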
```
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the fields configured in the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ The JSON document id does not need to be unique"}
```
A dataset must have a ".metadata" file that defines the information needed to process the dataset.
```
{
  "processFields": ["title", "content"],
  "splitRegex": "\n"
}
```
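As a sketch of how these two settings might be applied (the function `apply_metadata` and the exact processing semantics here are assumptions for illustration, not part of the dataset format): `processFields` selects which fields of each document are read, and `splitRegex` splits their text into fragments.

```python
import re

# Hypothetical illustration of the two metadata settings:
# only the fields listed in "processFields" are read, and their
# text is split into fragments wherever "splitRegex" matches.
def apply_metadata(document, metadata):
    fragments = []
    pattern = re.compile(metadata['splitRegex'])
    for field in metadata['processFields']:
        if field in document:
            fragments.extend(pattern.split(document[field]))
    return fragments

metadata = {"processFields": ["title", "content"], "splitRegex": "\n"}
doc = {"id": "A0003", "content": "Some text.\nMore text in a new line."}
# apply_metadata(doc, metadata)
# → ['Some text.', 'More text in a new line.']
```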
Creating a new dataset is a straightforward process that only requires you to follow the format explained above. The following example illustrates some recommendations and standards that are worth following.
Suppose the source data is a CSV file with semicolon-separated fields:

```
"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"
```
This CSV file can be read and converted to JSON output with the following Python script:
```python
import csv
import json

# Constants to make everything easier.
# Change the paths to match your environment.
CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# Maximum number of documents (lines) per output file
COMMENTS_PER_FILE = 14000

documents = []
cnt = 0
file_cnt = 1

def write_output(path, docs):
    # One JSON document per line, as the dataset format requires
    with open(path, 'w', encoding='utf8') as f:
        print('creating file')
        for doc in docs:
            json.dump(doc, f)
            f.write('\n')

# The CSV is separated by semicolons; csv.reader also removes the quotes
with open(CSV_PATH, encoding='utf8') as fp:
    for row in csv.reader(fp, delimiter=';'):
        if len(row) >= 3:
            documents.append({
                'id': row[0].strip(),
                'title': row[1].strip(),
                'comment': row[2].strip()
            })
            cnt += 1
        else:
            print('Missing fields: ' + ';'.join(row))
        # Flush to a new output file every COMMENTS_PER_FILE documents
        if cnt == COMMENTS_PER_FILE:
            write_output(JSON_PATH.format(file_cnt), documents)
            documents = []
            file_cnt += 1
            cnt = 0

# Write whatever is left to the last file
if documents:
    write_output(JSON_PATH.format(file_cnt), documents)
```
This would be the output file:
```
{"id": "A0001", "title": "This is a title", "comment": "Some text"}
{"id": "A0002", "title": "This is a title", "comment": "Some text, more text after a comma."}
{"id": "A0003", "title": "Title", "comment": "Some text.\nMore text in a new line."}
{"id": "A0004", "title": "My title", "comment": "Some text for a dummy ID"}
{"id": "A0005", "title": "JSON title", "comment": "JSON documents"}
{"id": "A0006", "title": "JSON title", "comment": "JSON documents"}
{"id": "A0007", "title": "JSON title", "comment": "^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
```
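A quick sanity check on a generated dataset file is to confirm that every non-blank line parses as a standalone JSON document. A small sketch (the function name `validate_dataset` is an assumption for illustration):

```python
import json

def validate_dataset(path):
    """Return the number of valid lines; raise on the first malformed one."""
    count = 0
    with open(path, encoding='utf8') as fp:
        for n, line in enumerate(fp, start=1):
            if line.strip():
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    raise ValueError(f'line {n} is not valid JSON: {exc}')
                count += 1
    return count
```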