A dataset is a file, or group of files, containing JSON documents, one per line.
Note |
---|
Each line of the file is a JSON document, but the file itself is a plain text file that is read line by line; it is NOT a JSON array. |
```json
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
```
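Because every line is an independent JSON document, a dataset file can be consumed with a plain line-by-line loop rather than a single `json.load` call on the whole file. The following is a minimal sketch of that idea; the function name `read_dataset` and the decision to skip blank lines are assumptions made for illustration, not part of any documented tooling.

```python
import json


def read_dataset(lines):
    """Yield one dict per non-empty line of a line-delimited JSON dataset.

    `lines` is any iterable of strings, for example an open text file.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(
                f"line {line_number} is not valid JSON: {error}"
            ) from error


# Two lines taken from the sample dataset; "\\n" keeps the JSON escape intact
sample = [
    '{"id":"A0001","title":"This is a title","content":"Some text."}',
    '{"id":"A0003","content":"Some text.\\nMore text in a new line."}',
]
documents = list(read_dataset(sample))
print(documents[0]["id"])  # A0001
```

To read a real file, pass the open file object instead of the list, for example `with open(path, encoding="utf8") as fp: documents = list(read_dataset(fp))`, where `path` is a placeholder for your dataset file.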
A dataset must have a ".metadata" file used to define information needed to process the dataset.
```json
{
  "processFields" : ["title","content"],
  "splitRegex" : "\n"
}
```
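This page does not spell out how the two metadata keys are applied. One plausible reading, consistent with the sample dataset, is that `processFields` selects which fields of each document are processed and `splitRegex` is a regular expression used to split the selected text into pieces. The sketch below illustrates that reading only; `extract_fragments` is a hypothetical helper, and skipping documents that lack a configured field is an assumption.

```python
import json
import re


def extract_fragments(document, metadata):
    """Return the text pieces found in the configured fields of one document."""
    splitter = re.compile(metadata["splitRegex"])
    fragments = []
    for field in metadata["processFields"]:
        text = document.get(field)
        if text is None:
            continue  # assumption: a document may omit a configured field
        # Drop empty pieces produced by consecutive separators
        fragments.extend(piece for piece in splitter.split(text) if piece)
    return fragments


# The metadata shown above; "\\n" keeps the JSON escape for a newline intact
metadata = json.loads(
    '{ "processFields" : ["title","content"], "splitRegex" : "\\n" }'
)
document = {
    "id": "A0003",
    "content": "Some text.\nMore text in a new line.",
    "non-import-text": "not listed in processFields, so it is skipped",
}
print(extract_fragments(document, metadata))
# ['Some text.', 'More text in a new line.']
```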
Creating a new dataset is straightforward: it only needs to follow the format explained above. The following are some recommendations and conventions that are worth following.
```
"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"
```
This CSV file can be read and converted to JSON output with the following Python script:
```python
import csv
import json

# Constants to make everything easier.
# Change the paths to match your environment.
# The CSV is separated by semicolons and its values are double-quoted.
CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# Maximum number of documents (lines) per output file
COMMENTS_PER_FILE = 14000


def write_file(documents, file_number):
    """Write one JSON document per line to a numbered output file."""
    with open(JSON_PATH.format(file_number), 'w', encoding='utf8') as f:
        print('creating file')
        for document in documents:
            json.dump(document, f, ensure_ascii=False, separators=(',', ':'))
            f.write('\n')


documents = []
file_count = 1

# newline='' lets the csv module handle line endings itself
with open(CSV_PATH, encoding='utf8', newline='') as fp:
    for row in csv.reader(fp, delimiter=';', quotechar='"'):
        if not row:
            continue  # skip empty lines
        if len(row) < 3:
            print('Missing fields: ' + ';'.join(row))
            continue
        documents.append({
            'id': row[0].strip(),
            'title': row[1].strip(),
            # The sample CSV writes line breaks as a literal \n; convert them
            # to real newlines so that json.dump serializes them as \n
            'comment': row[2].strip().replace('\\n', '\n'),
        })
        # Flush a full batch to disk and start the next file
        if len(documents) == COMMENTS_PER_FILE:
            write_file(documents, file_count)
            documents = []
            file_count += 1

# Write the remaining documents, if any
if documents:
    write_file(documents, file_count)
```
This would be the output file:
```json
{"id":"A0001","title":"This is a title","comment":"Some text"}
{"id":"A0002","title":"This is a title","comment":"Some text, more text after a comma."}
{"id":"A0003","title":"Title","comment":"Some text.\nMore text in a new line."}
{"id":"A0004","title":"My title","comment":"Some text for a dummy ID"}
{"id":"A0005","title":"JSON title","comment":"JSON documents"}
{"id":"A0006","title":"JSON title","comment":"JSON documents"}
{"id":"A0007","title":"JSON title","comment":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
```
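After a conversion like this, it can be worth checking that the generated file really follows the dataset format before using it. The sketch below is one possible check, not an official validator: `find_problems` is a hypothetical helper, and treating `id` as a required field is an assumption.

```python
import json


def find_problems(lines, required_fields=("id",)):
    """Return (line_number, message) pairs for lines that break the format."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            document = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "not valid JSON"))
            continue
        # Each line must be a JSON object, not an array or a bare value
        if not isinstance(document, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        for field in required_fields:
            if field not in document:
                problems.append((line_number, f"missing field '{field}'"))
    return problems


sample = [
    '{"id":"A0001","title":"This is a title","comment":"Some text"}',
    '[1, 2]',
    '{"title":"No id here"}',
    'oops',
]
for line_number, message in find_problems(sample):
    print(line_number, message)
```

An empty result means every non-blank line parsed as a JSON object containing the required fields.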