A dataset is a file, or group of files, containing JSON documents, one per line.
Note |
---|
Each line of the file is a JSON document, but the file itself is a plain text file that is read line by line; it is NOT a JSON array. |
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the fields configured in the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^^JSON^^^^^^^^^^^^^^ JSON documents do not need to be unique"} |
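Because the file is a sequence of independent JSON documents rather than one JSON array, it has to be parsed one line at a time. The following is a minimal Python sketch of that idea; the helper name `iter_documents`, the tolerance for blank lines, and the in-memory sample are illustrative assumptions, not part of any tool described on this page.

```python
import io
import json


def iter_documents(lines):
    """Yield one dict per non-empty line of a dataset file (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines (e.g. a trailing newline) are skipped
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


# In-memory stand-in for a dataset file; a real file opened with
# open(path, encoding="utf8") can be passed in the same way.
sample = io.StringIO(
    '{"id":"A0001","title":"This is a title","content":"Some text."}\n'
    '{"id":"A0003","content":"Some text.\\nMore text in a new line."}\n'
)
documents = list(iter_documents(sample))
print(len(documents))

# Parsing the whole file at once fails, because it is not a single JSON value.
try:
    json.loads(sample.getvalue())
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
print(whole_file_is_json)
```

Reading lazily with a generator keeps memory use flat, which matters when a dataset is split across several large files.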
Every dataset must include a ".metadata" file that defines the information needed to process it.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
{ "processFields" : ["title","content"], "splitRegex" : "\n" } |
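This page does not spell out how the two settings are applied, so the following Python sketch is only one plausible reading. It assumes that `processFields` lists the document fields whose text is used, and that `splitRegex` is a regular expression used to split that text into pieces. The helper name `extract_pieces` is hypothetical.

```python
import json
import re

# Same settings as the ".metadata" example above, parsed from JSON.
metadata = json.loads('{ "processFields" : ["title","content"], "splitRegex" : "\\n" }')


def extract_pieces(document, metadata):
    """Return the text pieces of the configured fields (assumed semantics)."""
    splitter = re.compile(metadata["splitRegex"])
    pieces = []
    for field in metadata["processFields"]:
        text = document.get(field)
        if text is None:
            continue  # a field may be absent, e.g. document A0003 has no "title"
        # Assumption: empty pieces produced by the split are dropped.
        pieces.extend(part for part in splitter.split(text) if part)
    return pieces


document = json.loads('{"id":"A0003","content":"Some text.\\nMore text in a new line."}')
print(extract_pieces(document, metadata))
```

Under this reading, fields that are not listed in `processFields` (such as `id` or `non-import-text` in the sample dataset) are simply ignored.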
Creating a new dataset is straightforward: you only need to follow the format described above. The recommendations and conventions below are worth following. For example, source data may start out as a CSV file such as:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
"A0001","This is a title","Some text"
"A0002","This is a title","Some text, more text after a comma."
"A0003","","Some text.\nMore text in a new line."
"NON-IMPORTANT-ID","Title 1","Only the field configured on the metadata file will be used"
"AAAA","My title","Some text for a dummy ID"
"AAAA","JSON title","JSON documents"
"AAAA","JSON title","JSON documents"
"AAAA","JSON title","^^^^^^^^^^^^^^ JSON documents do not need to be unique" |
This CSV file can be read and converted to JSON output with the following Python script:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
import csv
import json

# Constants to make everything easier. Change the paths to match your environment.
CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'  # {} is replaced by the output file number
# The sample CSV above is comma-separated; use '\t' for tab-separated input.
DELIMITER = ','
# Maximum number of JSON documents (lines) per output file
DOCS_PER_FILE = 14000


def write_documents(path, documents):
    """Write one JSON document per line (not a JSON array)."""
    print('creating file ' + path)
    with open(path, 'w', encoding='utf8') as f:
        for document in documents:
            json.dump(document, f)
            f.write('\n')


documents = []
file_cnt = 1
# newline='' lets the csv module handle line endings inside quoted fields
with open(CSV_PATH, encoding='utf8', newline='') as fp:
    for row in csv.reader(fp, delimiter=DELIMITER):
        # Each row needs three columns: id, title, content
        if len(row) < 3:
            print('Missing columns: ' + repr(row))
            continue
        documents.append({
            'id': row[0].strip(),
            'title': row[1].strip(),
            'content': row[2].strip(),
        })
        # Flush to disk once the current output file is full
        if len(documents) == DOCS_PER_FILE:
            write_documents(JSON_PATH.format(file_cnt), documents)
            documents = []
            file_cnt += 1

# Write the remaining documents, if any
if documents:
    write_documents(JSON_PATH.format(file_cnt), documents) |