A dataset is a file, or group of files, containing JSON documents, one per line.
Note |
---|
Each line of the file is a JSON document, but the file itself is a plain text file that is read line by line; it is NOT a JSON array. |
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^^JSON documents do not need to be unique"} |
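As a minimal sketch of how such a file can be read, the Python snippet below parses one JSON document per line. The function name `read_dataset` is hypothetical, and skipping blank lines is an assumption; this page does not say how the actual importer treats them.

```python
import json


def read_dataset(lines):
    """Yield one parsed JSON document per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(
                f"line {line_number} is not a valid JSON document: {err}"
            ) from err


# A few lines taken from the example dataset above.
sample = [
    '{"id":"A0001","title":"This is a title","content":"Some text."}',
    '{"id":"A0003","content":"Some text.\\nMore text in a new line."}',
    '{"id":"AAAA","content":"JSON documents"}',
    '{"id":"AAAA","content":"JSON documents"}',
]
documents = list(read_dataset(sample))
```

Because an open text file iterates line by line, the same function can be called as `read_dataset(open(path, encoding="utf-8"))`. Note that the `\n` inside a JSON string is an escape sequence within the document and does not start a new line in the file.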
Every dataset must include a ".metadata" file that defines the information needed to process it.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
{ "processFields" : ["title","content"], "splitRegex" : "\n" } |
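This page does not spell out exactly how the two metadata keys are applied, so the following Python sketch rests on an assumption: only the fields listed in `processFields` are read from each document, and the text of each field is split into segments with `splitRegex`. The function name `extract_segments` is hypothetical.

```python
import json
import re

# The metadata shown above, parsed from its JSON form.
metadata = json.loads('{ "processFields" : ["title","content"], "splitRegex" : "\\n" }')


def extract_segments(document, metadata):
    """Collect the text segments of the configured fields of one document."""
    segments = []
    for field in metadata["processFields"]:
        text = document.get(field)
        if text is None:
            continue  # a document may omit a configured field, e.g. "title"
        # Assumption: splitRegex splits the field text; empty parts are dropped.
        parts = re.split(metadata["splitRegex"], text)
        segments.extend(part for part in parts if part)
    return segments


document = json.loads('{"id":"A0003","content":"Some text.\\nMore text in a new line."}')
segments = extract_segments(document, metadata)
```

Under this reading, fields that are not listed in `processFields`, such as `non-import-text` in the example dataset, are simply never looked at.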
Creating a new dataset is a straightforward process: it only needs to follow the format explained above. This example dataset illustrates some recommendations and conventions that are worth following.