
A dataset is a file or group of files containing JSON documents, one per line. 

Each line of the file is a complete JSON document. The file itself, however, is a plain text file that is read line by line; it is NOT a JSON array.

Example dataset file content
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ JSON document does not need to be unique"}


  • Datasets must be located under the "Dataset" folder in your Saga working directory (see the example layout after this list).
  • All dataset files must be in the same folder, at the same level; no subfolders will be processed.
  • The folder name will be used as the dataset name in the user interface.
  • All files under the dataset folder will be processed regardless of their names, except for files whose names start with a dot (".").  The only dot file that is processed is the ".metadata" file.
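
For example, a dataset named "MyDataset" could be laid out like this (all folder and file names other than "Dataset" and ".metadata" are only illustrative):

Example dataset folder layout
<Saga working directory>/Dataset/MyDataset/
    .metadata
    dataset-01.json
    dataset-02.json
    dataset-03.json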

A dataset must have a ".metadata" file to define the information needed to process the dataset.

Example .metadata file
{
	"processFields" : ["title","content"],
	"splitRegex" : "\n"
}


  • processFields - Identifies the fields that will be processed.  In this example, only the "title" and "content" fields of each document are processed; any other fields are ignored.
  • splitRegex - Used by the Simple Reader Stage to break the text into text blocks (see the sketch after this list).
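
To make the effect of these settings concrete, here is a minimal Python sketch of how they might be applied to one document; this is an assumption about the behavior, not Saga's actual loader or Simple Reader Stage implementation.

Example applyMetadata.py (illustrative sketch)
import re

metadata = {
    "processFields": ["title", "content"],
    "splitRegex": "\n"
}

document = {"id": "A0003", "content": "Some text.\nMore text in a new line."}

# Only the fields listed in processFields are considered;
# splitRegex then breaks each field's text into text blocks.
for field in metadata["processFields"]:
    if field in document:
        blocks = re.split(metadata["splitRegex"], document[field])
        print(field, blocks)
# Prints: content ['Some text.', 'More text in a new line.']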

Create a new dataset

Creating a new dataset is straightforward: it only requires that you follow the format explained above.  The following are some recommendations and conventions worth following.

  1. Get data from your source and turn your entries into a JSON structure.
    1. Keep the ID field even if it is not processed, so you can go back and check the source.
    2. Try to include only the fields that need processing; larger files take longer to process.
    3. Each JSON document must be formatted on a single line (one document per line) with no line breaks in the JSON structure itself.  Line breaks inside the content must be included as encoded text ("\r\n"); see the sketch after this list.
  2. Use uniform, easily identifiable file names with counters, since the program reports problems with files by name.
  3. The name of the dataset folder is particularly important, as the UI will use it to refer to the dataset and work with it.
  4. There is no limit to the number of files in a dataset, nor to the file size.  We recommend multiple files instead of a single large one, since the process can use multiple threads to work on the files in parallel.
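
Regarding item 1.3 above, Python's json.dumps already escapes line breaks in the content, so each document stays on a single line:

Example singleLine.py
import json

document = {"id": "A0003", "content": "Some text.\r\nMore text in a new line."}

# json.dumps escapes the real line breaks as the text "\r\n",
# keeping the whole document on one line in the dataset file.
print(json.dumps(document))
# {"id": "A0003", "content": "Some text.\r\nMore text in a new line."}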


Example inputFile.csv
"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"

This CSV file can be read and converted to JSON output with the following Python script:

Example Python convertData.py
import json

# Constants to make everything easier
# Change the paths to match your environment
# The CSV file is separated by semicolons

CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# Maximum number of documents (lines) per output file
COMMENTS_PER_FILE = 14000


def write_file(path, documents):
    # Write one JSON document per line, as the dataset format requires
    print('creating file')
    with open(path, 'w', encoding='utf8') as f:
        for document in documents:
            json.dump(document, f)
            f.write("\n")


documents = []
cnt = 0
fileCnt = 1

with open(CSV_PATH, encoding='utf8') as fp:
    for line in fp:
        # Split the semicolon-separated row and strip the surrounding quotes
        row = [field.strip().strip('"') for field in line.split(';')]

        if len(row) >= 3:
            documents.append({
                'id': row[0],
                'title': row[1],
                'comment': row[2]
            })
            cnt += 1
            print(cnt)
        else:
            print("Missing fields: " + line)

        # Flush a full batch to its own output file
        if len(documents) == COMMENTS_PER_FILE:
            write_file(JSON_PATH.format(fileCnt), documents)
            documents = []
            fileCnt += 1

# Write any remaining documents to the last file
if documents:
    write_file(JSON_PATH.format(fileCnt), documents)


This would be the output file:

Example outFile-1.json
{"Id":"A0001","title":"This is a title","comment":"Some text"}
{"Id":"A0002","title":"This is a title","comment":"Some text, more text after a comma."}
{"Id":"A0003","title":"Title","comment":"Some text.\nMore text in a new line."}
{"Id":"A0004","title":"My title","comment":"Some text for a dummy ID"}
{"Id":"A0005","title":"JSON title","comment":"JSON documents"}
{"Id":"A0006","title":"JSON title","comment":"JSON documents"}
{"Id":"A0007","title":"JSON title","comment":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
