
A dataset is a file, or group of files, containing JSON documents, one per line. 

The content of each line of the file is a JSON document, but the file itself is a plain text file that is read line by line; it is NOT a JSON array.

Example dataset file content
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be process"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}


  • Datasets must be located under the "Dataset" folder in your Saga working directory.
  • All dataset files must be in the same folder and at the same level; no subfolders will be processed.
  • The folder name will be used as the dataset name in the user interface.
  • All files in the dataset folder will be processed regardless of their names, except those starting with a dot (".").  The only dot file that will be processed is the ".metadata" file, as the sketch below illustrates.
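
For illustration, here is a sketch of that selection rule in Python; the folder path is a made-up example, and this is not the actual loader code:

Example Python selectFiles.py
import os

# Hypothetical dataset folder; "MyDataset" is just an example name
DATASET_FOLDER = os.path.join('Dataset', 'MyDataset')

for name in sorted(os.listdir(DATASET_FOLDER)):
    path = os.path.join(DATASET_FOLDER, name)
    if not os.path.isfile(path):
        continue  # subfolders are not processed
    if name.startswith('.') and name != '.metadata':
        continue  # dot files are skipped, except ".metadata"
    print('would process:', name)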

A dataset must have a ".metadata" file that defines the information needed to process the dataset.

Example .metadata file
{
	"processFields" : ["title","content"],
	"splitRegex" : "\n"
}


  • processFields - Identifies the fields that need to be processed.  In this example, whenever the dataset loader finds a "title" or "content" field, those are the only ones that will be processed.
  • splitRegex - Used by the Simple Reader Stage to break the text into text blocks, as sketched below.
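
As an illustration, assuming the split behaves like a standard regular-expression split (this is a sketch, not the Simple Reader Stage implementation), this applies the example .metadata file to one of the documents above:

Example Python splitBlocks.py
import re

# Metadata and document taken from the examples above
metadata = {"processFields": ["title", "content"], "splitRegex": "\n"}
document = {"id": "A0003", "content": "Some text.\nMore text in a new line."}

# Only the fields listed in processFields are considered; each one is
# broken into text blocks using splitRegex
for field in metadata["processFields"]:
    if field in document:
        blocks = re.split(metadata["splitRegex"], document[field])
        print(field, blocks)

# Prints: content ['Some text.', 'More text in a new line.']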

Create a new dataset

Creating a new dataset is a straightforward process that only requires you to follow the format explained above.  These are some recommendations and standards that are worth following.

  1. Get data from your source and turn your entries into JSON structures.  
    1. Keep the ID field, even if it is not processed, to be able to go back and check the source. 
    2. Try to keep only the fields that need processing; larger files take longer to process.
    3. All JSON documents must be formatted into a single line (one document per line), with no line breaks in the structure (the content may include line breaks as encoded text - "\r\n").  A validation sketch follows this list.
  2. Try to use uniform, easy-to-identify file names with counters, as the program will point out problems with the files by name.
  3. The name of the dataset folder is especially important, as it will be used in the UI to refer to the dataset and work with it.
  4. There is no limit to the number of files in a dataset, nor to the file size, but we recommend having multiple files instead of a single large one, as the process can use multiple threads to work on the files in parallel.
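
Before loading a dataset, it can be worth checking that every line of each file parses as a standalone JSON document. A minimal validation sketch (the file name is an example):

Example Python validateDataset.py
import json

# Check that every non-empty line of a dataset file parses as a JSON
# document before handing the dataset to the loader
def validate(path):
    ok = True
    with open(path, encoding='utf8') as fp:
        for number, line in enumerate(fp, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                ok = False
                print('{} line {}: {}'.format(path, number, err))
    return ok

# "outFile-1.json" is an example file name
validate('outFile-1.json')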


Example inputFile.csv
"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"

This CSV file can be read and converted to JSON output with the following Python script:

Example Python convertData.py
import json

# Constants to make everything easier
# Change the paths
# The CSV is separated by semicolons; the script splits on ';', so the
# values themselves must not contain semicolons (Python's csv module
# would be more robust for that case)

CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# this is the file size in lines
COMMENTS_PER_FILE = 14000


def writeFile(documents, fileNumber):
    # Write one dataset file: one JSON document per line, not a JSON array
    with open(JSON_PATH.format(fileNumber), 'w', encoding='utf8') as f:
        print('creating file')
        for jcomment in documents:
            json.dump(jcomment, f)
            f.write("\n")


jsonfile = []

with open(CSV_PATH, encoding='utf8') as fp:
    cnt = 1
    fileCnt = 1
    for line in fp:
        if not line.strip():
            continue  # skip blank lines

        # Split on the separator and drop the surrounding quotes
        row = [field.strip().strip('"') for field in line.split(';')]

        if len(row) >= 3:
            comment = {
                'id': row[0],
                'title': row[1],
                'comment': row[2]
            }
            jsonfile.append(comment)
            print(cnt)
        else:
            print("Missing fields " + line)

        if cnt == COMMENTS_PER_FILE:
            writeFile(jsonfile, fileCnt)
            del jsonfile[:]
            fileCnt += 1
            cnt = 0

        cnt += 1

# Write the remaining documents to a final file
if jsonfile:
    writeFile(jsonfile, fileCnt)


This would be the output file:

Example outFile-1.json
{"Id":"A0001","title":"This is a title","comment":"Some text"}
{"Id":"A0002","title":"This is a title","comment":"Some text, more text after a comma."}
{"Id":"A0003","title":"Title","comment":"Some text.\nMore text in a new line."}
{"Id":"A0004","title":"My title","comment":"Some text for a dummy ID"}
{"Id":"A0005","title":"JSON title","comment":"JSON documents"}
{"Id":"A0006","title":"JSON title","comment":"JSON documents"}
{"Id":"A0007","title":"JSON title","comment":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
