A dataset is a file, or group of files, containing JSON documents, one per line.


Note

The content of each line of the file is a JSON document. However, the file itself is a plain text file read line by line and is NOT a JSON array.

Example dataset file content:
{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ JSON documents do not need to be unique"}
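
Because each line is an independent JSON document, a reader parses the file one line at a time instead of passing the whole file to a JSON parser. The following is a minimal Python sketch of that idea. The helper name `read_dataset_file` is hypothetical and skipping blank lines is an assumption; this is not Saga's own loader.

```python
import json


def read_dataset_file(path):
    """Yield one dict per non-empty line of a dataset file.

    Hypothetical helper for illustration; not part of Saga.
    """
    with open(path, encoding="utf8") as fp:
        for line_number, line in enumerate(fp, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are skipped
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the file and line so the bad document is easy to find
                raise ValueError(f"{path}, line {line_number}: {err}") from err
```

Calling `json.load` on a file with more than one line would fail, since the file as a whole is not a single JSON value.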


  • Datasets must be located under the "Dataset" folder in your Saga working directory.
  • All dataset files should be under the same folder and at the same level. No subfolders will be processed.
  • The folder name will be used as the dataset name in the user interface.
  • All files under the dataset folder will be processed regardless of their names, except for the ones starting with a dot ("."). The only file starting with a dot that will be processed is the ".metadata" file.
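
The file selection rules above can be sketched in Python as follows. This is only an illustration of the rules as stated, not Saga's implementation. The function name `files_to_process` is hypothetical, and the ".metadata" file is left out of the result because it holds configuration rather than documents.

```python
from pathlib import Path


def files_to_process(dataset_dir):
    """Return the data files directly inside dataset_dir, sorted by name.

    Hypothetical helper illustrating the folder rules; not Saga's code.
    """
    selected = []
    for entry in Path(dataset_dir).iterdir():
        if entry.is_dir():
            continue  # subfolders are not processed
        if entry.name.startswith("."):
            continue  # dot-files are skipped; ".metadata" is read separately
        selected.append(entry)
    return sorted(selected, key=lambda p: p.name)
```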

A dataset must have a ".metadata" file used to define information needed to process the dataset.

Example .metadata file:
{
	"processFields" : ["title","content"],
	"splitRegex" : "\n"
}


  • processFields - Identifies the fields that need to be processed. In this example, when the dataset loader finds "title" or "content" fields, those are the only ones that will be processed.
  • splitRegex - Used by the Simple Reader Stage to break the text into text blocks.
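
To make the two settings concrete, the sketch below applies a metadata dictionary to one JSON document: it keeps only the fields named in processFields and splits each value with splitRegex. How the Simple Reader Stage applies the expression internally is not described on this page, so the use of Python's `re.split`, the dropping of empty blocks, and the helper name `text_blocks` are all assumptions made for illustration.

```python
import json
import re


def text_blocks(document, metadata):
    """Return {field: [blocks]} for the configured fields present in document.

    Hypothetical helper; the real splitting is done by the Simple Reader Stage.
    """
    splitter = re.compile(metadata["splitRegex"])
    blocks = {}
    for field in metadata["processFields"]:
        if field in document:
            # Assumption: empty pieces from consecutive separators are dropped
            blocks[field] = [part for part in splitter.split(document[field]) if part]
    return blocks


metadata = json.loads('{"processFields": ["title", "content"], "splitRegex": "\\n"}')
document = json.loads('{"id": "A0003", "content": "Some text.\\nMore text in a new line."}')
print(text_blocks(document, metadata))
```

With this input, the "id" field is ignored and the "content" value is split into two blocks: "Some text." and "More text in a new line.".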

Create a new dataset

Creating a new dataset is a straightforward process that requires that you maintain the format explained above. The following are some recommendations and standards that would be nice to follow.

  1. Get data from your source and turn your entries into a JSON structure.
    1. Keep the ID field even if it is not processed, to be able to go back and check the source.
    2. Try to get only the fields that need processing; larger files take longer to process.
    3. All JSON documents must be formatted into a single line (one document per line), with no line breaks in the structure. The content may include them as encoded text ("\r\n").
  2. Try to use uniform, easily identifiable names with counters for the files, as the program will point out problems with the files by name.
  3. The name of the dataset folder is particularly important, as it will be used by the UI to refer to the dataset and work with it.
  4. There is no limit to the number of files in a dataset. There is also no limit to the file size. We recommend having multiple files instead of a single large one, since the process can use multiple threads to work on the files in parallel.
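
Before loading a new dataset, it can help to check the files against these recommendations. The following Python sketch reports every line that is not a single JSON object or that lacks an ID field, naming the file and line. The function name `check_dataset` and the default field name "id" are assumptions for illustration; Saga performs its own checks when it loads the dataset.

```python
import json
from pathlib import Path


def check_dataset(dataset_dir, id_field="id"):
    """Return a list of problem descriptions, each naming the file and line.

    Hypothetical pre-flight check; not part of Saga.
    """
    problems = []
    for path in sorted(Path(dataset_dir).iterdir()):
        if path.is_dir() or path.name.startswith("."):
            continue  # subfolders and dot-files are not data files
        with open(path, encoding="utf8") as fp:
            for number, line in enumerate(fp, start=1):
                if not line.strip():
                    continue  # assumption: blank lines are ignored
                try:
                    document = json.loads(line)
                except json.JSONDecodeError as err:
                    problems.append(f"{path.name}:{number}: invalid JSON ({err.msg})")
                    continue
                if not isinstance(document, dict):
                    problems.append(f"{path.name}:{number}: not a JSON object")
                elif id_field not in document:
                    problems.append(f"{path.name}:{number}: missing '{id_field}' field")
    return problems
```

An empty list means every line in every data file parsed as a JSON object with an ID field.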

Example inputFile.csv:
"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"

This CSV file can be read and converted to JSON output by the following Python script:

Example Python convertData.py:
import csv
import json

# Constants to make everything easier
# Change the paths
# The csv is separated by semicolons

CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# Maximum number of JSON documents (lines) per output file
COMMENTS_PER_FILE = 14000


def write_file(documents, file_number):
    # Write one JSON document per line, with no spaces around separators
    with open(JSON_PATH.format(file_number), 'w', encoding='utf8') as f:
        print('creating file')
        for document in documents:
            json.dump(document, f, separators=(',', ':'))
            f.write("\n")


jsonfile = []
fileCnt = 1

# newline='' lets the csv module handle quoted fields correctly
with open(CSV_PATH, encoding='utf8', newline='') as fp:
    # csv.reader removes the surrounding quotes from each field
    for row in csv.reader(fp, delimiter=';'):
        if len(row) >= 3:
            jsonfile.append({
                'id': row[0].strip(),
                'title': row[1].strip(),
                'comment': row[2].strip()
            })
        else:
            print("Missing fields: " + ';'.join(row))

        # Write a full batch to its own file and start a new one
        if len(jsonfile) == COMMENTS_PER_FILE:
            write_file(jsonfile, fileCnt)
            jsonfile = []
            fileCnt += 1

# Write the remaining documents, if any
if jsonfile:
    write_file(jsonfile, fileCnt)


This would be the output file:

Example outFile-1.json:
{"id":"A0001","title":"This is a title","comment":"Some text"}
{"id":"A0002","title":"This is a title","comment":"Some text, more text after a comma."}
{"id":"A0003","title":"Title","comment":"Some text.\nMore text in a new line."}
{"id":"A0004","title":"My title","comment":"Some text for a dummy ID"}
{"id":"A0005","title":"JSON title","comment":"JSON documents"}
{"id":"A0006","title":"JSON title","comment":"JSON documents"}
{"id":"A0007","title":"JSON title","comment":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}