A dataset is a file, or group of files, containing JSON documents, one per line.

Note
The content of each line of the file is a JSON document. However, the file itself is a plain text file read line by line and is NOT a JSON array.

Example dataset file content:

{"id":"A0001","title":"This is a title","content":"Some text."}
{"id":"A0002","title":"This is a title","content":"Some text, more text after a comma.","non-import-text":"This will not be processed"}
{"id":"A0003","content":"Some text.\nMore text in a new line."}
{"id":"NON-IMPORTANT-ID","content":"Only the field configured on the metadata file will be used"}
{"id":"AAAA","content":"Some text for a dummy ID"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"JSON documents"}
{"id":"AAAA","content":"^^^^^^^^^^^^^^ JSON document does not need to be unique"}
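As a minimal sketch of what "one JSON document per line" means in practice, the following Python snippet parses such content line by line with the standard json module. The function name read_dataset_lines is illustrative only and is not part of Saga; skipping blank lines is an assumption of this sketch.

```python
import json


def read_dataset_lines(lines):
    """Parse an iterable of text lines, one JSON document per line."""
    documents = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption of this sketch: blank lines are ignored
        try:
            documents.append(json.loads(line))
        except json.JSONDecodeError as error:
            raise ValueError(f"line {number} is not valid JSON: {error}") from error
    return documents


# Two lines taken from the example dataset content
sample = [
    '{"id":"A0001","title":"This is a title","content":"Some text."}',
    '{"id":"A0003","content":"Some text.\\nMore text in a new line."}',
]
docs = read_dataset_lines(sample)
print(len(docs), "documents read")
```

Parsing the whole file at once with json.loads would fail, because the file as a whole is not a JSON array.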

Datasets must be located under the "Dataset" folder in your Saga working directory.
All dataset files should be under the same folder and at the same level. No subfolders will be processed.
The folder name will be used as the dataset name in the user interface.
All files under the dataset folder will be processed regardless of their names, except for the ones starting with a dot ("."). The only file starting with a dot that will be processed is the ".metadata" file.
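The file selection rules above can be sketched as follows. This is only an illustration of the rules as described, not Saga's actual loader; the function name dataset_files is hypothetical.

```python
from pathlib import Path


def dataset_files(dataset_dir):
    """Return (data_files, metadata_file) for one dataset folder."""
    folder = Path(dataset_dir)
    data_files = []
    metadata_file = None
    for entry in sorted(folder.iterdir()):
        if not entry.is_file():
            continue  # subfolders are not processed
        if entry.name == ".metadata":
            metadata_file = entry  # the only dot file that is used
        elif not entry.name.startswith("."):
            data_files.append(entry)  # any other name is accepted
    return data_files, metadata_file
```

Running it against a dataset folder before loading gives a quick preview of which files would be picked up and whether the ".metadata" file is present.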

A dataset must have a ".metadata" file used to define information needed to process the dataset.

Example .metadata file:

{
	"processFields" : ["title","content"],
	"splitRegex" : "\n"
}

processFields - Identifies the fields that need to be processed. In this example, when the dataset loader finds "title" or "content" fields, those are the only ones that will be processed.
splitRegex - Used by the Simple Reader Stage to break the text into text blocks.
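To illustrate the effect of these two settings, here is a hedged Python sketch that keeps only the configured fields of a document and splits their text with the regular expression. How Saga applies them internally is not described on this page, so the function text_blocks is hypothetical and only mirrors the description above.

```python
import json
import re

# Same content as the example .metadata file
metadata = json.loads('{"processFields": ["title", "content"], "splitRegex": "\\n"}')


def text_blocks(document, metadata):
    """Return the text blocks of the configured fields of one document."""
    blocks = []
    for field in metadata["processFields"]:
        if field in document:
            # Fields not listed in processFields are never looked at
            blocks.extend(re.split(metadata["splitRegex"], document[field]))
    return blocks


document = {
    "id": "A0003",
    "title": "Title",
    "content": "Some text.\nMore text in a new line.",
    "non-import-text": "This will not be processed",
}
print(text_blocks(document, metadata))
# prints ['Title', 'Some text.', 'More text in a new line.']
```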

Create a new dataset

Creating a new dataset is a straightforward process that only requires that you maintain the format explained above. The following are some recommendations and standards that would be nice to follow.

Get data from your source and turn your entries into a JSON structure.
1. Keep the ID field, even if it is not processed, to be able to go back and check the source.
2. Try to get only the fields that need processing; larger files take longer to process.
3. All JSON documents must be formatted into a single line (one document per line) with no line breaks in the structure. The content may include them as encoded text ("\r\n").
Try to use uniform, easily identifiable names with counters for the files, as the program will point out problems with the files by name.
The name of the dataset folder is particularly important as it will be used by the UI to refer to the dataset and work with it.
There is no limit to the number of files in a dataset. There is also no limit to the file size. We recommend having multiple files instead of a single large one since the process can use multiple threads to work on the files in parallel.
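Before loading a new dataset, the recommendations above can be checked with a small script. The following is a minimal sketch, assuming the ID field is named "id" as in the examples; check_dataset_file is a hypothetical helper, not a Saga tool.

```python
import json


def check_dataset_file(path, id_field="id"):
    """Return a list of problems found in one dataset file."""
    problems = []
    with open(path, encoding="utf8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                document = json.loads(line)
            except json.JSONDecodeError:
                # Also catches documents that span more than one line
                problems.append(f"{path}:{number}: not a valid JSON document")
                continue
            if not isinstance(document, dict):
                problems.append(f"{path}:{number}: not a JSON object")
            elif id_field not in document:
                problems.append(f"{path}:{number}: missing '{id_field}' field")
    return problems
```

Each problem is reported with the file name and line number, which is easier to act on when the files follow a uniform naming scheme.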

As an example, consider the following semicolon-separated source file, inputFile.csv:

"A0001";"This is a title";"Some text"
"A0002";"This is a title";"Some text, more text after a comma."
"A0003";"Title";"Some text.\nMore text in a new line."
"A0004";"My title";"Some text for a dummy ID"
"A0005";"JSON title";"JSON documents"
"A0006";"JSON title";"JSON documents"
"A0007";"JSON title";"^^^^^^^^^^^^^^ JSON documents does not need to be unique"

This CSV file can be read and converted to JSON output by the following Python script (convertData.py):

import csv
import json

# Constants to make everything easier
# Change the paths to match your environment
# The csv is separated by semicolons

CSV_PATH = 'C:\\inputFile.csv'
JSON_PATH = 'C:\\outFile-{}.json'
# Maximum number of JSON documents (lines) per output file
COMMENTS_PER_FILE = 14000


def write_file(path, documents):
    # Write one compact JSON document per line
    print('creating file ' + path)
    with open(path, 'w', encoding='utf8') as f:
        for document in documents:
            json.dump(document, f, separators=(',', ':'))
            f.write('\n')


jsonfile = []
fileCnt = 1

with open(CSV_PATH, encoding='utf8', newline='') as fp:
    # csv.reader handles the semicolon separator and removes the quotes
    for row in csv.reader(fp, delimiter=';'):
        if len(row) >= 3:
            jsonfile.append({
                'id': row[0].strip(),
                'title': row[1].strip(),
                # The csv stores line breaks as the two characters "\n";
                # turn them into real line breaks so JSON encodes them as \n
                'comment': row[2].strip().replace('\\n', '\n')
            })
        else:
            print('Missing fields: ' + ';'.join(row))

        # Start a new output file once the current batch is full
        if len(jsonfile) == COMMENTS_PER_FILE:
            write_file(JSON_PATH.format(fileCnt), jsonfile)
            jsonfile = []
            fileCnt += 1

# Write the remaining documents, if any
if jsonfile:
    write_file(JSON_PATH.format(fileCnt), jsonfile)

This would be the output file (outFile-1.json):

{"id":"A0001","title":"This is a title","comment":"Some text"}
{"id":"A0002","title":"This is a title","comment":"Some text, more text after a comma."}
{"id":"A0003","title":"Title","comment":"Some text.\nMore text in a new line."}
{"id":"A0004","title":"My title","comment":"Some text for a dummy ID"}
{"id":"A0005","title":"JSON title","comment":"JSON documents"}
{"id":"A0006","title":"JSON title","comment":"JSON documents"}
{"id":"A0007","title":"JSON title","comment":"^^^^^^^^^^^^^^ JSON documents does not need to be unique"}
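Note that this output stores the text in a "comment" field rather than "content", so the ".metadata" file of this dataset would need to list "comment" in processFields. A minimal sketch for generating that file is shown below; write_metadata is a hypothetical helper and the folder in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def write_metadata(dataset_dir, process_fields, split_regex="\n"):
    """Create the .metadata file of a dataset folder and return its path."""
    metadata = {
        "processFields": list(process_fields),
        "splitRegex": split_regex,
    }
    path = Path(dataset_dir) / ".metadata"
    path.write_text(json.dumps(metadata, indent=4), encoding="utf8")
    return path


# Usage (placeholder folder name, adjust to your Saga working directory):
# write_metadata("Dataset/comments", ["title", "comment"])
```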
