Creates a bag-of-words (BOW) or TF-IDF tag that holds the vector information for the document, text block, or sentence. The stage accumulates the vector until the engine cannot read any further.


Operates On: All lexical items.

Saga_is_recognizer
Recognizer: false

Include Page
Generic Configuration Parameters

Configuration Parameters

  • vocabulary (required) - JSON map resource in which the vocabulary is stored.
  • vectorType (required, default: BOW) - Type of algorithm to use when building the vector. Can be either BOW or TF_IDF.
  • datasetId (required) - Dataset ID from which the vocabulary was extracted.
  • min (integer, required, default: 1) - Minimum number of tokens to match.
  • max (integer, required, default: 2) - Maximum number of tokens to match.
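
The effect of min and max can be illustrated with a short sketch. It assumes that the stage considers every contiguous run of tokens whose length falls between the two values, which is consistent with the single-token and two-token items in the example graph on this page but is not confirmed by it. The function name ngram_candidates and its arguments are hypothetical and are not part of the Saga API.

```python
from typing import List


def ngram_candidates(tokens: List[str], min_tokens: int = 1, max_tokens: int = 2) -> List[str]:
    """Return every contiguous run of min_tokens..max_tokens tokens, joined by spaces.

    Hypothetical helper: it mirrors the assumed meaning of the min/max
    parameters and is not the stage's actual implementation.
    """
    if min_tokens < 1 or max_tokens < min_tokens:
        raise ValueError("expected 1 <= min_tokens <= max_tokens")
    candidates = []
    for start in range(len(tokens)):
        for size in range(min_tokens, max_tokens + 1):
            end = start + size
            if end > len(tokens):
                break  # longer runs would go past the last token
            candidates.append(" ".join(tokens[start:end]))
    return candidates


print(ngram_candidates(["gear", "failed", "when"], 1, 2))
# ['gear', 'gear failed', 'failed', 'failed when', 'when']
```

With min set to 1 and max set to 3, as in the sample configuration, the same helper would also return three-token runs such as "gear failed when".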


Saga_config_stage
"vocabulary": "saga-provider:saga_vocabulary",
"vectorType": "BOW",
"datasetId": "dataset-234ifgbqafgoail3",
"min": 1,
"max": 3,

Example Output

In this example, the stage loads a predefined vocabulary and generates a vector for the sentence using BOW. The second graph shows the same sentence processed with TF_IDF.

Saga_graph
V---------------------------[The pilot landed safely the aircraft after gear failed when approaching the runaway.]----------------------------V 
^-[The]-V-[pilot]-V-[landed]-V-[safely]-V-[the]-V-[aircraft]-V-[after]-V-[gear]-V-[failed]-V-[when]-V-[approaching]-V-[the]-V---[runaway.]----^ 
^-[the]-^         ^---[landed safely]---^---[the aircraft]---^---[after gear]---^---[failed when]---^---[approaching the]---^-[runaway]-V-[.]-^ 
        ^---[pilot landed]---^---[safely the]---^---[aircraft after]---^---[gear failed]---^---[when approaching]---^       ^---[runaway .]---^ 
^---[The pilot]---^                                                                                                 ^-----[the runaway.]------^ 
^---[the pilot]---^                                                                                                 ^---[the runaway]---^ 
^-------------------------------------------------------------------[{BOW}]-------------------------------------------------------------------^ 


V---------------------------[The pilot landed safely the aircraft after gear failed when approaching the runaway.]----------------------------V 
^-[The]-V-[pilot]-V-[landed]-V-[safely]-V-[the]-V-[aircraft]-V-[after]-V-[gear]-V-[failed]-V-[when]-V-[approaching]-V-[the]-V---[runaway.]----^ 
^-[the]-^         ^---[landed safely]---^---[the aircraft]---^---[after gear]---^---[failed when]---^---[approaching the]---^-[runaway]-V-[.]-^ 
        ^---[pilot landed]---^---[safely the]---^---[aircraft after]---^---[gear failed]---^---[when approaching]---^       ^---[runaway .]---^ 
^---[The pilot]---^                                                                                                 ^-----[the runaway.]------^ 
^---[the pilot]---^                                                                                                 ^---[the runaway]---^ 
^-----------------------------------------------------------------[{TF_IDF}]------------------------------------------------------------------^ 
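
As a rough illustration of what a {BOW} tag might carry, the sketch below counts how often each vocabulary entry occurs among the lexical items of a sentence. This page does not describe the actual layout of the vector, so the dictionary form used here is an assumption, and bow_vector is a hypothetical name rather than a Saga function. The candidates are assumed to be normalized (lower-cased) already.

```python
from collections import Counter
from typing import Dict, List


def bow_vector(candidates: List[str], vocabulary: List[str]) -> Dict[str, int]:
    """Count occurrences of each vocabulary entry among the candidate items.

    Hypothetical sketch: entries that never occur are left out, so the
    result is a sparse bag-of-words representation.
    """
    counts = Counter(candidates)
    return {word: counts[word] for word in vocabulary if counts[word] > 0}


# A few normalized items taken from the example sentence.
items = ["the", "pilot", "the pilot", "landed", "the", "aircraft", "the aircraft"]
# Made-up vocabulary, for illustration only.
vocabulary = ["the", "pilot", "the aircraft", "gear failed"]

print(bow_vector(items, vocabulary))
# {'the': 2, 'pilot': 1, 'the aircraft': 1}
```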

Output Flags

Lex-Item Flags:

  • WEIGHT_VECTOR - Identifies the tag as a weight vector representation of a sentence
  • TOKEN - Identifies that the Lex-Items produced by this stage are tokens and not text blocks.

Vertex Flags:

No vertices are created in this stage.


Resource Data

Description of resource.

Resource Format

Saga_json
Title: Vocabulary Format
"count" : 15,
"docsPerTerm" : 15,
"datasetId" : "f92e1394-5f52-3331-aa6a-9c510ad31da5",
"tokenCount" : 1,
"docCount" : 204021,
"word" : "depict"


Fields

  • count (integer, required) - Number of times the word appeared.
  • docsPerTerm (integer, required) - Number of documents in which the word appeared.
  • datasetId (required) - Dataset ID from which the vocabulary was extracted.
  • tokenCount (integer, required) - Number of tokens in the word.
  • docCount (integer, required) - Number of documents in the dataset.
  • word (required) - Word of the vocabulary.
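
The docsPerTerm and docCount fields are the quantities a TF-IDF weight is normally built from. The sketch below uses the common unsmoothed form, term frequency multiplied by ln(docCount / docsPerTerm). This page does not state which TF-IDF variant the stage applies, so the formula is an assumption, and tf_idf_weight is a hypothetical helper rather than part of Saga.

```python
import math
from typing import Any, Dict


def tf_idf_weight(term_frequency: int, entry: Dict[str, Any]) -> float:
    """Weight one term using a vocabulary entry (assumed unsmoothed TF-IDF)."""
    docs_per_term = entry["docsPerTerm"]
    doc_count = entry["docCount"]
    if docs_per_term <= 0 or doc_count <= 0:
        raise ValueError("docsPerTerm and docCount must be positive")
    # Inverse document frequency: rarer words receive a larger value.
    idf = math.log(doc_count / docs_per_term)
    return term_frequency * idf


# Vocabulary entry with the values from the sample resource above.
entry = {
    "count": 15,
    "docsPerTerm": 15,
    "datasetId": "f92e1394-5f52-3331-aa6a-9c510ad31da5",
    "tokenCount": 1,
    "docCount": 204021,
    "word": "depict",
}

# Weight of "depict" when it occurs once in the text being processed.
print(tf_idf_weight(1, entry))
```

Under this formula, a word that appears in every document of the dataset gets a weight of 0, which is why some implementations add smoothing terms.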