Excerpt

Creates a bag-of-words (BOW) or TF-IDF tag that holds the vector information for the document, text block, or sentence. The stage accumulates the vector until the engine cannot read any further.


Operates On: All lexical items.

Saga_is_recognizer
Recognizer: false

Warning

This stage is disabled in version 1.2.2

Include Page
Generic Configuration Parameters

Configuration Parameters

  • Parameter
    summary: JSON map resource in which the vocabulary is stored
    name: vocabulary
    required: true
  • Parameter
    summary: Type of algorithm to use when building the vector; can be either BOW or TF_IDF
    default: BOW
    name: vectorType
    required: true
  • Parameter
    summary: Dataset ID from which the vocabulary was extracted
    name: datasetId
    required: true
  • Parameter
    summary: Minimum number of tokens to match
    default: 1
    name: min
    type: integer
    required: true
  • Parameter
    summary: Maximum number of tokens to match
    default: 2
    name: max
    type: integer
    required: true
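
As a rough illustration of how min and max might bound the token runs that are compared against the vocabulary, the Python sketch below enumerates every run of consecutive tokens within those limits. The function name token_windows and the reading of min/max as n-gram lengths are assumptions made for this example; the page does not describe the stage's internal matching logic.

```python
def token_windows(tokens, min_tokens=1, max_tokens=2):
    """Return every run of min_tokens..max_tokens consecutive tokens.

    Hypothetical helper: it only illustrates one plausible reading of the
    min and max parameters, not the stage's actual implementation.
    """
    windows = []
    for size in range(min_tokens, max_tokens + 1):
        # Slide a window of the current size across the token list.
        for start in range(len(tokens) - size + 1):
            windows.append(" ".join(tokens[start:start + size]))
    return windows


print(token_windows(["the", "pilot", "landed"], min_tokens=1, max_tokens=2))
# ['the', 'pilot', 'landed', 'the pilot', 'pilot landed']
```

With the defaults (min 1, max 2) this yields single tokens and token pairs, which matches the kinds of lexical items shown in the example graph further down the page.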


Saga_config_stage
"vocabulary": "saga-provider:saga_vocabulary",
"vectorType": "BOW",
"datasetId": "dataset-234ifgbqafgoail3",
"min": 1,
"max": 3,

Example Output

In this example, the stage loads a predefined vocabulary and generates a vector for the sentence using BOW. The second graph shows the same sentence processed with TF_IDF.

Saga_graph
V---------------------------[The pilot landed safely the aircraft after gear failed when approaching the runaway.]----------------------------V 
^-[The]-V-[pilot]-V-[landed]-V-[safely]-V-[the]-V-[aircraft]-V-[after]-V-[gear]-V-[failed]-V-[when]-V-[approaching]-V-[the]-V---[runaway.]----^ 
^-[the]-^         ^---[landed safely]---^---[the aircraft]---^---[after gear]---^---[failed when]---^---[approaching the]---^-[runaway]-V-[.]-^ 
        ^---[pilot landed]---^---[safely the]---^---[aircraft after]---^---[gear failed]---^---[when approaching]---^       ^---[runaway .]---^ 
^---[The pilot]---^                                                                                                 ^-----[the runaway.]------^ 
^---[the pilot]---^                                                                                                 ^---[the runaway]---^ 
^-------------------------------------------------------------------[{BOW}]-------------------------------------------------------------------^ 


V---------------------------[The pilot landed safely the aircraft after gear failed when approaching the runaway.]----------------------------V 
^-[The]-V-[pilot]-V-[landed]-V-[safely]-V-[the]-V-[aircraft]-V-[after]-V-[gear]-V-[failed]-V-[when]-V-[approaching]-V-[the]-V---[runaway.]----^ 
^-[the]-^         ^---[landed safely]---^---[the aircraft]---^---[after gear]---^---[failed when]---^---[approaching the]---^-[runaway]-V-[.]-^ 
        ^---[pilot landed]---^---[safely the]---^---[aircraft after]---^---[gear failed]---^---[when approaching]---^       ^---[runaway .]---^ 
^---[The pilot]---^                                                                                                 ^-----[the runaway.]------^ 
^---[the pilot]---^                                                                                                 ^---[the runaway]---^ 
^-----------------------------------------------------------------[{TF_IDF}]------------------------------------------------------------------^ 
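
To make the difference between the two vector types concrete, here is a minimal Python sketch that builds a BOW vector and a TF_IDF vector from a small vocabulary. The vocabulary entries and their numbers are invented for illustration, and the weighting (term count multiplied by log(docCount / docsPerTerm)) is the textbook TF-IDF variant; the page does not state which formula the stage actually applies.

```python
import math
from collections import Counter

# Hypothetical vocabulary entries; only the field names come from this page.
VOCABULARY = [
    {"word": "pilot", "docsPerTerm": 10, "docCount": 100},
    {"word": "aircraft", "docsPerTerm": 25, "docCount": 100},
    {"word": "the aircraft", "docsPerTerm": 5, "docCount": 100},
]


def bow_vector(candidates, vocabulary):
    """One count per vocabulary entry: how often its word occurs in candidates."""
    counts = Counter(candidates)
    return [counts[entry["word"]] for entry in vocabulary]


def tf_idf_vector(candidates, vocabulary):
    """Scale each count by log(docCount / docsPerTerm), an assumed weighting."""
    counts = Counter(candidates)
    vector = []
    for entry in vocabulary:
        idf = math.log(entry["docCount"] / entry["docsPerTerm"])
        vector.append(counts[entry["word"]] * idf)
    return vector


# Token runs taken from a sentence; "landed" is not in the vocabulary.
candidates = ["pilot", "landed", "the aircraft", "aircraft", "pilot"]
print(bow_vector(candidates, VOCABULARY))  # [2, 1, 1]
print(tf_idf_vector(candidates, VOCABULARY))
```

Both vectors have one position per vocabulary entry. BOW records raw counts, while the TF_IDF variant gives more weight to words that occur in fewer documents of the dataset.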

Output Flags

Lex-Item Flags:

  • WEIGHT_VECTOR - Identifies the tag as a weight vector representation of a sentence
  • TOKEN - Identifies that the Lex-Items produced by this stage are tokens and not text blocks.

Vertex Flags:

Info

No vertices are created in this stage


Resource Data

Description of resource.

Resource Format

Saga_json
Title: Vocabulary Format
"count" : 15,
"docsPerTerm" : 15,
"datasetId" : "f92e1394-5f52-3331-aa6a-9c510ad31da5",
"tokenCount" : 1,
"docCount": 204021,
"word" : 0.95

. . . additional fields as needed go here . . . 
Note
  • Multiple entries can have the same pattern. If the pattern is matched, then it will be tagged with multiple (ambiguous) entry IDs.
  • Additional fielded data can be added to the record as needed by downstream processes.
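
A short Python sketch of how vocabulary records in this format could be read and indexed by word follows. The sample records, the list-of-records layout, and the idea of filtering by datasetId are all assumptions made for illustration; the page does not specify how the stage loads or filters the resource.

```python
import json

# Hypothetical records; only the field names are taken from this page.
RAW = """
[
  {"word": "pilot", "count": 15, "docsPerTerm": 15, "tokenCount": 1,
   "docCount": 204021, "datasetId": "dataset-a"},
  {"word": "the aircraft", "count": 7, "docsPerTerm": 6, "tokenCount": 2,
   "docCount": 204021, "datasetId": "dataset-a"},
  {"word": "pilot", "count": 3, "docsPerTerm": 2, "tokenCount": 1,
   "docCount": 500, "datasetId": "dataset-b"}
]
"""


def index_vocabulary(raw_json, dataset_id):
    """Map each word to its record, keeping only records from dataset_id."""
    index = {}
    for record in json.loads(raw_json):
        # Assumed behavior: entries from other datasets are ignored.
        if record["datasetId"] == dataset_id:
            index[record["word"]] = record
    return index


index = index_vocabulary(RAW, "dataset-a")
print(sorted(index))            # ['pilot', 'the aircraft']
print(index["pilot"]["count"])  # 15
```

Any additional fields on a record are carried along untouched in the index, so downstream code can still read them.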

Fields

  • Parameter
    summary: Number of times the word appeared
    name: count
    type: integer
    required: true
  • Parameter
    summary: Number of documents in which the word appeared
    name: docsPerTerm
    type: integer
    required: true
  • Parameter
    summary: Dataset ID from which the vocabulary was extracted
    name: datasetId
    required: true
  • Parameter
    summary: Number of tokens in the word
    name: tokenCount
    type: integer
    required: true
  • Parameter
    summary: Number of documents in the dataset
    name: docCount
    type: integer
    required: true
  • Parameter
    summary: Word of the vocabulary
    name: word
    required: true

Include Page
Generic Resource Fields