Identifies patterns with a combination of any number of specified tokens, regardless of the surrounding tokens.


Operates On: Lexical Items with TOKEN or SEMANTIC_TAG and other possible flags as specified below, but not on TEXT_BLOCK.

Saga_is_recognizer

Include Page
Generic Configuration Parameters


Configuration Parameters

  • patterns (required): The resource that contains the pattern database. See below for the format.

...

  • The tokens to process must lie between two vertices marked with these flags (e.g. ["TEXT_BLOCK_SPLIT"]).

...

  • Tokens marked with these flags are ignored by this stage; no processing is performed on them.

...

  • Tokens must have all of the specified flags in order to be processed.

...

  • Enables all debug log functionality of the stage, if any.
Info

In version 1.2.2 this parameter was added:

  • preferLarge (boolean, default: true): If true, the stage will prefer larger patterns.
Stage: Fragmentation
boundaryFlags: text block split
requiredFlags: token, semantic tag
skipFlags: skip
"patterns": "saga_provider:fragmented_patterns",
"maxRepeats": 5

Example Configuration

{
  "type": "Fragmentation",
  "patterns": "fragmented-provider:patterns",
  "boundaryFlags": ["TEXT_BLOCK_SPLIT"]
}
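
As a minimal sketch of the required and default semantics described above, the following Python snippet checks a stage configuration for the required `patterns` parameter and fills in the documented default for `preferLarge`. The helper name `normalize_stage_config` is hypothetical and is not part of Saga; it only illustrates how a consumer of such a configuration might treat these two parameters.

```python
import json

# Documented on this page: preferLarge defaults to true.
DEFAULTS = {"preferLarge": True}


def normalize_stage_config(raw: str) -> dict:
    """Parse a stage configuration and apply the documented defaults.

    Hypothetical helper for illustration only; not a Saga API.
    """
    config = json.loads(raw)
    if config.get("type") != "Fragmentation":
        raise ValueError("expected a Fragmentation stage configuration")
    if "patterns" not in config:
        # 'patterns' is the only parameter this page marks as required.
        raise ValueError("missing required parameter: patterns")
    # Explicit settings in the configuration win over the defaults.
    return {**DEFAULTS, **config}


example = """
{
  "type": "Fragmentation",
  "patterns": "saga_provider:fragmented_patterns",
  "boundaryFlags": ["TEXT_BLOCK_SPLIT"]
}
"""
config = normalize_stage_config(example)
```

Here `config` keeps the three settings from the example and gains `preferLarge` set to true, because the example does not override it.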

Example Output

Description

...

V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^

...

              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^

Output Flags

Lex-Item Flags:

  • SEMANTIC_TAG - Identifies all lexical items that are semantic tags.
  • FRAGMENT - Identifies all lexical items that were created from a fragmentation pattern.

Vertex Flags:

Info

No vertices are created in this stage.


Resource Data

The resource data is a database of fragmented patterns, and the resulting semantic tags they produce.

Resource Format

The only required file is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Entity JSON Format

"tag": "{city}",
"pattern": "(\"how many\"|\"how much\") {ingredient}",
"confAdjust": 0.95
. . . additional fields as needed go here . . .


"_id": "KGAAJGsBemSwA0nZTLXA",
"tag": ["recipe"],
"pattern": "{number} {ingredient}",
"options": {
  "minTokens": 3,
  "maxTokens": 2,
  "combination": true
},
"confAdjust": 0.95
. . . additional fields as needed go here . . .

Note
  • Multiple entries can have the same pattern. If the pattern is matched, then it will be tagged with multiple (ambiguous) entry IDs.
  • Additional fielded data can be added to the record, as needed by downstream processes.
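
The first point in the note can be illustrated with a small sketch, assuming the pattern database has already been loaded as a list of Python dictionaries. The field names `_id`, `tag`, and `pattern` come from the JSON format on this page; the `index_by_pattern` helper and the second sample record are invented for illustration and do not describe how the stage stores its data internally.

```python
from collections import defaultdict

# Sample records. The second record is made up to show the ambiguous case
# where two entries share one pattern.
entries = [
    {"_id": "KGAAJGsBemSwA0nZTLXA", "tag": ["recipe"],
     "pattern": "{number} {ingredient}"},
    {"_id": "example-id-2", "tag": ["shopping-item"],
     "pattern": "{number} {ingredient}"},
]


def index_by_pattern(records):
    """Group entry IDs by pattern so that one match can carry several IDs."""
    index = defaultdict(list)
    for record in records:
        index[record["pattern"]].append(record["_id"])
    return dict(index)


index = index_by_pattern(entries)
# Every entry ID that a match on this pattern would be tagged with.
matched_ids = index["{number} {ingredient}"]
```

A match on `{number} {ingredient}` would then carry both IDs, and a downstream process decides which interpretation to keep.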


Fields

...

  • Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
  • tag (required): Tag which will identify any match in the graph, as an interpretation.

...

    • These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
    • Tip: Tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area}
  • pattern (required): Pattern to match in the content.
    • Patterns will be tokenized and there may be multiple variations which can match.
      • NOTE: Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase.
      • TODO: This will need to be improved in the future, perhaps by specifying a pipeline to perform the tokenization and to allow for multiple variations.
  • options (json): Object with options applicable for this entity.
    • minTokens (integer): Minimum number of tokens the match must contain to be valid. The default is the number of tokens contained in each pattern.
    • maxTokens (integer): Maximum number of tokens the match must contain to be valid. The default is the number of tokens contained in each pattern.
    • combination (default: true): Indicates whether the given tokens can be matched in any order, as long as all appear in the match. If false, the tokens must be in the order provided.

...

  • This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.

Other, Optional Fields

Include Page
Generic Resource Fields
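
The pattern tokenization and the `minTokens`, `maxTokens`, and `combination` options described under Fields can be sketched as follows. This is a simplified reading of those descriptions, not the stage's actual implementation: the function names `tokenize` and `is_valid_match` are hypothetical, keeping a `{tag}` reference as a single token is an assumption, and the real stage works on lexical items in the interpretation graph rather than on plain lists of strings.

```python
import re
from collections import Counter

# Assumption: a {tag} reference stays one token; everything else is split
# on white-space and punctuation, then lowercased.
TOKEN_RE = re.compile(r"\{[^{}]+\}|\w+")


def tokenize(pattern):
    """Split a pattern into lowercase tokens."""
    return [token.lower() for token in TOKEN_RE.findall(pattern)]


def is_valid_match(pattern, matched, options=None):
    """Check a candidate list of matched tokens against the entry options."""
    options = options or {}
    expected = tokenize(pattern)
    # Both bounds default to the number of tokens in the pattern.
    min_tokens = options.get("minTokens", len(expected))
    max_tokens = options.get("maxTokens", len(expected))
    if not (min_tokens <= len(matched) <= max_tokens):
        return False
    # Reject tokens that are not in the pattern or are used too often.
    if Counter(matched) - Counter(expected):
        return False
    if options.get("combination", True):
        return True  # any order is acceptable
    # combination is false: tokens must follow the pattern order.
    remaining = iter(expected)
    return all(token in remaining for token in matched)


pattern = "{number} {ingredient}"
any_order = is_valid_match(pattern, ["{ingredient}", "{number}"])
in_order = is_valid_match(pattern, ["{ingredient}", "{number}"],
                          {"combination": False})
```

With the default options, the reversed token order is accepted (`any_order` is true), while setting `combination` to false rejects it (`in_order` is false).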