
Warning
This stage is a work in progress; expect things to break while using it.

Excerpt
This stage reviews tokens using the Elasticsearch suggestions functionality and creates a new token with a "suggestion" for each word it does not recognize.

The process takes all the tokens available to the stage (usually already tokenized by the "WhitespaceTokenizerStage"), using the highest confidence route. Flags such as "STOP_WORD" or "ALL_UPPER_CASE" can be used as filters by including them in the "Skip Flags" list; tokens carrying any of those flags are not checked.
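The skip-flag filtering described above can be pictured with a minimal Python sketch. The `Token` class and the `tokens_to_check` function are hypothetical names introduced only for illustration; they are not part of the Saga API, and the real stage works on lexical items in the interpretation graph rather than on a flat list.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in for a lexical item: its text plus its flags."""
    text: str
    flags: set = field(default_factory=set)


def tokens_to_check(tokens, skip_flags):
    """Keep items flagged TOKEN that carry none of the skip flags."""
    skip = set(skip_flags)
    return [t for t in tokens if "TOKEN" in t.flags and not (t.flags & skip)]


tokens = [
    Token("NASA", {"TOKEN", "ALL_UPPER_CASE"}),
    Token("likes", {"TOKEN", "ALL_LOWER_CASE"}),
    Token("and", {"TOKEN", "STOP_WORD"}),
    Token("makaroni", {"TOKEN", "ALL_LOWER_CASE"}),
]

# With STOP_WORD and ALL_UPPER_CASE as skip flags, only two tokens remain.
checked = tokens_to_check(tokens, ["STOP_WORD", "ALL_UPPER_CASE"])
print([t.text for t in checked])
```

With an empty skip list, every item flagged TOKEN would be passed on for spell checking.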

Operates On: Lexical Items with TOKEN and possibly other flags as specified below.

Saga_is_recognizer

Note
This recognizer requires a dictionary to work, so it must be loaded either from a dataset or a file before using it. Validate your Elasticsearch version to ensure this stage is compatible.

Include Page

	Generic Configuration Parameters

Configuration Parameters

Parameter
summary Index used by the stage to store dictionary data.
default spellcheck_dictionary
name index
- This is an Elasticsearch index.
Parameter
summary Schema (protocol) used by the Elasticsearch connection
default http
name schema
Parameter
summary Hostname
default localhost
name host
Parameter
Explanation
summary Port
default 9200
name port
type integer

Saga_config_stage

boundaryFlags	text block split

"index": "saga_spellchecker_dictionary",
"schema": "http",
"host": "localhost",
"port": "9200"
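As a rough illustration of how these four parameters fit together, the Python sketch below assembles the search URL for the dictionary index and builds a request body in the shape of Elasticsearch's term suggester. Whether the stage uses the term suggester internally is not stated on this page, so treat that as an assumption. The field name `term` and both function names are also hypothetical. The example only builds strings and dictionaries; it makes no network call.

```python
import json

# Values copied from the example configuration of this stage.
config = {
    "index": "saga_spellchecker_dictionary",
    "schema": "http",
    "host": "localhost",
    "port": "9200",
}


def search_url(cfg):
    """Combine schema, host, port and index into the index's _search URL."""
    return f'{cfg["schema"]}://{cfg["host"]}:{int(cfg["port"])}/{cfg["index"]}/_search'


def term_suggest_body(text, field_name="term"):
    """Build a term-suggester request body (field name is an assumption)."""
    return {"suggest": {"spellcheck": {"text": text, "term": {"field": field_name}}}}


print(search_url(config))
print(json.dumps(term_suggest_body("makaroni")))
```

Checking that this URL is reachable from the Saga host is a quick way to validate the schema, host, and port values before loading a dictionary.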

Example Output

Saga_graph

V-----------------[abraham lincoln likes makaroni and cheese]-----------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[makaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------[macaroni]--^

Output Flags

Lex-Item Flags:

MISSPELL - Identifies a token as a potential misspelling.
SUGGESTION - Added to the newly created token to identify it as a generated token and coming from the dictionary.
SEMANTIC_TAG - Identifies all lexical items which are semantic tags.
PROCESSED - Placed on all the tokens which composed the semantic tag.
ALL_LOWER_CASE - All of the characters in the token are lower-case characters.
ALL_UPPER_CASE - All of the characters in the token are upper-case characters (for example, acronyms).
ALL_DIGITS - All of the characters in the token are digits (0-9)
TITLE_CASE - The first character is upper case, all of the other characters are lower case.
MIXED_CASE - Handles any mixed upper & lower case scenario not covered above.
TOKEN - All tokens produced are tagged as TOKEN
CHAR_CHANGE - Identifies the vertex as a change between character formats
HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT
HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION. (ALL_PUNCTUATION will not be tagged as HAS_PUNCTUATION)
LEMMATIZE - All words retrieved will be marked as LEMMATIZE
NUMBER - Flagged on all tokens which are numbers according to the rules above.
TEXT_BLOCK - Flags all text blocks.
STOP_WORD - All matched stop words will be marked as STOP_WORD
WEIGHT_VECTOR - Identifies the tag as a weight vector representation of a sentence
BANK - Identifies a Bank account number.
ABA - Account number with ABA format.
BIC - Account number with BIC format.
IBAN - Account number with IBAN format.
ORIGINAL - Identifies that the Lex-Items produced by this stage are the original, as written, representation of every token (e.g. before normalization)
SSN - Identifies a Federal ID number
GEONAME - Identifies a geographical location name

Vertex Flags:

Info
No vertices are created in this stage

ALL_PUNCTUATION - Identifies the vertex as all punctuation
- The default flag if no "splitFlag" is present.
<splitFlag> - Defines an alternative flag to ALL_PUNCTUATION, if desired (see above)
CHAR_CHANGE - Identifies the vertex as a change between character formats
TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
- The current maximum size of a text block is 64K characters.
- Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, new-lines, carriage returns, etc.)

Resource Data

The data used by the dictionary may come from 2 sources:

Dataset
Plain text file

Both options are accessed through Saga Server or the endpoints of the service. To create a dictionary from a dataset, select the dataset you are interested in and the pipeline that will process it. The pipeline must end with a Spellchecker Stage. To create a dictionary from a file, you only need a plain text file with one term per line.

Resource Format

Saga_json

Title	Entity Json Format

"_id" : "KGAAJGsBemSwA0nZTLXA",
"tag": "recipe",
"pattern": "("how many"|"how much") {ingredient} ",
"confAdjust": 0.95

. . . additional fields as needed go here . . .

Note
Multiple entries can have the same pattern. If the pattern is matched, it will be tagged with multiple (ambiguous) entry IDs. Additional fielded data can be added to the record as needed by downstream processes.

Fields

Parameter
summary What to show the user when browsing this entity
name display
required true
Parameter
summary Tag which will identify any match in the graph, as an interpretation
name tag
required true
- These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
  Tip
  Tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area}
Parameter
summary Pattern to match in the content
name pattern
required true

Dictionary Plain Text File

abraham
lincoln
likes
macaroni
and
cheese
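To show how a plain text dictionary like the one above relates to the example output, here is a minimal Python sketch that parses the file content and proposes a suggestion for an unknown word. It uses the standard library's `difflib` as a stand-in for the Elasticsearch suggestions functionality, so the matching behavior and the 0.8 cutoff are assumptions. The names `load_terms` and `suggest` are hypothetical.

```python
import difflib

# Same content as the plain text dictionary file: one term per line.
DICTIONARY_TEXT = """abraham
lincoln
likes
macaroni
and
cheese
"""


def load_terms(text):
    """Return the non-empty lines, lower-cased and de-duplicated in order."""
    seen = {}
    for line in text.splitlines():
        term = line.strip().lower()
        if term:
            seen.setdefault(term, None)
    return list(seen)


def suggest(token, terms, cutoff=0.8):
    """Return the closest dictionary term, or None if the token is known
    or nothing is similar enough (difflib stands in for Elasticsearch)."""
    word = token.lower()
    if word in terms:
        return None
    matches = difflib.get_close_matches(word, terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


terms = load_terms(DICTIONARY_TEXT)
print(suggest("makaroni", terms))  # macaroni
print(suggest("cheese", terms))    # None, the word is already in the dictionary
```

In the stage itself, a word such as "makaroni" would be flagged MISSPELL and the new "macaroni" token would carry the SUGGESTION flag.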

Include Page

	Generic Resource Fields
