This stage is a work in progress; expect things to break while using it.
This stage reviews tokens using Elasticsearch's suggestion functionality and creates a new token with a "suggestion" for each word it does not recognize. The process takes all the tokens available to the stage (usually already tokenized by the "WhitespaceTokenizerStage"), using the highest confidence route. Flags such as "STOP_WORD" or "ALL_UPPER_CASE" can be used as filters by including them in the "Skip Flags" list.
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
This recognizer requires a dictionary to work, so the dictionary must be loaded from either a dataset or a file before the stage is used. Check that your Elasticsearch version is compatible with this stage.
{ "index": "saga_spellchecker_dictionary", "schema": "http", "host": "localhost", "port": "9200" }
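As a rough illustration of how these connection settings could be used, the sketch below assembles a search URL from the configuration and builds a request body for Elasticsearch's term suggester. The exact query the stage sends is not documented here, so treat this as an assumption; in particular, the field name "term" and the label "spellcheck" are hypothetical placeholders.

```python
import json

# Connection settings copied from the example configuration.
config = {
    "index": "saga_spellchecker_dictionary",
    "schema": "http",
    "host": "localhost",
    "port": "9200",
}


def build_suggest_request(config, text, field="term"):
    """Return (url, body) for an Elasticsearch term-suggester query.

    `field` is a hypothetical field name; the real dictionary index
    may store its terms under a different one.
    """
    url = "{schema}://{host}:{port}/{index}/_search".format(**config)
    body = {
        "suggest": {
            "spellcheck": {  # arbitrary label chosen for this suggestion
                "text": text,
                "term": {"field": field},
            }
        }
    }
    return url, body


url, body = build_suggest_request(config, "makaroni")
print(url)
print(json.dumps(body, indent=2))
```

Sending the body to the URL (for example with an HTTP client of your choice) is left out so that the sketch runs without a live Elasticsearch instance.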
V--------------[abraham lincoln likes makaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[makaroni]--V--[and]--V--[cheese]--^
                                        ^--[macaroni]--^
No vertices are created in this stage
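The flow in the example above can be sketched in a few lines. This is not the stage's implementation: the function name, the token representation as (text, flags) pairs, and the in-memory lookup that stands in for Elasticsearch are all assumptions made for illustration.

```python
def suggest_tokens(tokens, lookup, skip_flags=frozenset()):
    """Return (position, suggestion) pairs for tokens that get a correction.

    tokens     -- list of (text, flags) pairs, where flags is a set of strings
    lookup     -- callable returning a suggested spelling or None
                  (a stand-in for the Elasticsearch suggestion call)
    skip_flags -- tokens carrying any of these flags are not checked
    """
    results = []
    for position, (text, flags) in enumerate(tokens):
        if flags & skip_flags:
            continue  # filtered out by the "Skip Flags" list
        suggestion = lookup(text)
        if suggestion is not None and suggestion != text:
            results.append((position, suggestion))
    return results


# Hypothetical dictionary lookup: only one misspelling is known.
known_fixes = {"makaroni": "macaroni"}

tokens = [(word, {"TOKEN"}) for word in
          "abraham lincoln likes makaroni and cheese".split()]
tokens[4] = ("and", {"TOKEN", "STOP_WORD"})

print(suggest_tokens(tokens, known_fixes.get, skip_flags={"STOP_WORD"}))
# prints [(3, 'macaroni')]
```

The suggestion is reported alongside the original token's position, mirroring how the new [macaroni] token sits in parallel with [makaroni] between the same two vertices.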
The data used by the dictionary may come from two sources: a dataset or a file.
Both options are accessed through Saga Server or the endpoints of the service. To create a dictionary from a dataset, select the dataset you are interested in and then select the pipeline to process it. Remember that the pipeline must end with a Spellchecker Stage. To create a dictionary from a file, you only need a plain text file with one term per line.
abraham
lincoln
likes
macaroni
and
cheese
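Before uploading a dictionary file, it can help to check that it really contains one term per line. The helper below is a minimal sketch of such a check; the function name is hypothetical, and whether Saga itself trims whitespace or removes duplicate terms is not stated here, so this is only a local pre-upload cleanup.

```python
def parse_dictionary(text):
    """Split newline-separated text into a list of unique terms, keeping order."""
    terms = []
    seen = set()
    for line in text.splitlines():
        term = line.strip()
        if term and term not in seen:  # drop blank lines and repeats
            seen.add(term)
            terms.append(term)
    return terms


sample = "abraham\nlincoln\nlikes\nmacaroni\nand\ncheese\n"
print(parse_dictionary(sample))
# prints ['abraham', 'lincoln', 'likes', 'macaroni', 'and', 'cheese']
```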