Built-in Stages

These stages are included in the Saga Core library and are always available.

Text Block Readers

Readers read text streams and create text blocks to process.

Breakers read text blocks and break them into smaller text blocks.

QuotationBreaker - Breaks TEXT_BLOCK tokens into other TEXT_BLOCK tokens, separating the non-quoted text from the quoted text. This breaker respects the grammatical rules of quotes.
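The idea of breaking a text block on quotes can be sketched in a few lines of Python. This is only an illustration, not Saga's implementation: the function name break_on_quotes is hypothetical, only straight double quotes are handled, and the grammatical rules the real stage respects (apostrophes, nested quotes, and so on) are ignored.

```python
import re


def break_on_quotes(text_block: str) -> list[str]:
    """Split a text block into quoted and non-quoted segments (illustrative only)."""
    # The capturing group makes re.split keep the quoted segments in the result.
    parts = re.split(r'("[^"]*")', text_block)
    # Drop empty pieces and trim the whitespace left around each segment.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(break_on_quotes('He said "hello there" and left.'))
```

Each returned segment would become its own text block in this simplified model.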

Tokenizers read text blocks and divide them up into individual tokens to be processed.

WhitespaceTokenizer - Splits text blocks into separate tokens on runs of one or more whitespace characters.
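As a rough illustration of this behavior (not Saga code; whitespace_tokenize is a hypothetical name), Python's str.split with no arguments treats any run of whitespace as a single separator:

```python
def whitespace_tokenize(text_block: str) -> list[str]:
    """Split a text block into tokens on runs of whitespace (illustrative only)."""
    # str.split() with no separator collapses consecutive spaces, tabs, and
    # newlines, and never returns empty tokens.
    return text_block.split()


if __name__ == "__main__":
    print(whitespace_tokenize("the  quick\tbrown\n fox"))
```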

Splitters split up tokens into multiple smaller tokens as an alternative interpretation.

CharacterSplitter - Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits).
CharChangeSplitter - Separates tokens at character changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. The split falls between the two characters, so no character at the boundary (vertex) is dropped, and a capital letter stays with the token it begins.
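The two splitting strategies can be sketched as follows. This is a simplified model under stated assumptions, not the Saga API: the function names are hypothetical, and the character classes used for change detection (letter, digit, other) are a guess at a reasonable reading of the description.

```python
import re


def split_on_characters(token: str, split_chars: str) -> list[str]:
    """Split on any of split_chars; consecutive split characters make one split."""
    pattern = "[" + re.escape(split_chars) + "]+"
    return [piece for piece in re.split(pattern, token) if piece]


def _char_class(ch: str) -> str:
    """Coarse character class used to detect a change (assumed classes)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def split_on_char_change(token: str) -> list[str]:
    """Split between characters where the class or case changes; drop nothing."""
    if not token:
        return []
    pieces = [token[0]]
    for prev, cur in zip(token, token[1:]):
        lower_to_upper = prev.islower() and cur.isupper()
        class_change = _char_class(prev) != _char_class(cur)
        if lower_to_upper or class_change:
            pieces.append(cur)   # start a new piece at the boundary
        else:
            pieces[-1] += cur    # extend the current piece
    return pieces


if __name__ == "__main__":
    print(split_on_characters("one,,two;three", ",;"))
    print(split_on_char_change("camelCase42!"))
```

In the second function the split is placed between characters, so every character of the original token appears in exactly one piece.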

Normalizers create alternative normalized interpretations of tokens from original tokens.

CaseAnalysis - Analyzes the case of every token (adds additional flags) and optionally adds a lower case normalized token to the interpretation graph.
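A minimal sketch of case analysis, assuming hypothetical flag names (ALL_UPPER, ALL_LOWER, TITLE_CASE, MIXED_CASE) that are not taken from Saga, might look like this. The optional lowercase form is returned as an alternative rather than replacing the original token, mirroring the idea of adding an interpretation to the graph.

```python
def analyze_case(token: str, add_lowercase: bool = True) -> tuple[str, list[str]]:
    """Return a case flag and optional lowercase alternatives (illustrative only)."""
    if token.isupper():
        flag = "ALL_UPPER"
    elif token.islower():
        flag = "ALL_LOWER"
    elif token[:1].isupper() and token[1:].islower():
        flag = "TITLE_CASE"
    else:
        flag = "MIXED_CASE"

    alternatives = []
    # Only add a normalized form when it differs from the original token.
    if add_lowercase and token != token.lower():
        alternatives.append(token.lower())
    return flag, alternatives


if __name__ == "__main__":
    for word in ["Paris", "NASA", "cat", "iPhone"]:
        print(word, analyze_case(word))
```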

Recognizers identify and flag tokens based on their character patterns.

NumberRecognizer - Identifies tokens that look like numbers and flags the tokens with the "NUMBER" flag.
StopWords - Flags tokens that match stop words. The flagged tokens are skipped in subsequent stages if the configuration says so.
Lemmatize - Matches tokens to words in a dictionary, then creates new LexItems for the token's lemma and/or synonyms, if configured.
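The flagging behavior of the number and stop-word recognizers can be illustrated with a short sketch. Only the "NUMBER" flag name comes from the description above; the "STOP_WORD" flag, the number pattern, the stop-word list, and the function name are assumptions made for the example.

```python
import re

# Assumed notion of "looks like a number": optional sign, digits, optional decimals.
NUMBER_PATTERN = re.compile(r"[+-]?\d+(\.\d+)?")

# Hypothetical stop-word list; a real deployment would load this from configuration.
STOP_WORDS = {"the", "a", "an", "of"}


def flag_tokens(tokens: list[str]) -> list[tuple[str, set[str]]]:
    """Attach recognizer flags to each token without changing the token itself."""
    flagged = []
    for token in tokens:
        flags = set()
        if NUMBER_PATTERN.fullmatch(token):
            flags.add("NUMBER")
        if token.lower() in STOP_WORDS:
            flags.add("STOP_WORD")
        flagged.append((token, flags))
    return flagged


if __name__ == "__main__":
    print(flag_tokens(["The", "price", "of", "42.50"]))
```

A later stage could then choose to skip any token carrying the stop-word flag, which is the behavior the configuration option controls.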

Taggers create semantic tags which are added to the interpretation graph as alternative interpretations.

RegexPattern - Looks up matches to regular expressions in a dictionary across multiple tokens and then tags the match with one or more semantic tags as an alternative representation. For a simple regular expression where a match only needs to occur against a single token, the Simple Regex Stage is recommended.
DictionaryTagger - Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc. This stage is also known as the "Entity Recognizer".
AdvancedPattern - Matches advanced recursive patterns of tokens and semantic tags. Pattern databases can be very large (millions of entries).
Fragmentation - Identifies patterns with a combination of any number of specified tokens, regardless of the surrounding tokens.
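To make the dictionary-lookup style of tagging concrete, here is a minimal sketch of matching token sequences against a dictionary. It is not the Saga implementation: the function name, the in-memory dict, the lowercase matching, and the max_len parameter are all assumptions for illustration. Matches are reported as spans so that the original tokens remain available as an alternative interpretation.

```python
def tag_entities(
    tokens: list[str],
    dictionary: dict[str, list[str]],
    max_len: int = 3,
) -> list[tuple[int, int, str]]:
    """Return (start, end, tag) spans for token sequences found in the dictionary."""
    matches = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            # Normalize the candidate sequence the same way the dictionary keys are.
            key = " ".join(tokens[start:end]).lower()
            for tag in dictionary.get(key, []):
                matches.append((start, end, tag))
    return matches


if __name__ == "__main__":
    entity_dictionary = {
        "john smith": ["{person}"],
        "new york": ["{place}"],
    }
    sentence = ["John", "Smith", "visited", "New", "York"]
    print(tag_entities(sentence, entity_dictionary))
```

The end index is exclusive, so tokens[start:end] recovers the matched sequence.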