Stages can be linked together into language processing pipelines that process text and create interpretation graphs.  See Pipelines and Pipeline Configuration for more details.


Built-in Stages


These stages are contained inside the Saga Core library and are available at all times.

Text Block Readers

Readers read text streams and create text blocks to process.

  • Simple Reader - Reads a text stream and outputs a list of text blocks.

Text Block Breakers

Breakers read text blocks and break them into smaller text blocks.

  • Quotation Breaker - Breaks TEXT_BLOCK tokens into other TEXT_BLOCK tokens, separating the non-quoted text from the quoted text. This breaker respects the grammatical rules of quotation.
  • Text Breaker - Breaks a text block into sentences using Java's BreakIterator. This is used to break sentences on punctuation delimiters. Delimiters can be configured as "breakers".
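As an illustration of the underlying JDK mechanism (not the stage's own configuration or API), java.text.BreakIterator splits text into sentences like this:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class SentenceBreakDemo {
        public static void main(String[] args) {
            String text = "Saga builds interpretation graphs. Stages enrich them.";
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            // Walk the boundaries and print one sentence per line.
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                System.out.println(text.substring(start, end).trim());
            }
        }
    }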

Tokenizers

Tokenizers read text blocks and divide them up into individual tokens to be processed.

Splitters

Splitters split up tokens into multiple smaller tokens as an alternative interpretation.

  • Character Splitter - Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits).

  • Char Change Splitter - Separates tokens based on character changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. No characters are dropped from the resulting vertices, and the original capitalization is preserved.
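A rough illustration of character-change splitting using plain Java regex look-arounds; this is only a sketch of the behavior, not the stage's implementation:

    // Split where the character class changes: lower -> upper, letter -> digit, digit -> letter.
    String token = "SagaCore2023";
    String[] parts = token.split(
        "(?<=\\p{Lower})(?=\\p{Upper})|(?<=\\p{L})(?=\\p{N})|(?<=\\p{N})(?=\\p{L})");
    // parts -> ["Saga", "Core", "2023"]; no characters are consumed and case is preserved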

Collapsers

Collapsers reduce tokens into simpler smaller tokens as an alternative interpretation.

  • Character Collapser - Creates a new token from the original, excluding the characters listed in the "characters" parameter.
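For example, assuming '-' and '.' were placed in the "characters" list, the collapsed alternative could look like this (an illustration of the idea only, not the stage's API):

    // Collapse a token by deleting the configured characters.
    String original = "U.S.-based";
    String collapsed = original.replaceAll("[.\\-]", "");  // -> "USbased"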

Normalizers

Normalizers create alternative normalized interpretations of tokens from original tokens.

  • Case Analysis - Analyzes the case of every token (adds additional flags) and optionally adds a lower case normalized token to the interpretation graph.
  • Synonym - Detects specified synonyms and generates a normalized token.
  • Remove Accents - Identifies accents in tokens and creates a new token without the accents, using the most similar letter as a replacement.
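A minimal sketch of accent removal using the JDK's java.text.Normalizer; the stage itself may differ in details such as letter replacement rules:

    import java.text.Normalizer;

    String word = "café";
    String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD); // split letters from accents
    String withoutAccents = decomposed.replaceAll("\\p{M}", "");         // drop combining marks -> "cafe"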

Recognizers

Recognizers identify and flag tokens based on their character patterns.

  • Number Recognizer - Identifies tokens that look like numbers and flags the tokens with the "NUMBER" flag.
  • Stop Words - Flags tokens that match stop words. The flagged tokens will be skipped in subsequent stages (if so indicated in the configuration).
  • Lemmatize - Matches tokens to words in a dictionary, then creates new LexItems for the token's lemma and/or synonyms if configured.
  • Synonym Stage - Detects specified synonyms and generates a normalized token.
  • ABA Recognizer - Implements an entity extractor for ABA (American Bankers Association) routing transit numbers (RTNs). ABA RTNs are only for use in payment transactions within the United States. They are used on paper checks, wire transfers, and ACH transactions.
  • BIC Recognizer - Implements an entity extractor for Bank/Business Identifier Codes. These codes are assigned to each bank and/or business in every country and are administered by the Society for Worldwide Interbank Financial Telecommunication (SWIFT).
  • IBAN Recognizer -  Implements an entity extractor for International Bank Account Numbers. These codes are assigned to individual bank accounts (mostly EU, Middle East, & Caribbean).
  • Date Time Recognizer - Identifies tokens that look like dates or time indicators and flags them with the "DATE" flag.
  • Email Recognizer - Identifies tokens that look like email addresses and flags them with the "EMAIL" flag.
  • Phone Number - Identifies tokens that look like phone numbers and flags them as "PHONE".
  • Postal Code - Identifies tokens that look like postal codes and flags them as "POSTCODE". The stage supports codes of up to 5 characters, so this includes the US, UK, and Canada, as well as any other country with similar standards.
  • URL Recognizer - Identifies tokens that look like URLs and flags them as "URL".
  • Federal Recognizer - Detects federal identification numbers such as the U.S. SSN, Canada SIN, UK NINo, and Costa Rica cédula.
  • IP Address Recognizer - Identifies Internet Protocol (IP) addresses, both IPv4 and IPv6.
  • Latitude Longitude Recognizer - Identifies latitude and longitude coordinates, including the cardinal direction.
  • MAC Address Recognizer - Identifies media access control (MAC) addresses. MAC addresses are recognizable as six groups of two hexadecimal digits, separated by hyphens, colons, or no separator at all (see the sketch after this list).
  • MAID Recognizer - Identifies the formats of Global Device Advertising Identifiers (e.g. IDFA, GAID, Roku ID) used in the digital advertising ecosystem.
  • Credit Card Recognizer - Identifies credit card numbers, adding the issuer as part of the metadata.
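As a rough idea of what the MAC Address Recognizer looks for, here is an illustrative regular expression in plain Java; it is not the stage's actual implementation:

    import java.util.regex.Pattern;

    // Six groups of two hex digits, separated by ':' or '-', or written with no separator at all.
    Pattern mac = Pattern.compile(
        "\\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\\b|\\b[0-9A-Fa-f]{12}\\b");
    System.out.println(mac.matcher("00:1A:2B:3C:4D:5E").find()); // true
    System.out.println(mac.matcher("001A2B3C4D5E").find());      // true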

Taggers

Taggers create semantic tags which are added to the interpretation graph as alternative interpretations.

  • Regex Pattern - Looks up matches to regular expressions in a dictionary across multiple tokens and then tags the match with one or more semantic tags as an alternative representation. For a simple regular expression where a match only needs to occur against a single token, the Simple Regex stage is recommended.
  • Simple Regex - Matches regular expressions against individual tokens and tags the matches with one or more semantic tags.
  • Dictionary Tagger - Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc.  This stage is also known as the "Entity Recognizer".
  • Advanced Pattern - Matches advanced recursive patterns of tokens and semantic tags. Pattern databases can be very large (millions of entries).
  • Fragmentation - Identifies patterns with a combination of any number of specified tokens, regardless of the surrounding tokens.
  • GeoNames -  Identifies geo locations, based on the patterns loaded.
  • Token Matcher - Works in a similar way to the Dictionary Tagger stage, in the sense that it looks up sequences of tokens in a dictionary to match the text being processed. The difference is that it will also include in the matching text N tokens to the right and/or left of the original matched text.

Transformers

Transformers generate tags that are not semantic in nature but carry new data for later use.

  • Bag Of Words - Creates a bag-of-words / TF-IDF tag with the vector information for the document, text block, or sentence. Accumulates the vector until the engine cannot read any further.
  • Best Bets - Maintains a list of tokens used to identify possible subjects of interest and suggests a URL reference along with a "title" and "description". The title and description fields are used as display data.
  • NGram - Creates n-grams of size min-max from TOKEN lex items. Breaks n-grams on SPLIT_FLAGS.
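A hedged sketch of the n-gram idea over a plain list of strings; the actual stage works on TOKEN lex items in the interpretation graph and honors SPLIT_FLAGS:

    import java.util.ArrayList;
    import java.util.List;

    // Build all n-grams of size min..max from a token list.
    static List<String> ngrams(List<String> tokens, int min, int max) {
        List<String> result = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                result.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return result;
    }
    // ngrams(List.of("new", "york", "city"), 2, 3) -> ["new york", "york city", "new york city"]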

Python

Producers

Producers create consumable output based on the processed graph.

  • Json Producer - The JSON Producer stage, as its name suggests, produces a JSON array representation of TEXT_BLOCK items. Output can be filtered to entities only or to all tokens. Access to the produced output is done programmatically (see below).
  • Markup Producer - The Markup Producer stage produces strings of TEXT_BLOCK items with normalized tags, based on the graph path with the highest confidence.

Filters

Filters mark vertices or tokens to be skipped by other stages.

  • Sentence Filter Producer - Flags vertices with "Skip-Sentence". The flag is placed on the vertex at the start of the sentence and can be used by a later stage to ignore the complete sentence.



Add-on Stages


These stages are external to the Saga Core library and need to be added as dependencies to your application.

Text Block Breakers

Breakers read text blocks and break them into smaller text blocks.

  • Sentence Breaker - Breaks a text block into sentences, using the OpenNLP Sentence Detector.
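The OpenNLP call this breaker builds on looks roughly like the following; the model file name is only an example of a pre-trained OpenNLP sentence model:

    import java.io.FileInputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceBreakExample {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("en-sent.bin")) {
                SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
                String[] sentences = detector.sentDetect("Dr. Smith arrived late. He gave a talk.");
                // sentences -> ["Dr. Smith arrived late.", "He gave a talk."]
            }
        }
    }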

Recognizers

Recognizers identify and flag tokens based on their character patterns.

  • Parts Of Speech - Tags each word in a text (corpus) as corresponding to a particular part of speech, such as noun, verb, or adjective, based on its definition as well as its context. Uses OpenNLP (https://opennlp.apache.org/) and its POS Tagger.
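For reference, plain OpenNLP POS tagging works along these lines; the model file name is an example of a pre-trained model:

    import java.io.FileInputStream;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosTagExample {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("en-pos-maxent.bin")) {
                POSTaggerME tagger = new POSTaggerME(new POSModel(in));
                String[] tokens = {"Saga", "tags", "every", "token"};
                String[] tags = tagger.tag(tokens);   // e.g. ["NNP", "VBZ", "DT", "NN"]
            }
        }
    }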

Spell Checkers

Spell checkers process specific tokens to identify misspellings and add alternatives to the interpretation graph.

  • Spellchecker - This stage review tokens using Elasticsearch suggestions functionality and creates a new token with a "suggestion" for word it does not recognize.

Language Detectors

Language detectors use OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.
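A minimal sketch of the underlying OpenNLP language detection call; the model file name refers to the model published by the OpenNLP project and should be treated as an example:

    import java.io.FileInputStream;
    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class LanguageDetectExample {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("langdetect-183.bin")) {
                LanguageDetectorME detector = new LanguageDetectorME(new LanguageDetectorModel(in));
                Language best = detector.predictLanguage("Ceci est un texte en français.");
                System.out.println(best.getLang() + " " + best.getConfidence()); // e.g. "fra 0.98"
            }
        }
    }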

Machine Learning

These stages load a ML model and evaluate input text through Saga.

  • Name Entity Recognizer - The name predictor stage uses OpenNLP's NameFinder to load named entity recognition models and tags tokens that match entities from the model, given a certain accuracy threshold (see the sketch after this list).
  • Sentence Classifier - The sentence classifier stage uses OpenNLP's DocumentCategorizer to load classification models and tags sentences that match the binary classification model (is or isn't in a certain category), given a specified accuracy threshold.
  • FAQ - The FAQ stage performs a semantic comparison of a sentence against questions and their respective answers (using TensorFlow); if the confidence value is within the threshold, it creates a tag holding the question and answer.
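As referenced under Name Entity Recognizer above, the underlying OpenNLP calls look roughly like this; the model file name is an example of a pre-trained person-name model:

    import java.io.FileInputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NameFinderExample {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("en-ner-person.bin")) {
                NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
                String[] tokens = {"John", "Smith", "works", "at", "Acme", "."};
                Span[] names = finder.find(tokens);     // spans covering "John Smith"
                double[] probs = finder.probs(names);   // per-span confidence, compared against the threshold
            }
        }
    }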