
These stages are contained inside the Saga Core library and are available at all times.

Text Block Readers

Readers read text streams and create text blocks to process.

  • Simple Reader - Reads a text stream and outputs a list of text blocks.

Text Block Breakers

Breakers read text blocks and break them into smaller text blocks.

  • Quotation Breaker - Breaks TEXT_BLOCK tokens into other TEXT_BLOCK tokens, separating the non-quoted text from the quoted text. This breaker respects the grammatical rules of quotes.
  • Text Breaker - Breaks a text block into sentences using Java's BreakIterator. Sentences are broken on punctuation delimiters, which can be configured as "breakers" (a minimal sketch of the BreakIterator mechanism follows this list).
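
The Text Breaker builds on the JDK's sentence BreakIterator. The following standalone sketch illustrates only that underlying mechanism; it is not the Saga stage itself, and the class and method names are invented for the example.

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Illustrative only: shows how java.text.BreakIterator splits a text
    // stream into sentence-sized blocks.
    public class SentenceBreakExample {
        public static List<String> breakSentences(String text) {
            List<String> sentences = new ArrayList<>();
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                sentences.add(text.substring(start, end).trim());
            }
            return sentences;
        }

        public static void main(String[] args) {
            System.out.println(breakSentences("Readers create blocks. Breakers split them! Is that all?"));
            // [Readers create blocks., Breakers split them!, Is that all?]
        }
    }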

Tokenizers

Tokenizers read text blocks and divide them up into individual tokens to be processed.

Splitters

Splitters split up tokens into multiple smaller tokens as an alternative interpretation.

  • Character Splitter - Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits).

  • Char Change Splitter - Splits tokens at character-type changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. The split does not discard any characters from the token, and the original capitalization is preserved (see the sketch after this list).
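
To make the splitting rules concrete, here is a minimal, self-contained sketch of character-change splitting. It is illustrative only, not the Saga implementation, and the names are invented for the example.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: split a token at lowercase->uppercase,
    // letter<->digit, and alphanumeric<->punctuation boundaries,
    // without dropping any characters.
    public class CharChangeSplitExample {

        private static boolean splitPoint(char prev, char cur) {
            if (Character.isLowerCase(prev) && Character.isUpperCase(cur)) return true; // camelCase
            boolean bothAlnum = Character.isLetterOrDigit(prev) && Character.isLetterOrDigit(cur);
            if (bothAlnum && Character.isLetter(prev) != Character.isLetter(cur)) return true; // letter<->digit
            if (Character.isLetterOrDigit(prev) != Character.isLetterOrDigit(cur)) return true; // alnum<->punct
            return false;
        }

        public static List<String> split(String token) {
            List<String> parts = new ArrayList<>();
            int start = 0;
            for (int i = 1; i < token.length(); i++) {
                if (splitPoint(token.charAt(i - 1), token.charAt(i))) {
                    parts.add(token.substring(start, i));
                    start = i;
                }
            }
            parts.add(token.substring(start));
            return parts;
        }

        public static void main(String[] args) {
            System.out.println(split("WiFi6Router")); // [Wi, Fi, 6, Router]
            System.out.println(split("v2.0-beta"));   // [v, 2, ., 0, -, beta]
        }
    }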

Collapsers

Collapsers reduce tokens into simpler, smaller tokens as an alternative interpretation.

  • Character Collapser - Creates a new token from the original, excluding the characters listed in the "characters" parameter (see the sketch below).
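
A minimal sketch of the collapsing behavior, assuming a configured set of characters to remove (the names and the example removal set are invented for illustration):

    // Illustrative sketch: produce a collapsed form of a token by removing
    // every character found in a configured removal set.
    public class CharacterCollapseExample {
        public static String collapse(String token, String charsToRemove) {
            StringBuilder out = new StringBuilder(token.length());
            for (char c : token.toCharArray()) {
                if (charsToRemove.indexOf(c) < 0) {
                    out.append(c); // keep everything not in the removal set
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(collapse("U.S.A.", "."));  // USA
            System.out.println(collapse("wi-fi", "-."));  // wifi
        }
    }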

Normalizers

Normalizers create alternative, normalized interpretations of original tokens.

  • Case Analysis - Analyzes the case of every token (adding additional flags) and optionally adds a lowercase normalized token to the interpretation graph (a minimal sketch of this behavior follows this list).
  • Synonym - Detects specified synonyms and generates a normalized token.
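
The sketch below illustrates the kind of case analysis described above: classify each token's case and produce a lowercase alternative when the token is not already lowercase. It is illustrative only; the flag names and classes are invented, not the Saga stage's API.

    import java.util.Locale;

    // Illustrative sketch: classify a token's case and add a lowercase
    // normalized form when one is useful.
    public class CaseAnalysisExample {

        enum CaseFlag { LOWER, UPPER, TITLE, MIXED, NONE }

        static CaseFlag analyze(String token) {
            boolean hasUpper = token.chars().anyMatch(Character::isUpperCase);
            boolean hasLower = token.chars().anyMatch(Character::isLowerCase);
            if (!hasUpper && !hasLower) return CaseFlag.NONE;   // e.g. "2024"
            if (!hasUpper) return CaseFlag.LOWER;
            if (!hasLower) return CaseFlag.UPPER;
            return Character.isUpperCase(token.charAt(0)) ? CaseFlag.TITLE : CaseFlag.MIXED;
        }

        public static void main(String[] args) {
            for (String token : new String[] {"NASA", "Paris", "iPhone", "data"}) {
                CaseFlag flag = analyze(token);
                String normalized = flag == CaseFlag.LOWER ? null : token.toLowerCase(Locale.ROOT);
                System.out.println(token + " -> " + flag
                        + (normalized != null ? " (adds \"" + normalized + "\")" : ""));
            }
        }
    }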

Recognizers

Recognizers identify and flag tokens based on their character patterns.

  • Number Recognizer - Identifies tokens that look like numbers and flags them with the "NUMBER" flag (a minimal sketch appears after this list).
  • Stop Words - Flags tokens that match stop words. Flagged tokens can be skipped in subsequent stages (if so indicated in the configuration).
  • Lemmatize - Matches tokens to words in a dictionary and then creates new LexItems for the token's lemma and/or synonyms, if configured.
  • ABA - See the ABA Stage page.
  • BIC - See the BIC Stage page.
  • Date Time - Identifies tokens that look like dates or time indicators and flags them with the "DATE" flag.
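
The following sketch illustrates the idea behind the Number Recognizer: test each token against a numeric pattern and flag the ones that match. It is illustrative only; the pattern and class names are assumptions, not the stage's actual implementation.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Illustrative sketch: flag tokens that look like numbers with "NUMBER".
    public class NumberRecognizerExample {
        // integers, decimals, and simple thousand-separated forms, e.g. 42, 3.14, 1,200
        private static final Pattern NUMBER =
                Pattern.compile("[+-]?\\d{1,3}(,\\d{3})*(\\.\\d+)?|[+-]?\\d+(\\.\\d+)?");

        public static void main(String[] args) {
            Map<String, String> flags = new LinkedHashMap<>();
            for (String token : new String[] {"Invoice", "1,200", "items", "3.14", "x42"}) {
                flags.put(token, NUMBER.matcher(token).matches() ? "NUMBER" : "-");
            }
            System.out.println(flags); // {Invoice=-, 1,200=NUMBER, items=-, 3.14=NUMBER, x42=-}
        }
    }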

Taggers

Taggers create semantic tags which are added to the interpretation graph as alternative interpretations.

  • Regex Pattern - Looks up matches to regular expressions in a dictionary across multiple tokens and then tags the match with one or more semantic tags as an alternative representation. For simple regular expressions where a match only needs to occur against a single token, the Simple Regex Stage is recommended.
  • Dictionary Tagger - Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc. This stage is also known as the "Entity Recognizer" (a minimal sketch follows this list).
  • Advanced Pattern - Matches advanced recursive patterns of tokens and semantic tags. Pattern databases can be very large (millions of entries).
  • Fragmentation - Identifies patterns consisting of any combination of the specified tokens, regardless of the surrounding tokens.
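
To show the dictionary-lookup idea in miniature, here is a greedy longest-match sketch that tags token sequences with a semantic tag such as {company} or {place}. It is illustrative only; the tiny dictionary, class, and method names are invented and do not reflect the Saga stage's API.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: greedy longest-match lookup of token sequences
    // in a small dictionary, emitting a semantic tag for each match.
    public class DictionaryTaggerExample {

        // hypothetical entries; a real dictionary would hold many thousands
        private static final Map<String, String> DICTIONARY = Map.of(
                "new york", "{place}",
                "acme corp", "{company}");

        public static List<String> tag(String[] tokens) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i < tokens.length; ) {
                int matchLen = 0;
                String matchTag = null;
                // try the longest span starting at i first
                for (int len = Math.min(3, tokens.length - i); len >= 1; len--) {
                    String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + len)).toLowerCase();
                    if (DICTIONARY.containsKey(key)) {
                        matchLen = len;
                        matchTag = DICTIONARY.get(key);
                        break;
                    }
                }
                if (matchTag != null) {
                    out.add(matchTag); // alternative interpretation for the whole span
                    i += matchLen;
                } else {
                    out.add(tokens[i]);
                    i++;
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(tag("Acme Corp opened an office in New York".split(" ")));
            // [{company}, opened, an, office, in, {place}]
        }
    }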

Transformers

Transformers generate tags that are not semantic in nature but carry new data for later use.

  • Bag Of Words - Creates a bag-of-words / TF-IDF tag with the vector information for the document, text block, or sentence. The vector is accumulated until the engine cannot read any further (a minimal sketch of the weighting follows this list).
  • Best Bets - Maintains a list of tokens used to identify possible subjects of interest and suggests a URL reference along with a "title" and "description". The title and description fields are used as display data.
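
The sketch below shows the kind of data a bag-of-words / TF-IDF tag carries: per-block term counts weighted by inverse document frequency. It is illustrative only; the class name, the toy blocks, and the exact weighting formula are assumptions, not the stage's implementation.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: build term-frequency vectors per text block and
    // weight them by inverse document frequency (tf * log(N / df)).
    public class BagOfWordsExample {

        public static void main(String[] args) {
            List<String[]> blocks = List.of(
                    "the cat sat on the mat".split(" "),
                    "the dog chased the cat".split(" "));

            List<Map<String, Integer>> tf = new ArrayList<>();
            Map<String, Integer> df = new LinkedHashMap<>(); // document frequency
            for (String[] block : blocks) {
                Map<String, Integer> counts = new LinkedHashMap<>();
                for (String term : block) {
                    counts.merge(term, 1, Integer::sum);
                }
                tf.add(counts);
                counts.keySet().forEach(term -> df.merge(term, 1, Integer::sum));
            }

            int n = blocks.size();
            for (Map<String, Integer> counts : tf) {
                Map<String, Double> vector = new LinkedHashMap<>();
                counts.forEach((term, count) ->
                        vector.put(term, count * Math.log((double) n / df.get(term))));
                System.out.println(vector); // one TF-IDF vector per block
            }
        }
    }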

Producers

Producers create consumable output based on the processed graph.

  • Json Producer - Produces a JSON array representation of TEXT_BLOCK items. Output can be filtered to entities only or to all tokens. Access to the produced output is done programmatically (see below; an illustrative sketch of the output shape follows this list).
  • Markup Producer - Produces strings of TEXT_BLOCK content with normalized tags based on the highest-confidence path through the graph.
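
The sketch below only illustrates the general shape of a JSON array for a tagged text block. It uses Jackson as an assumed dependency and invented field names; it is not the JSON Producer's actual API or output format.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ArrayNode;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    // Illustrative sketch: build a JSON array describing tagged items from a
    // text block (field names are hypothetical).
    public class JsonOutputExample {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            ArrayNode block = mapper.createArrayNode();

            ObjectNode entity = block.addObject();
            entity.put("text", "New York");
            entity.put("tag", "{place}");
            entity.put("confidence", 0.92);

            ObjectNode plain = block.addObject();
            plain.put("text", "office");
            plain.putNull("tag"); // plain token, no semantic tag

            System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(block));
        }
    }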

