Bag of Words

Creates a bag-of-words / TF-IDF tag with the vector information for the document/text_block/sentence. Accumulates the vector until the engine cannot read any further.
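
The counting idea behind a bag-of-words vector can be sketched in plain Python. This is only an illustration, not the processor's implementation: the function name and the case-insensitive counting are assumptions, and TF-IDF weighting is left out.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs, ignoring case (assumed behavior)."""
    return Counter(token.lower() for token in tokens)

vector = bag_of_words(["The", "cat", "saw", "the", "dog"])
print(vector)
```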

Case Analysis

Analyzes the case of every token and optionally adds a lowercase-normalized token to the interpretation graph.
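
A minimal sketch of case analysis in plain Python follows. The case labels ("lower", "upper", "title", "mixed", "none") and both function names are hypothetical; the actual processor may classify tokens differently.

```python
def analyze_case(token):
    """Return a hypothetical case label for a token."""
    if not any(ch.isalpha() for ch in token):
        return "none"          # no letters, so no case to report
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    return "mixed"

def lowercase_variant(token):
    """Return the lowercase form, or None when the token is already lowercase."""
    lowered = token.lower()
    return lowered if lowered != token else None

for word in ["NASA", "Paris", "iPhone", "42"]:
    print(word, analyze_case(word), lowercase_variant(word))
```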

Character Change Splitter

Splits tokens at character changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. The split takes no character as the vertex and respects the capital letter.
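
The splitting rule can be approximated with a regular expression. This sketch assumes ASCII input and a hypothetical function name; it keeps every character (nothing is dropped at the split point) and leaves a leading capital attached to the letters that follow it.

```python
import re

# One alternative per character class: capitalized or lowercase letter runs,
# digit runs, and runs of anything else that is not whitespace.
_CLASS_RUNS = re.compile(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+|[^A-Za-z0-9\s]+")

def split_on_character_change(token):
    """Split a token wherever the character class changes (ASCII only)."""
    return _CLASS_RUNS.findall(token)

print(split_on_character_change("camelCase"))
print(split_on_character_change("abc123"))
print(split_on_character_change("wi-fi"))
```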

Character Collapser

Creates a new token from the original, excluding the characters in the "Characters To Collapse" list.
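
As an illustration only, collapsing can be written as a filter over the token's characters. The function name and the `chars_to_collapse` parameter are hypothetical stand-ins for the "Characters To Collapse" setting.

```python
def collapse_characters(token, chars_to_collapse):
    """Return a copy of the token without the listed characters."""
    return "".join(ch for ch in token if ch not in chars_to_collapse)

print(collapse_characters("U.S.A.", "."))
```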

Character Splitter

Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits) when using the "Split as Vertex" setting.
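
The run-collapsing behavior described above (several split characters in a row producing a single split) can be sketched with a regular expression. The function name and `split_chars` parameter are hypothetical, and the sketch does not model the interpretation graph or the "Split as Vertex" setting itself.

```python
import re

def split_on_characters(token, split_chars):
    """Split on any run of the given characters, dropping empty pieces."""
    if not split_chars:
        return [token]
    pattern = "[" + re.escape(split_chars) + "]+"
    return [part for part in re.split(pattern, token) if part]

print(split_on_characters("a--b_c", "-_"))
```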

Language Detector

This processor uses the Apache OpenNLP Language Detector model to identify the language of the original text and tag it using the format {LANG_<language_ISO>}.

This is a plugin processor. It uses the Language Detector Stage.

Lemmatize

Lemmatizes tokens by matching them to words in a dictionary.

N-Gram

Creates N-Grams of TOKEN-flagged lexical items. The size of the N-Gram is set by minimum and maximum settings. N-Grams are also broken on SPLIT_FLAGS.
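
A minimal sketch of N-Gram generation between a minimum and maximum size, in plain Python. The function and parameter names are hypothetical, and breaking on split flags is not modeled here.

```python
def ngrams(tokens, min_size, max_size):
    """Return every contiguous run of tokens with min_size..max_size items."""
    result = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(tuple(tokens[start:start + size]))
    return result

print(ngrams(["a", "b", "c"], 1, 2))
```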


Parts Of Speech

Tags each word in a text (corpus) with its part of speech, such as noun, verb, or adjective, based on its definition as well as its context. Uses the POS Tagger from OpenNLP (https://opennlp.apache.org/).

Python Model

Connects directly to the Python Bridge to send text or sections of the interpretation graph to be processed by ML algorithms in Python.

Quotation Breaker

Breaks TEXT_BLOCK tokens into other TEXT_BLOCK tokens, separating the non-quoted text from the quoted text. This breaker respects the grammatical rules of quotes.
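
As a rough illustration, the separation of quoted and non-quoted text can be sketched with a regular expression. This simplified version is an assumption-laden stand-in: it only handles straight double quotes, does not implement grammatical quote rules, and uses a hypothetical function name.

```python
import re

def break_quotes(text_block):
    """Split a text block into quoted and non-quoted pieces, in order."""
    pieces = re.split(r'("[^"]*")', text_block)   # capture keeps the quoted parts
    return [piece.strip() for piece in pieces if piece.strip()]

print(break_quotes('He said "hello there" and left.'))
```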

Remove Accents

Identifies accents in tokens, then creates a new token without the accents, using the most similar letter as a replacement.
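
One common way to strip accents, shown here as a sketch rather than as the processor's actual method, is Unicode decomposition followed by removal of the combining marks. The function name is hypothetical.

```python
import unicodedata

def remove_accents(token):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents("café"))
```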


Sentence Breaker (OpenNLP)

This processor splits text blocks by punctuation. It uses the Apache OpenNLP Sentence Detector to identify the punctuation marks that define the end of a sentence.

This is a plugin processor. It uses the Sentence Breaker Stage.

Sentence Breaker (Text Breaker)

Breaks a text block into sentences using Java's BreakIterator, which breaks sentences on punctuation delimiters. Delimiters can be configured as "breakers".
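
The idea of configurable "breakers" can be illustrated with a regular-expression split in Python. This is a substitute technique for illustration, not Java's BreakIterator, and the function name and default breaker set are assumptions.

```python
import re

def break_sentences(text, breakers=".!?"):
    """Split text after any breaker character that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(breakers) + r"])\s+"
    return [sentence for sentence in re.split(pattern, text.strip()) if sentence]

print(break_sentences("Hi there. How are you? Fine!"))
```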

Sentence Filter

This stage flags vertices with "SKIP_SENTENCE". The flagged vertex is the start of the sentence, so a later stage can use the flag to ignore the complete sentence.

The conditions evaluated by the processor are:

  • Sentence length, measured as the number of tokens, not vertices.
  • A list of tags that act as exceptions to the count: if one of these tags is found within the sentence, the count is ignored and the sentence is not flagged (allow listing).
  • A list of tags that cause the sentence to be flagged if found in it (deny listing).

Deny listing a tag always takes precedence over the other values, so any sentence with a deny-listed tag is always flagged as "SKIP_SENTENCE". Allow-listed tags always take precedence over the token limit restriction. The token limit restriction applies last.

Sentence Filter flags the initial vertex of the sentence with a "SKIP_SENTENCE" flag; it does not remove the sentence from the interpretation graph.
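
The precedence order (deny list, then allow list, then token limit) can be sketched as a small function. All names are hypothetical, and because the direction of the token limit is not stated here, the sketch assumes a `max_tokens` upper bound above which a sentence is flagged.

```python
def should_skip_sentence(token_count, sentence_tags, max_tokens,
                         allow_tags, deny_tags):
    """Decide whether a sentence would receive the SKIP_SENTENCE flag."""
    tags = set(sentence_tags)
    if tags & set(deny_tags):      # deny listing always wins
        return True
    if tags & set(allow_tags):     # allow listing overrides the token limit
        return False
    return token_count > max_tokens  # assumed: limit is an upper bound

print(should_skip_sentence(50, {"person"}, 10, {"person"}, {"spam"}))
```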

Stop Words

This stage flags tokens as Stop-Words when they match any entry in the stop words dictionary. The flagged tokens will be skipped in subsequent stages (if so indicated in the configuration).
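
A minimal sketch of stop-word flagging: each token is paired with a boolean instead of being removed, so later steps can decide whether to skip it. The function name and the case-insensitive match are assumptions.

```python
def flag_stop_words(tokens, stop_words):
    """Pair each token with True when it appears in the stop-word set."""
    return [(token, token.lower() in stop_words) for token in tokens]

flagged = flag_stop_words(["The", "cat", "is", "here"], {"the", "is"})
# A later step that honors the flag keeps only the unflagged tokens.
kept = [token for token, is_stop in flagged if not is_stop]
print(flagged, kept)
```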

Synonyms

Detects previously specified synonyms, then generates a normalized token with the synonym.
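
As an illustration, synonym normalization can be modeled as a dictionary lookup that yields an extra normalized form next to the original token. The function name, the dictionary layout, and the case-insensitive lookup are assumptions.

```python
def normalize_synonyms(tokens, synonyms):
    """Pair each token with its normalized form, or None when no synonym matches."""
    return [(token, synonyms.get(token.lower())) for token in tokens]

print(normalize_synonyms(["NYC", "is", "big"], {"nyc": "new york city"}))
```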

Whitespace Tokenizer

Splits the text into separate tokens on any amount of whitespace.
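
In Python, the same behavior is available from `str.split()` with no arguments, which treats any run of whitespace as a single separator. The wrapper function name is hypothetical.

```python
def whitespace_tokenize(text):
    """Split text on runs of spaces, tabs, and newlines."""
    return text.split()

print(whitespace_tokenize("one  two\tthree\nfour"))
```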
