Bag of Words

Creates a bag-of-words / TF-IDF tag containing the vector information for the document, text block, or sentence. The vector is accumulated until the engine has no more text to read.

Uses Bag of Words Stage
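
As a rough illustration of the bag-of-words idea only (not the Bag of Words Stage itself), the sketch below counts how often each token occurs. The function name `bag_of_words` and the lowercasing step are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(tokens):
    """Return a mapping from lowercased token to its number of occurrences."""
    # Lowercasing is an assumption of this sketch, not documented stage behavior.
    return Counter(token.lower() for token in tokens)

vector = bag_of_words("the cat sat on the mat".split())
print(vector["the"])  # 2
```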

Case Analysis

Analyzes the case of every token and optionally adds a lower case normalized token to the interpretation graph.

Uses Case Analysis Stage
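
A minimal sketch of what case analysis can look like, assuming four hypothetical case labels ("upper", "lower", "title", "mixed"). The label names and both function names are illustrative and are not taken from the Case Analysis Stage.

```python
def analyze_case(token):
    """Classify the case pattern of a token (labels are hypothetical)."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token[:1].isupper() and token[1:].islower():
        return "title"
    return "mixed"

def lowercase_variant(token):
    """Return a lowercase-normalized token, or None if nothing would change."""
    lowered = token.lower()
    return lowered if lowered != token else None

print(analyze_case("Hello"), lowercase_variant("Hello"))  # title hello
```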

Character Change Splitter

Separates tokens at character changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. The split does not consume any character as a vertex, and a capital letter stays with the token it begins.

Uses Character Splitter Stage
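
The splitting rule described above can be sketched as follows. This is an illustration under assumptions, not the stage's implementation: the helper names are hypothetical, a number-to-letter change is also treated as a boundary, and each run of punctuation becomes its own piece.

```python
def char_class(ch):
    """Coarse character class used to detect changes."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return "punct"

def is_boundary(prev, ch):
    """True when a split should occur between prev and ch."""
    a, b = char_class(prev), char_class(ch)
    if a == b:
        return False
    # Keep a capital letter attached to the lowercase run it starts ("Case").
    if a == "upper" and b == "lower":
        return False
    return True

def split_on_character_change(token):
    """Split a token at class changes without dropping any character."""
    parts, current = [], ""
    for ch in token:
        if current and is_boundary(current[-1], ch):
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts

print(split_on_character_change("camelCase42-x"))
# ['camel', 'Case', '42', '-', 'x']
```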

Character Collapser

Creates a new token from the original but excluding the characters on the "Characters To Collapse" list.

Uses Character Collapser Stage
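
For illustration, collapsing can be sketched as dropping every listed character from the token. The function name `collapse_characters` is hypothetical; its second argument stands in for the "Characters To Collapse" list.

```python
def collapse_characters(token, characters_to_collapse):
    """Build a new token that omits every character in characters_to_collapse."""
    return "".join(ch for ch in token if ch not in characters_to_collapse)

print(collapse_characters("555-12.34", "-."))  # 5551234
```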

Character Splitter

Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits) when using the "Split as Vertex" setting.

Uses Character Splitter Stage
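
A minimal sketch of splitting on a configured set of characters, where a run of consecutive split characters produces a single split. The function name is hypothetical, and the sketch returns plain strings; it does not model vertices or the "Split as Vertex" setting.

```python
import re

def split_on_characters(token, split_chars):
    """Split token on any run of the given characters, dropping empty pieces."""
    if not split_chars:
        return [token] if token else []
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    pattern = "[" + re.escape(split_chars) + "]+"
    return [piece for piece in re.split(pattern, token) if piece]

print(split_on_characters("a,,b;c", ",;"))  # ['a', 'b', 'c']
```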

Language Detector

This processor uses Apache's OpenNLP Language Detector Model to identify and tag the original text with the tag format {LANG_<language_ISO>}.

This is a plugin processor. Uses Language Detector Stage.

Lemmatize

Lemmatizes tokens by matching them to words in a dictionary.

Uses Lemmatize Stage

N-Gram

Creates N-Grams from lexical items flagged as TOKEN. Minimum and maximum settings control the N-Gram size. N-Grams are also broken on SPLIT_FLAGS.

Uses NGram Stage
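
As a sketch of the minimum/maximum size settings, the function below builds every N-Gram whose length falls in the given range. The name `ngrams` and its parameters are hypothetical, and breaking on SPLIT_FLAGS is not modeled here.

```python
def ngrams(tokens, min_size, max_size):
    """Return all n-grams with min_size <= n <= max_size, shortest first."""
    result = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(tuple(tokens[start:start + size]))
    return result

print(ngrams(["a", "b", "c"], 1, 2))
# [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]
```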

Parts Of Speech

Tags each word in a text (corpus) with its part of speech, such as noun, verb, or adjective, based on its definition and its context. Uses the OpenNLP (https://opennlp.apache.org/) POS Tagger.

Uses Parts Of Speech Stage

Python Model

Connects directly to the Python Bridge to send text or sections of the interpretation graph to be processed by ML algorithms in Python.

Uses Python Model Stage

Quotation Breaker

Breaks TEXT_BLOCK tokens into smaller TEXT_BLOCK tokens, separating the unquoted text from the quoted text. This breaker respects the grammatical rules of quotes.

Uses Quotation Breaker Stage

Remove Accents

Identifies accents in tokens then creates a new token without the accents, using the most similar letter as a replacement.

Uses Remove Accents Stage
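
One common way to strip accents, shown here only as an illustration and not necessarily how the stage does it, is Unicode decomposition followed by removal of the combining marks. The function name is hypothetical. Letters that have no decomposition (for example "ø") are left unchanged by this approach.

```python
import unicodedata

def remove_accents(token):
    """Return the token with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents("café"))  # cafe
```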

Sentence Breaker (OpenNLP)

This processor splits text blocks into sentences by punctuation. It uses the Apache OpenNLP Sentence Detector to identify the punctuation marks that end a sentence.

This is a plugin processor. Uses Sentence Breaker Stage.

Sentence Breaker (Text Breaker)

Breaks a text block into sentences using Java's BreakIterator, which splits on punctuation delimiters. Delimiters can be configured as "breakers".

Uses Text Breaker Stage

Sentence Filter

This stage flags vertices with "SKIP_SENTENCE". The flag is placed on the vertex at the start of the sentence, so that a later stage can ignore the complete sentence.

Uses Sentence Filter Stage

The conditions evaluated by the processor are:

Sentence length, measured in tokens rather than vertices.
A list of tags that act as exceptions to the count: if one of these tags is found in the sentence, the count is ignored and the sentence is not flagged (allow listing).
A list of tags that cause the sentence to be flagged if one of them is found in it (deny listing).

Deny listing always takes precedence over the other conditions, so any sentence containing a deny-listed tag is flagged as "SKIP_SENTENCE". Allow-listed tags take precedence over the token limit. The token limit applies only when neither list matches.

Sentence Filter only flags the initial vertex of the sentence with "SKIP_SENTENCE"; it does not remove the sentence from the interpretation graph.
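
The precedence rules above can be sketched as a small decision function. All names here are hypothetical, and the sketch assumes the token limit is a maximum (sentences longer than the limit are flagged); the direction of the limit is not stated in this page.

```python
def should_skip_sentence(token_count, tags, max_tokens, allow_tags, deny_tags):
    """Decide whether a sentence would be flagged SKIP_SENTENCE."""
    found = set(tags)
    # 1. Deny-listed tags always win.
    if found & set(deny_tags):
        return True
    # 2. Allow-listed tags override the token limit.
    if found & set(allow_tags):
        return False
    # 3. Otherwise the token limit decides (assumed to be a maximum).
    return token_count > max_tokens

print(should_skip_sentence(50, {"person"}, 20, {"person"}, {"spam"}))  # False
```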

Stop Words

This stage flags tokens as stop words when they match any entry in the stop words dictionary. Subsequent stages skip the flagged tokens if they are configured to do so.

Uses Stop Words Stage
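
A minimal sketch of stop-word flagging: each token is paired with a boolean instead of being removed, so later steps can choose whether to skip it. The function name is hypothetical, and the case-insensitive comparison is an assumption of this example.

```python
def flag_stop_words(tokens, stop_words):
    """Pair each token with True when it matches a stop-word entry."""
    # stop_words is expected to hold lowercase entries (assumption).
    return [(token, token.lower() in stop_words) for token in tokens]

print(flag_stop_words(["The", "cat"], {"the", "a"}))
# [('The', True), ('cat', False)]
```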

Synonyms

Detects previously specified synonyms, then generates a normalized token with the synonym.

Uses Synonym Stage

Whitespace Tokenizer

Splits the text into separate tokens on any number of white spaces.

Uses Whitespace Tokenizer Stage
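
For illustration, Python's `str.split()` with no arguments behaves the same way: it splits on any run of whitespace and ignores leading and trailing whitespace. The wrapper name is hypothetical.

```python
def whitespace_tokenize(text):
    """Split text into tokens on any run of whitespace."""
    return text.split()

print(whitespace_tokenize("  the  quick\tbrown\nfox "))
# ['the', 'quick', 'brown', 'fox']
```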
