Creates a bag-of-words / TF-IDF tag holding the vector information for the document/text_block/sentence. The vector is accumulated until the engine cannot read any further.
Uses Bag of Words Stage
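As a rough illustration of what a bag-of-words vector with TF-IDF weighting contains, the following Python sketch counts terms per sentence and then weights them. It is not the stage's implementation: the function names and the smoothed IDF formula are assumptions chosen for the example.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs (the raw bag-of-words vector)."""
    return Counter(tokens)

def tf_idf(bags):
    """Weight each bag by term frequency times a smoothed inverse document frequency.

    The smoothing (log((1 + n) / (1 + df)) + 1) is an assumption for this sketch.
    """
    n = len(bags)
    # Number of bags in which each term appears at least once.
    doc_freq = Counter(term for bag in bags for term in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            term: (count / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in bag.items()
        })
    return weighted

sentences = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
bags = [bag_of_words(s) for s in sentences]
vectors = tf_idf(bags)
```

Terms that occur in every sentence (here "the") receive a lower weight than terms that are specific to one sentence (here "cat").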
Analyzes the case of every token and optionally adds a lower case normalized token to the interpretation graph.
Uses Case Analysis Stage
Separates tokens at character changes: lowercase to uppercase, letter to number, and alphanumeric to punctuation. No character is consumed by the vertex between the resulting tokens, and the capital letter is respected.
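The character-change rules above can be sketched in Python as follows. This is an illustration under assumptions, not the processor's code: letter-number and alphanumeric-punctuation changes are treated as boundaries in both directions, and "respecting the capital letter" is read as keeping a capital together with the lowercase letters that follow it. The function names are hypothetical.

```python
def char_class(ch):
    """Classify a character as digit, upper, lower, or punct (anything else)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return "punct"

def is_boundary(prev, curr):
    """Return True when a split should occur between two adjacent characters."""
    a, b = char_class(prev), char_class(curr)
    if a == b:
        return False
    # Assumption: an uppercase letter stays attached to the lowercase run after it.
    if a == "upper" and b == "lower":
        return False
    return True

def split_on_character_changes(token):
    """Split a token at every boundary; no character is dropped."""
    if not token:
        return []
    parts = [token[0]]
    for prev, curr in zip(token, token[1:]):
        if is_boundary(prev, curr):
            parts.append(curr)
        else:
            parts[-1] += curr
    return parts

print(split_on_character_changes("camelCase42-x"))
```

Joining the returned parts reproduces the original token, which reflects the rule that no character is taken by the vertex.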
Creates a new token from the original, excluding the characters in the "Characters To Collapse" list.
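A minimal Python sketch of the collapse behavior, assuming the original token is kept and the collapsed form is produced alongside it. The function name and the return shape are hypothetical.

```python
def collapse_characters(token, chars_to_collapse):
    """Return the original token and a copy without the listed characters."""
    collapsed = "".join(ch for ch in token if ch not in chars_to_collapse)
    return token, collapsed

original, collapsed = collapse_characters("e-mail", "-")
```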
Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits) when using the "Split as Vertex" setting.
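The "single split for consecutive split characters" behavior can be sketched in Python as below. This only illustrates the collapsing of runs; how the "Split as Vertex" setting represents the split in the interpretation graph is not modeled, and the function name is hypothetical.

```python
import re

def split_on_characters(token, split_chars):
    """Split a token on any run of the given characters.

    A run of several split characters counts as one split, so no empty
    tokens are produced between them.
    """
    if not split_chars:
        return [token]
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    pattern = "[" + re.escape(split_chars) + "]+"
    return [part for part in re.split(pattern, token) if part]

print(split_on_characters("alpha--beta,gamma", "-,"))
```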
This processor uses Apache's OpenNLP Language Detector Model to identify the language of the original text and tag it using the format {LANG_<language_ISO>}.
This is a plugin processor. Uses Language Detector Stage.
Lemmatizes tokens by matching them to words in a dictionary.
Uses Lemmatize Stage
Uses NGram Stage
Tags each word in a text (corpus) with its part of speech, such as noun, verb, or adjective, based on both its definition and its context. Uses OpenNLP (https://opennlp.apache.org/) and its POS Tagger.
Connects directly to the Python Bridge to send text or sections of the interpretation graph to be processed by ML algorithms in Python.
Uses Python Model Stage
Breaks TEXT_BLOCK tokens into other TEXT_BLOCK tokens, separating non-quoted text from quoted text. This breaker respects the grammatical rules of quotes.
Uses Remove Accents Stage
This processor splits text blocks into sentences. It uses Apache's OpenNLP Sentence Detector to identify the punctuation marks that define the end of a sentence.
This is a plugin processor. Uses Sentence Breaker Stage.
Breaks a text block into sentences using Java's BreakIterator, which splits on punctuation delimiters. The delimiters can be configured as "breakers".
Uses Text Breaker Stage
This stage flags vertices with “SKIP_SENTENCE”. The flagged vertex is the one at the start of the sentence. A later stage can use the flag to ignore the complete sentence.
The conditions evaluated by the processor are:
Deny-listed tags always take precedence over the other conditions, so any sentence with a deny-listed tag is always flagged as “SKIP_SENTENCE”. Allow-listed tags always take precedence over the token limit restriction. The token limit restriction is applied last.
Sentence Filter flags the initial vertex of the sentence with a "SKIP_SENTENCE" flag; it does not remove the sentence from the interpretation graph.
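The precedence described above (deny list first, then allow list, then token limit) can be sketched in Python as follows. This is an illustration only: the data shapes, the parameter names, and the reading of the token limit as a maximum are all assumptions, not the stage's configuration.

```python
SKIP_SENTENCE = "SKIP_SENTENCE"

def should_skip(tags, token_count, deny_list, allow_list, max_tokens):
    """Decide whether a sentence is flagged, applying the three conditions in order."""
    tags = set(tags)
    if tags & set(deny_list):
        return True   # a deny-listed tag always wins
    if tags & set(allow_list):
        return False  # an allow-listed tag overrides the token limit
    # Assumption: the token limit is an upper bound on sentence length.
    return token_count > max_tokens

def filter_sentence(sentence, deny_list, allow_list, max_tokens):
    """Flag the start of the sentence; the sentence itself is left in place."""
    if should_skip(sentence["tags"], len(sentence["tokens"]),
                   deny_list, allow_list, max_tokens):
        sentence["start_vertex_flags"].add(SKIP_SENTENCE)
    return sentence

sentence = {
    "tokens": ["buy", "now", "!"],
    "tags": {"advert", "product"},
    "start_vertex_flags": set(),
}
result = filter_sentence(sentence, deny_list={"advert"},
                         allow_list={"product"}, max_tokens=10)
```

In this example the sentence carries both a deny-listed and an allow-listed tag, so it is flagged, and its tokens remain untouched for any later stage that chooses to ignore the flag.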
This stage flags tokens as Stop-Words when they match any entry in the stop words dictionary. The flagged tokens will be skipped in subsequent stages (if so indicated in the configuration).
Uses Stop Words Stage
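A small Python sketch of flagging rather than deleting stop words, so that later steps can decide whether to skip them. Case-insensitive matching, the dictionary shape of a token, and the function names are assumptions made for the example.

```python
def flag_stop_words(tokens, stop_words):
    """Mark each token that matches the stop-word dictionary; nothing is removed."""
    # Assumption: matching ignores case.
    stop_words = {w.lower() for w in stop_words}
    return [{"text": t, "stop_word": t.lower() in stop_words} for t in tokens]

def visible_tokens(flagged, skip_stop_words=True):
    """Tokens a later step would see, depending on its skip setting."""
    return [t["text"] for t in flagged
            if not (skip_stop_words and t["stop_word"])]

flagged = flag_stop_words(["The", "quick", "fox"], {"the", "a"})
print(visible_tokens(flagged))         # stop words skipped
print(visible_tokens(flagged, False))  # stop words kept
```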
Detects previously specified synonyms, then generates a normalized token with the synonym.
Uses Synonym Stage
Splits the text into separate tokens on any number of whitespace characters.
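Whitespace tokenization of this kind can be sketched in Python as below. The function name and the choice to return character offsets with each token are assumptions for illustration.

```python
import re

def whitespace_tokenize(text):
    """Return (token, start, end) for every run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

tokens = whitespace_tokenize("one \t two\n\nthree")
```

Any run of spaces, tabs, or newlines between two tokens produces a single split, and leading or trailing whitespace produces no empty tokens.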