Pipeline Stages

Text Block Readers

Readers read text streams and create text blocks to process.

Tokenizers read text blocks and divide them up into individual tokens to be processed.

Splitters divide a token into multiple smaller tokens, which are offered as an alternative interpretation of the original token.

CharacterSplitter - Splits a token wherever any character from a specified set (typically punctuation) is encountered.
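
The splitting behavior can be illustrated with a minimal Python sketch. This is not the stage's actual implementation: the function name `split_token` and its `split_chars` parameter are hypothetical, and the sketch returns only the smaller tokens, whereas a real pipeline would presumably keep the original token alongside them.

```python
def split_token(token, split_chars):
    """Split `token` at every character found in `split_chars`.

    Hypothetical helper for illustration only. The split characters
    themselves are dropped, and empty pieces are skipped.
    """
    parts = []
    current = []
    for ch in token:
        if ch in split_chars:
            # A split character ends the piece collected so far.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Keep whatever remains after the last split character.
    if current:
        parts.append("".join(current))
    return parts
```

For example, `split_token("well-known", "-")` yields the two smaller tokens `well` and `known`, while a token containing none of the split characters comes back as a single piece.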

Normalizers create alternative normalized interpretations of tokens from original tokens.

CaseAnalysis - Analyzes the case of each token, flags it accordingly, and then (optionally) normalizes the token to lowercase.
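
A rough Python sketch of this kind of case analysis follows. The flag names (`ALL_UPPER`, `ALL_LOWER`, `TITLE`, `MIXED`, `NO_CASE`), the function name `analyze_case`, and the `to_lower` option are all assumptions made for illustration; the flags the actual stage emits are not listed here.

```python
def analyze_case(token, to_lower=True):
    """Return (case_flag, normalized_token) for a single token.

    Hypothetical sketch: the flag names are invented for this example.
    """
    if token.lower() == token.upper():
        # No cased characters at all (digits, punctuation, empty string).
        flag = "NO_CASE"
    elif token.isupper():
        flag = "ALL_UPPER"
    elif token.islower():
        flag = "ALL_LOWER"
    elif token[:1].isupper() and token[1:].islower():
        # First character upper case, the rest lower case.
        flag = "TITLE"
    else:
        flag = "MIXED"
    # Lowercase normalization is optional, mirroring the stage description.
    normalized = token.lower() if to_lower else token
    return flag, normalized
```

Returning the flag together with the normalized form keeps the information about the original casing available even after the token has been lowercased.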

Recognizers identify and flag tokens based on their character patterns.

NumberRecognizer Stage - Identifies tokens that look like numbers and flags them with the "NUMBER" flag.
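
Pattern-based recognition of this kind might be sketched in Python as below. The regular expression is an assumption about what "looks like a number" (optionally signed integers and simple decimals); the real stage may accept other forms. The function name `recognize_number` is hypothetical, and only the "NUMBER" flag comes from the description above.

```python
import re

# Assumed pattern: optional sign, then digits with an optional fractional
# part, or a bare fractional part such as ".5".
NUMBER_PATTERN = re.compile(r"[+-]?(\d+(\.\d+)?|\.\d+)")

def recognize_number(token):
    """Return the set of flags for `token`: {"NUMBER"} or an empty set."""
    # fullmatch requires the whole token to fit the pattern, so a token
    # such as "4a" is not flagged.
    if NUMBER_PATTERN.fullmatch(token):
        return {"NUMBER"}
    return set()
```

Returning a set of flags rather than a boolean leaves room for other recognizers to add their own flags to the same token.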