The following are the token processors currently available:
Tokenizers
- Punctuation Tokenizer - Tokenizes the input string on punctuation
- Tags Tokenizer - Tokenizes the input string by XML-style tags
- Whitespace Token Breaker - Tokenizes the input string on whitespace, recording the type of whitespace encountered
- Whitespace Tokenizer - Tokenizes the input string on whitespace (Lucene)
- Other Lucene tokenizers
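
As a rough illustration of what "recording the type of whitespace encountered" could mean for the Whitespace Token Breaker, the sketch below pairs each token with the kind of whitespace that preceded it. This is not the actual implementation: the function name `whitespace_token_breaker` and the labels `none`, `space`, `tab`, and `newline` are assumptions made for this example only.

```python
import re

def whitespace_token_breaker(text):
    """Split text on whitespace and label each token with the
    whitespace run that preceded it (hypothetical labels)."""
    tokens = []
    # Each match is an optional whitespace run followed by one token.
    for match in re.finditer(r"(\s*)(\S+)", text):
        gap, word = match.groups()
        if "\n" in gap:
            kind = "newline"
        elif "\t" in gap:
            kind = "tab"
        elif gap:
            kind = "space"
        else:
            kind = "none"  # token starts the input, no whitespace before it
        tokens.append((word, kind))
    return tokens
```

A later stage (for example a break determiner) could then use the recorded labels instead of re-scanning the raw text.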
Token Filters
- Acronym Combiner - Converts initialized acronyms (e.g. N.A.S.A.) into combined acronyms (e.g. NASA)
- Case Filter - Converts tokens to any of lower-, upper-, or title-case
- Case Recorder - Sets flags to keep track of the cases of terms
- Character Change Splitter - Divides tokens when the character type changes
- Contains Filter - Not yet implemented - Filters all tokens containing a string or regular expression
- Flags Filter - Removes tokens which have particular flags set and/or are matches
- HTML Entity Decoder - Converts entities (e.g. &gt;) into the actual characters (e.g. >)
- Lower Case Filter - Converts tokens to lower case
- Numbers Filter - Removes tokens which are entirely digits
- Paragraph Filter - Removes whole paragraphs, based on hash code
- Punctuation Filter - Removes tokens which are entirely punctuation
- Single Character Filter - Removes tokens that are only one character long
- Stop Words Filter - Removes tokens from a stop words list
- Tags Filter - Removes HTML/XML-type tags
- Token Length Filter - Not yet implemented - Filters all tokens less than a minimum length and/or greater than a maximum length
- Tokens And Pairs - Converts a token stream into a stream of tokens and token pairs
- Token Combinations - Converts a token stream into a stream of token pairs/triples/quads/etc.
- Type Filter - Removes all tokens of particular type(s), as specified by an Extractor
- Window Filter - Removes all tokens except those within particular windows
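
To make a few of the filters above more concrete, here is a minimal sketch of the Acronym Combiner, Tokens And Pairs, and Token Combinations behaviors as plain functions over lists of strings. The function names, the acronym pattern, and the order in which pairs are interleaved with single tokens are all assumptions for illustration; the real processors operate on token streams and may differ in detail.

```python
import re

# Assumed pattern: two or more single letters, each followed by a period.
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")

def combine_acronym(token):
    """Turn an initialized acronym such as N.A.S.A. into NASA."""
    if ACRONYM.fullmatch(token):
        return token.replace(".", "").upper()
    return token

def token_combinations(tokens, size):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

def tokens_and_pairs(tokens):
    """Return each token followed by the pair it starts, if any."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            out.append(tok + " " + tokens[i + 1])
    return out
```

For example, `tokens_and_pairs(["a", "b", "c"])` yields the single tokens interleaved with the pairs `a b` and `b c`, while `token_combinations(tokens, 3)` yields only the triples.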
Entity Extraction
- Acronym Recognizer - Flags acronyms derived from local definitions
- BNF Parser - A general Backus Normal Form pattern recognizer
- Breaks Determiner - Explicitly flags line, paragraph, and sentence breaks
- Company Name Extractor - Gathers full corporate entity names
- Date Recognizer - Flags dates
- Email Recognizer - Flags email addresses
- Extracted Data Collector - Gathers extracted terms from extractor(s)
- Extractor - Flags and collects tokens/phrases of interest
- Extractor (Redis-based) - Flags and collects tokens/phrases of interest
- Mangler - Imitates the BNF Parser's mangleTerms function, so that it can happen later in the pipeline
- Number Recognizer - Flags numbers
- Phone Number Recognizer - Flags phone numbers
- Tag Special Tokens - Flags a specific list of tokens, which often means tokens containing punctuation
- Tags Recognizer - Flags HTML/XML-type tags
- Time Recognizer - Flags times
- URL Recognizer - Flags URLs
Token Statistics
- Count Characters - Counts a variety of character types across one or more documents
- Count Tokens - Counts tokens across one or more documents
- Gather Token Statistics - Computes a variety of statistics on all unique tokens processed all at once
- Hash Code - Computes a hash signature for a block of text
- Token Docs Histogram - Computes a histogram which counts the number of documents containing each unique token
- Tokens Histogram - Computes a histogram which counts the number of occurrences of all unique tokens across all documents
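
The difference between the two histograms above is easy to miss, so the following sketch shows both side by side. It assumes each document is already a list of tokens; the function names are hypothetical and only the standard-library `collections.Counter` is used.

```python
from collections import Counter

def tokens_histogram(docs):
    """Count every occurrence of each token across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

def token_docs_histogram(docs):
    """Count the number of documents that contain each token."""
    counts = Counter()
    for doc in docs:
        # A set ensures a token is counted once per document.
        counts.update(set(doc))
    return counts
```

With the documents `[["a", "a", "b"], ["a", "c"]]`, the token `a` occurs three times overall but appears in only two documents, so the two histograms report different counts for it.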
Document Scoring
- Score Doc With Token Percentiles - Computes a total document score based on a sorted dictionary of "tokens of interest"
Miscellaneous
- Concatenate Tokens - Concatenates all tokens together into a single big string
- Display Tokens - Prints all information about every token to stdout