The following are the token processors currently available:
Tokenizers
Token Filters
- Acronym Combiner - Converts punctuated acronyms (e.g. N.A.S.A.) into combined acronyms (e.g. NASA)
- Case Filter - Converts tokens to any of lower-, upper-, or title-case
- Case Recorder - Sets flags to keep track of the cases of terms
- Character Change Splitter - Divides tokens when the character type changes
- Contains Filter - Not yet implemented - Filters all tokens containing a string or regular expression
- Flags Filter - Removes tokens which have particular flags set and/or match particular patterns
- HTML Entity Decoder - Converts entities (e.g. &gt;) into the actual characters (e.g. >)
- Lower Case Filter - Converts tokens to lower case
- Numbers Filter - Removes tokens which are entirely digits
- Paragraph Filter - Removes whole paragraphs, based on hash code
- Punctuation Filter - Removes tokens which are entirely punctuation
- Single Character Filter - Removes tokens that are only one character long
- Stop Words Filter - Removes tokens from a stop words list
- Tags Filter - Removes HTML/XML-type tags
- Token Length Filter - Not yet implemented - Filters all tokens less than a minimum length and/or greater than a maximum length
- Tokens And Pairs - Converts a token stream into a stream of tokens and token pairs
- Token Combinations - Converts a token stream into a stream of token pairs/triples/quads/etc.
- Type Filter - Removes all tokens of particular type(s), as specified by an Extractor
- Window Filter - Removes all tokens except those within particular windows
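Filters of this kind are typically chained, each consuming the token stream produced by the previous one. The following is a minimal sketch of that pattern, assuming tokens are plain strings; the function names are illustrative, not the library's actual API:

```python
import string

def lower_case_filter(tokens):
    # Converts tokens to lower case (cf. Lower Case Filter).
    return (t.lower() for t in tokens)

def punctuation_filter(tokens):
    # Removes tokens which are entirely punctuation (cf. Punctuation Filter).
    return (t for t in tokens if not all(c in string.punctuation for c in t))

def stop_words_filter(tokens, stop_words):
    # Removes tokens from a stop words list (cf. Stop Words Filter).
    return (t for t in tokens if t not in stop_words)

tokens = ["The", "quick", ",", "brown", "fox", "."]
stream = lower_case_filter(tokens)
stream = punctuation_filter(stream)
stream = stop_words_filter(stream, {"the", "a", "an"})
print(list(stream))  # ['quick', 'brown', 'fox']
```

Because each stage is a generator over the previous one, the whole chain runs in a single pass over the input.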
Token Statistics
- Count Characters - Counts a variety of character types across one or more documents
- Count Tokens - Counts tokens across one or more documents
- Gather Token Statistics - Computes a variety of statistics on all unique tokens in a single pass
- Hash Code - Computes a hash signature for a block of text
- Token Docs Histogram - Computes a histogram which counts the number of documents containing each unique token
- Tokens Histogram - Computes a histogram which counts the number of occurrences of all unique tokens across all documents
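The difference between the two histograms is whether a token is counted once per occurrence or once per document. A minimal sketch, assuming each document is a list of string tokens (the function names are illustrative):

```python
from collections import Counter

def tokens_histogram(documents):
    # Counts occurrences of each unique token across all documents.
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts

def token_docs_histogram(documents):
    # Counts the number of documents containing each unique token,
    # so repeats within one document count only once.
    counts = Counter()
    for doc in documents:
        counts.update(set(doc))
    return counts

docs = [["a", "b", "a"], ["b", "c"]]
print(tokens_histogram(docs))      # a: 2, b: 2, c: 1
print(token_docs_histogram(docs))  # a: 1, b: 2, c: 1
```

The document-count histogram is the quantity usually needed for document-frequency weighting (e.g. IDF), while the occurrence-count histogram gives raw term frequencies.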
Document Scoring
Miscellaneous