The following are the token processors currently available:
Tokenizers
Token Filters
- Acronym Combiner - Converts punctuated acronyms (e.g. N.A.S.A.) into combined acronyms (e.g. NASA)
- Case Filter - Converts tokens to any of lower-, upper-, or title-case
- Case Recorder - Sets flags to keep track of the cases of terms
- Character Change Splitter - Divides tokens when the character type changes
- Contains Filter - Not yet implemented - Filters all tokens containing a string or regular expression
- Flags Filter - Removes tokens which have particular flags set and/or match particular patterns
- HTML Entity Decoder - Converts entities (e.g. &gt;) into the actual characters (e.g. >)
- Lower Case Filter - Converts tokens to lower case
- Numbers Filter - Removes tokens which are entirely digits
- Paragraph Filter - Removes whole paragraphs, based on hash code
- Punctuation Filter - Removes tokens which are entirely punctuation
- Single Character Filter - Removes tokens that are only one character long
- Stop Words Filter - Removes tokens from a stop words list
- Tags Filter - Removes HTML/XML-type tags
- Token Length Filter - Not yet implemented - Filters all tokens less than a minimum length and/or greater than a maximum length
- Tokens And Pairs - Converts a token stream into a stream of tokens and token pairs
- Token Combinations - Converts a token stream into a stream of token pairs/triples/quads/etc.
- Type Filter - Removes all tokens of particular type(s), as specified by an Extractor
- Window Filter - Removes all tokens except those within particular windows
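Filters of this kind are typically chained, each consuming the token stream produced by the previous one. The following is a minimal sketch of that pattern, assuming tokens are plain strings; the function names are illustrative, not the library's actual API:

```python
import string

def lower_case_filter(tokens):
    # Converts tokens to lower case (cf. Lower Case Filter).
    return (t.lower() for t in tokens)

def punctuation_filter(tokens):
    # Removes tokens which are entirely punctuation (cf. Punctuation Filter).
    return (t for t in tokens if not all(c in string.punctuation for c in t))

def stop_words_filter(tokens, stop_words):
    # Removes tokens from a stop words list (cf. Stop Words Filter).
    return (t for t in tokens if t not in stop_words)

tokens = ["The", "quick", ",", "brown", "fox", "."]
stream = lower_case_filter(tokens)
stream = punctuation_filter(stream)
stream = stop_words_filter(stream, {"the", "a", "an"})
print(list(stream))  # ['quick', 'brown', 'fox']
```

Because each stage is a generator over the previous one, the whole chain runs in a single pass over the input.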
Token Statistics
- Count Characters - Counts a variety of character types across one or more documents
- Count Tokens - Counts tokens across one or more documents
- Gather Token Statistics - Computes a variety of statistics on all unique tokens in a single pass
- Hash Code - Computes a hash signature for a block of text
- Token Docs Histogram - Computes a histogram which counts the number of documents containing each unique token
- Tokens Histogram - Computes a histogram which counts the number of occurrences of all unique tokens across all documents
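The difference between the two histograms is whether a token is counted once per occurrence or once per document. A minimal sketch, assuming each document is a list of string tokens (the function names are illustrative):

```python
from collections import Counter

def tokens_histogram(documents):
    # Counts occurrences of each unique token across all documents.
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts

def token_docs_histogram(documents):
    # Counts the number of documents containing each unique token,
    # so repeats within one document count only once.
    counts = Counter()
    for doc in documents:
        counts.update(set(doc))
    return counts

docs = [["a", "b", "a"], ["b", "c"]]
print(tokens_histogram(docs))      # a: 2, b: 2, c: 1
print(token_docs_histogram(docs))  # a: 1, b: 2, c: 1
```

The document-count histogram is the quantity usually needed for document-frequency weighting (e.g. IDF), while the occurrence-count histogram gives raw term frequencies.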
Document Scoring
Miscellaneous