Splits text blocks into separate tokens on any run of whitespace characters.
Operates On: Lexical Items with the TEXT_BLOCK flag
Configuration Parameters
- skipFlags (string array, optional) - Flags to be skipped by this stage
- Tokens marked with any of these flags will be ignored by this stage, and no processing will be performed on them.
- requiredFlags (string array, optional)
- Tokens must have all of the specified flags in order to be processed.
- debug (boolean, optional)
- Enables all debug logging functionality of the stage, if any.
{
  "type": "WhitespaceTokenizer"
}
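The skipFlags/requiredFlags gating described above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function and parameter names are assumptions chosen to mirror the configuration parameters.

```python
def should_process(item_flags, skip_flags=None, required_flags=None):
    """Decide whether this stage processes a lexical item, based on its flags.

    Hypothetical sketch: mirrors the skipFlags/requiredFlags semantics above.
    """
    # An item carrying any of the skipFlags is ignored by the stage.
    if skip_flags and any(f in item_flags for f in skip_flags):
        return False
    # An item must carry every one of the requiredFlags to be processed.
    if required_flags and not all(f in item_flags for f in required_flags):
        return False
    return True
```

For example, an item flagged only with TEXT_BLOCK passes when requiredFlags is ["TEXT_BLOCK"], but is rejected as soon as it also carries one of the skipFlags.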
Output Flags
Lex-Item Flags:
- TOKEN - Identifies that the Lex-Items produced by this stage are tokens and not text blocks.
- ORIGINAL - Identifies that the Lex-Items produced by this stage are the original, as-written representation of every token (i.e. before any normalization).
- HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT.
- HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION.
Vertex Flags:
- ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, new-lines, carriage returns, etc.)
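The flag rules above can be illustrated with a short sketch. The helper names here are hypothetical; the real stage operates on a lexical graph rather than bare strings.

```python
import string

def classify_token(token):
    """Return the Lex-Item flags this stage would attach to a token (sketch)."""
    # Every produced Lex-Item is a TOKEN and the ORIGINAL, as-written form.
    flags = {"TOKEN", "ORIGINAL"}
    if any(c.isdigit() for c in token):
        flags.add("HAS_DIGIT")
    if any(c in string.punctuation for c in token):
        flags.add("HAS_PUNCTUATION")
    return flags

def is_all_whitespace(span):
    """True if a vertex span contains only whitespace (spaces, tabs, new-lines, ...)."""
    return len(span) > 0 and span.isspace()
```

For instance, "sentenceA." would be tagged HAS_PUNCTUATION (for the period) but not HAS_DIGIT, while a span of spaces between tokens would qualify for ALL_WHITESPACE.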
Example
V-----------[This is a sentenceA.]-----------V----------------------[ and this is a sentence with leading whitespace.]----------------------V
^--[This]--V--[is]--V--[a]--V--[sentenceA.]--^--[and]--V--[this]--V--[is]--V--[a]--V--[sentence]--V--[with]--V--[leading]--V--[whitespace.]--^
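The splitting shown in the diagram can be reproduced with a simple regular expression over the concatenated text blocks. This is a sketch of the token boundaries only; the actual stage also records vertices (including the ALL_WHITESPACE spans) in the lexical graph.

```python
import re

text = "This is a sentenceA. and this is a sentence with leading whitespace."

# Each maximal run of non-whitespace characters becomes one token,
# so any amount of whitespace between tokens is treated as a single boundary.
tokens = re.findall(r"\S+", text)
# tokens[:4] == ["This", "is", "a", "sentenceA."]
```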