WhitespaceTokenizer Stage

Created by Paul E. Nelson, last modified by Potter (Esteban Alvarado) on Oct 26, 2017

Splits text blocks into separate tokens on runs of one or more whitespace characters.

Operates On: Lexical Items with TEXT_BLOCK

Configuration

Example Configuration

{
 "type":"WhitespaceTokenizer"
}

Output Flags

Lex-Item Flags:

TOKEN - Identifies the Lex-Items produced by this stage as tokens rather than text blocks.
ORIGINAL - Identifies the Lex-Items produced by this stage as the original, as-written representation of each token (e.g. before normalization).
HAS_DIGIT - Tokens containing at least one digit character are tagged as HAS_DIGIT.
HAS_PUNCTUATION - Tokens containing at least one punctuation character are tagged as HAS_PUNCTUATION.
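
The flag rules above can be illustrated with a small Python sketch. This is not the stage's actual implementation: the function name token_flags is hypothetical, and treating "punctuation" as the ASCII set in string.punctuation is an assumption, since the page does not say which characters the stage counts as punctuation.

```python
import string
from typing import Set


def token_flags(token: str) -> Set[str]:
    """Return the Lex-Item flags a token would plausibly carry.

    Hypothetical helper for illustration only.
    """
    # Every token produced by the stage is tagged TOKEN and ORIGINAL.
    flags = {"TOKEN", "ORIGINAL"}
    # HAS_DIGIT: at least one digit character anywhere in the token.
    if any(ch.isdigit() for ch in token):
        flags.add("HAS_DIGIT")
    # HAS_PUNCTUATION: at least one punctuation character.
    # Assumption: ASCII punctuation only (string.punctuation).
    if any(ch in string.punctuation for ch in token):
        flags.add("HAS_PUNCTUATION")
    return flags
```

Under these assumptions, token_flags("sentenceA.") yields TOKEN, ORIGINAL and HAS_PUNCTUATION, while a plain word such as "This" yields only TOKEN and ORIGINAL.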

Vertex Flags:

ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, newlines, carriage returns, etc.).

Example

V-----------[This is a sentenceA.]-----------V----------------------[  and this is a sentence with leading whitespace.]----------------------V 
^--[This]--V--[is]--V--[a]--V--[sentenceA.]--^--[and]--V--[this]--V--[is]--V--[a]--V--[sentence]--V--[with]--V--[leading]--V--[whitespace.]--^
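
The splitting behavior shown in the example can be approximated in a few lines of Python. This is a minimal sketch and not the stage's own code: the function name whitespace_tokenize is hypothetical, and it assumes that "whitespace" means whatever Python's regular expression engine treats as whitespace. Each token is returned with its start and end character offsets, so the gaps between tokens correspond to the whitespace spanned by the vertices.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text_block: str) -> List[Tuple[str, int, int]]:
    """Split a text block on runs of whitespace.

    Returns (token, start, end) tuples. Leading, trailing and repeated
    whitespace never produces an empty token.
    """
    # \S+ matches each maximal run of non-whitespace characters.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text_block)]


# The two text blocks from the example above.
first = whitespace_tokenize("This is a sentenceA.")
second = whitespace_tokenize("  and this is a sentence with leading whitespace.")
print([tok for tok, _, _ in first])
print([tok for tok, _, _ in second])
```

Note that punctuation stays attached to the neighboring word ("sentenceA.", "whitespace."), as in the diagram, because only whitespace acts as a separator.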
