WhitespaceTokenizer Stage

Created by Paul E. Nelson, last modified by Potter (Esteban Alvarado) on Aug 21, 2017

Splits text blocks into separate tokens on runs of one or more whitespace characters.

Operates On: Lexical Items with TEXT_BLOCK

Configuration

Example Configuration

{
 "type":"WhitespaceTokenizer"
}

Output Flags

Lex-Item Flags:

TOKEN - Identifies that the Lex-Items produced by this stage are tokens and not text blocks.
ORIGINAL - Identifies that the Lex-Items produced by this stage are the original, as-written representation of every token (e.g. before normalization).

Vertex Flags:

ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, new-lines, carriage returns, etc.)
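
The ALL_WHITESPACE condition can be illustrated with a small Python sketch. This is not the stage's implementation: the function names `whitespace_spans` and `is_all_whitespace` are hypothetical, and the sketch assumes a vertex can be modeled as a plain character range within the text block.

```python
import re
from typing import List, Tuple


def whitespace_spans(text_block: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every maximal run of whitespace.

    Hypothetical helper: models each whitespace-only span as a
    half-open character range [start, end).
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\s+", text_block)]


def is_all_whitespace(text: str) -> bool:
    """True when the text is non-empty and every character is whitespace
    (spaces, tabs, new-lines, carriage returns, etc.)."""
    return len(text) > 0 and text.isspace()


if __name__ == "__main__":
    block = "  and this"
    for start, end in whitespace_spans(block):
        # Every span found this way satisfies the all-whitespace check.
        print(start, end, is_all_whitespace(block[start:end]))
```

Under this model, a span such as two leading spaces or a tab followed by a new-line would qualify, while any span containing a visible character would not.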

Example

V-----------[This is a sentenceA.]-----------V----------------------[  and this is a sentence with leading whitespace.]----------------------V 
^--[This]--V--[is]--V--[a]--V--[sentenceA.]--^--[and]--V--[this]--V--[is]--V--[a]--V--[sentence]--V--[with]--V--[leading]--V--[whitespace.]--^
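
The splitting shown in the diagram above can be approximated outside the pipeline with a short Python sketch. It only mimics the observable behavior (tokens are maximal runs of non-whitespace characters, and leading whitespace produces no empty token). The function name `whitespace_tokenize` and the offset tuples are assumptions made for illustration, not part of the stage's API.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text_block: str) -> List[Tuple[int, int, str]]:
    """Split a text block on any amount of whitespace.

    Returns (start, end, token) tuples, where start/end are character
    offsets into the original text block. Hypothetical helper, not the
    stage's actual code.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text_block)]


if __name__ == "__main__":
    blocks = [
        "This is a sentenceA.",
        "  and this is a sentence with leading whitespace.",
    ]
    for block in blocks:
        # Print only the token text for each text block.
        print([token for _, _, token in whitespace_tokenize(block)])
```

Note that punctuation stays attached to its neighbor ("sentenceA." and "whitespace." remain single tokens), matching the diagram, since only whitespace acts as a separator.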
