Splits text blocks into separate tokens at every run of one or more whitespace characters.

Operates On: Lexical Items with TEXT_BLOCK

	Generic Configuration Parameters

Example Configuration

{
 "type":"WhitespaceTokenizer"
}

Output Flags

Lex-Item Flags:

TOKEN - Identifies that the Lex-Items produced by this stage are tokens and not text blocks.
ORIGINAL - Identifies that the Lex-Items produced by this stage are the original, as-written representation of every token (e.g. before normalization).
HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT.
HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION.

Vertex Flags:

ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, newlines, carriage returns, etc.).

Example

V-----------[This is a sentenceA.]-----------V----------------------[  and this is a sentence with leading whitespace.]----------------------V 
^--[This]--V--[is]--V--[a]--V--[sentenceA.]--^--[and]--V--[this]--V--[is]--V--[a]--V--[sentence]--V--[with]--V--[leading]--V--[whitespace.]--^
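
The splitting and flagging shown above can be approximated in a few lines of Python. This is a minimal sketch for illustration only, not the stage's actual implementation: the function name `whitespace_tokenize`, the use of a Python set to hold flags, and the ASCII-only definition of punctuation are all assumptions, and the graph vertices (including the ALL_WHITESPACE vertex flag) are not modelled.

```python
import re
import string


def whitespace_tokenize(text_block):
    """Split a text block on runs of whitespace and flag each token.

    Hypothetical helper: returns a list of (token, flags) pairs, where
    flags is a set of flag names taken from the Output Flags section.
    """
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters, so any
    # amount of leading, trailing, or repeated whitespace is skipped.
    for match in re.finditer(r"\S+", text_block):
        token = match.group()
        # Every produced item is a token in its original, as-written form.
        flags = {"TOKEN", "ORIGINAL"}
        if any(ch.isdigit() for ch in token):
            flags.add("HAS_DIGIT")
        # Assumption: punctuation means ASCII punctuation only.
        if any(ch in string.punctuation for ch in token):
            flags.add("HAS_PUNCTUATION")
        tokens.append((token, flags))
    return tokens


if __name__ == "__main__":
    text = "  and this is a sentence with leading whitespace."
    for token, flags in whitespace_tokenize(text):
        print(token, sorted(flags))
```

Applied to the second text block from the example, the sketch yields the same eight tokens as the diagram, with only `whitespace.` carrying HAS_PUNCTUATION.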