Splits tokens on specified characters, typically punctuation.
Operates On: Lexical Items with TOKEN
Configuration Parameters
- splitChars (optional) - List of characters on which tokens are split.
  - A sequence of one or more of these characters is treated as a single split.
  - If not present, tokens are split on any sequence of punctuation.
- dontSplitChars (optional) - List of characters which will NOT be used to split tokens.
  - Typically used to list exceptions (characters that should not split tokens) when splitChars is not present.
  - These characters are treated as token characters: they stay in the token to which they belong.
- splitFlag (optional) - The flag to be put on the vertex between the two tokens.
  - Defaults to ALL_PUNCTUATION
Example: splitting on all punctuation, except periods.
With this setup, the token "SagaToolkit-1.0" will produce the following graph:
V-------[SagaToolkit-1.0]-------V
^---[SagaToolkit]--V--[1.0]----^
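The splitting rule described above can be sketched in a few lines of Python. This is only an illustration, not the stage's actual implementation: the function name `split_token` and its parameters are hypothetical, punctuation is approximated with ASCII `string.punctuation`, and it is assumed that dontSplitChars takes precedence when a character appears in both lists.

```python
import string

def split_token(token, split_chars=None, dont_split_chars=""):
    """Split a token on runs of split characters and return the sub-tokens.

    Hypothetical helper mirroring splitChars / dontSplitChars.
    """
    def is_split(ch):
        # Assumption: dontSplitChars wins over splitChars.
        if ch in dont_split_chars:
            return False
        if split_chars is not None:
            return ch in split_chars
        # splitChars absent: fall back to any (ASCII) punctuation.
        return ch in string.punctuation

    parts, current = [], []
    for ch in token:
        if is_split(ch):
            # A run of one or more split characters is a single split,
            # so only close the current sub-token if it is non-empty.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts

# All punctuation except periods, as in the example above.
print(split_token("SagaToolkit-1.0", dont_split_chars="."))
# prints ['SagaToolkit', '1.0']
```

The sketch returns only the sub-tokens. In the graph above, the original token is also kept as an alternative path between the same two vertices.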
(split blocks on two new-lines)
Flags
Lex-Item Flags:
- TEXT_BLOCK - Flags all text blocks produced by the SimpleReader
Vertex Flags:
- TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
- OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
  - The current maximum size of a text block is 64K characters.
  - Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
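As a rough illustration of how the two vertex flags relate, the sketch below splits text on two new-lines and falls back to an arbitrary cut when no split is found within the maximum block size. It is not the SimpleReader's code: the function name `split_text_blocks`, the `(block, flag)` return shape, and reading "64K" as 64 * 1024 characters are all assumptions made for the example.

```python
MAX_BLOCK_CHARS = 64 * 1024  # assumed reading of "64K characters"

def split_text_blocks(text, max_chars=MAX_BLOCK_CHARS):
    """Return (block, flag) pairs, where flag names the vertex after the block.

    Hypothetical helper; the final block gets a flag of None.
    """
    results = []
    pos = 0
    while pos < len(text):
        window = text[pos:pos + max_chars]
        cut = window.find("\n\n")
        if cut != -1:
            # Normal case: a blank line separates two text blocks.
            if cut > 0:
                results.append((window[:cut], "TEXT_BLOCK_SPLIT"))
            pos += cut + 2
        elif pos + max_chars < len(text):
            # A whole buffer was read without finding a split: cut arbitrarily.
            results.append((window, "OVERFLOW_SPLIT"))
            pos += max_chars
        else:
            # Remaining text fits in one buffer; it is the last block.
            results.append((window, None))
            pos = len(text)
    return results

print(split_text_blocks("first block\n\nsecond block"))
```

Passing a small `max_chars` makes the overflow path easy to see without building a 64K input.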