CharacterSplitter Stage

Splits tokens on specified characters, typically punctuation.

Operates On: Lexical Items with TOKEN

Configuration Parameters

splitChars (optional) - List of characters which should be used to split tokens.
- Note: one or more of these characters in sequence are treated as a single split.
- If not present, then tokens are split on any sequence of punctuation.
dontSplitChars (optional) - List of characters which will NOT be used to split tokens.
- Typically used to list exceptions to the default punctuation splitting when splitChars is not present.
- These characters are treated as token characters and stay in the token to which they belong.
splitFlag (optional) - The flag placed on the vertex between the two tokens produced by a split.
- Defaults to ALL_PUNCTUATION
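
The parameter rules above can be sketched in a few lines of Python. This is an illustration only, not the stage's actual implementation: the function name split_token is hypothetical, "punctuation" is assumed to mean ASCII punctuation (string.punctuation), and dontSplitChars is assumed to be subtracted from whichever split set is in effect.

```python
import re
import string


def split_token(token, split_chars=None, dont_split_chars=""):
    """Split a token on runs of split characters (illustrative sketch only)."""
    if split_chars is None:
        # splitChars not present: assume "punctuation" means ASCII punctuation.
        candidates = set(string.punctuation)
    else:
        candidates = set(split_chars)
    # Exceptions are token characters, so they never trigger a split.
    candidates -= set(dont_split_chars)
    if not candidates:
        return [token]
    # "[...]+" makes one or more split characters in sequence a single split.
    char_class = "".join(re.escape(c) for c in sorted(candidates))
    pattern = "[" + char_class + "]+"
    # Drop empty strings produced by leading or trailing split characters.
    return [part for part in re.split(pattern, token) if part]


print(split_token("SagaToolkit-1.0", dont_split_chars="."))
print(split_token("a--b", split_chars="-"))
```

The first call mirrors a configuration with only dontSplitChars set; the second shows a run of two split characters producing a single split.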

Example Configuration 1

{
 "dontSplitChars":"."
}

Splits on all punctuation, except periods.

For example, the token "SagaToolkit-1.0" produces the following graph:

V-------[SagaToolkit-1.0]-------V

 ^---[SagaToolkit]--V--[1.0]----^
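
The graph above can be modeled roughly as follows. This is a sketch under stated assumptions, not the toolkit's data model: vertices are numbered integers, edges are (from, to, text) tuples, and the function name add_split_path is hypothetical. It keeps the original token edge and adds a parallel path of split tokens, placing the split flag on each vertex between them.

```python
def add_split_path(token, parts, split_flag="ALL_PUNCTUATION"):
    """Return (vertices, edges) for a token plus its split alternative.

    vertices maps a vertex id to its flag (None when unflagged);
    edges is a list of (from_vertex, to_vertex, text) tuples.
    """
    # Vertex 0 starts the original token; the last vertex ends it.
    last = max(len(parts), 1)
    vertices = {0: None, last: None}
    edges = [(0, last, token)]  # the original token edge is kept
    if len(parts) > 1:
        for i, part in enumerate(parts):
            edges.append((i, i + 1, part))
            if i + 1 < last:
                # Interior vertex between two split tokens carries the flag.
                vertices[i + 1] = split_flag
    return vertices, edges


vertices, edges = add_split_path("SagaToolkit-1.0", ["SagaToolkit", "1.0"])
for edge in edges:
    print(edge)
print(vertices)
```

For the example token this yields three edges: the original one spanning both outer vertices, and two shorter ones meeting at a middle vertex flagged ALL_PUNCTUATION.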


TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
- The current maximum size of a text block is 64K characters.
- Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
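
The relationship between the two flags can be illustrated with a short sketch. It is not the actual reader code: the function name split_text_blocks is hypothetical, the block separator is assumed to be two consecutive new-lines, and 64K is assumed to mean 64 * 1024 characters.

```python
MAX_BLOCK_CHARS = 64 * 1024  # assumed reading of "64K characters"


def split_text_blocks(text, max_chars=MAX_BLOCK_CHARS, separator="\n\n"):
    """Return (block, flag) pairs; flag names the split that ended the block."""
    blocks = []
    pos = 0
    while pos < len(text):
        window_end = pos + max_chars
        # Look for a separator that lies fully inside the current buffer.
        cut = text.find(separator, pos, window_end)
        if cut != -1:
            blocks.append((text[pos:cut], "TEXT_BLOCK_SPLIT"))
            pos = cut + len(separator)
        elif window_end < len(text):
            # A full buffer was read without finding a split: cut arbitrarily.
            blocks.append((text[pos:window_end], "OVERFLOW_SPLIT"))
            pos = window_end
        else:
            # Remaining text fits in the buffer; no split follows it.
            blocks.append((text[pos:], None))
            pos = len(text)
    return blocks


print(split_text_blocks("first block\n\nsecond block"))
print(split_text_blocks("abcdefghij", max_chars=4))
```

The second call uses a deliberately tiny buffer so the overflow case is visible without a 64K input.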