Character Splitter

Splits tokens on specified characters, typically punctuation. When using the "Split as Vertex" setting, consecutive split characters produce a single split, not one split per character.

Uses Character Splitter Stage

Configuration

Split Flag - The flag to be put on the vertex between the two tokens.
- If missing, defaults to ALL_PUNCTUATION.

Split as Vertex

Split Characters - List of characters which should be used to split tokens.
- A new vertex will be created covering the split characters.
Don't Split Characters - List of characters which will NOT be used to split tokens.
- This is typically used to list exceptions when Split Characters is missing.
- These characters are included in the produced tokens.
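The "Split as Vertex" behavior can be pictured with a small standalone sketch. This is not the Character Splitter Stage's actual code: the function name `split_as_vertex` and the `(text, flag)` pair representation are assumptions made for illustration, and "Don't Split Characters" is not modeled. It only shows that a run of consecutive split characters yields one flagged piece, and that the flag falls back to ALL_PUNCTUATION when none is given.

```python
import re

def split_as_vertex(token, split_chars, split_flag="ALL_PUNCTUATION"):
    """Hypothetical helper: return (text, flag) pairs for one token.

    A run of consecutive split characters becomes a single piece carrying
    split_flag; all other pieces carry no flag (None).
    """
    if not split_chars:
        return [(token, None)]
    # One capture group per run of split characters, so re.split keeps them.
    pattern = "([" + re.escape(split_chars) + "]+)"
    pieces = []
    for part in re.split(pattern, token):
        if not part:
            continue  # re.split emits empty strings at the token edges
        flag = split_flag if part[0] in split_chars else None
        pieces.append((part, flag))
    return pieces

pieces = split_as_vertex("state--of/the-art", "-/")
```

In this sketch the double hyphen in "state--of/the-art" ends up as one flagged piece between "state" and "of", rather than two separate splits.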

Split In Between (Before/After) - Split occurs when split characters are located in the middle of the token text.

Split Before character - If any character in this list occurs inside a token, that token will be split just before that character.
Split After character - If any character in this list occurs inside a token, that token will be split just after that character.
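As a rough illustration of the before/after settings, the sketch below splits a token text into pieces. The function `split_in_between` and its parameters are hypothetical names, not part of the product, and "inside a token" is interpreted here as "not at the very start" for a before-split and "not at the very end" for an after-split, which is an assumption.

```python
def split_in_between(token, before="", after=""):
    """Hypothetical helper: split token text before/after listed characters."""
    parts = []
    current = ""
    last = len(token) - 1
    for i, ch in enumerate(token):
        # Split just before the character, unless it starts the token.
        if ch in before and i > 0 and current:
            parts.append(current)
            current = ""
        current += ch
        # Split just after the character, unless it ends the token.
        if ch in after and i < last:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts

before_example = split_in_between("price$100", before="$")
after_example = split_in_between("50%off", after="%")
```

With these inputs the split character stays attached to the piece that follows it ("$100") for a before-split, and to the piece that precedes it ("50%") for an after-split.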

Split As Prefix/Suffix - Split occurs if the split characters are located at the beginning (prefix) or the end (suffix) of the token text.

Split Prefix Characters - List of split characters that appear at the beginning of the token.
Split Suffix Characters - List of split characters that appear at the end of the token.
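The prefix/suffix settings can be sketched in the same style. The helper `split_prefix_suffix` is a hypothetical name, and the choice to peel a whole run of leading or trailing split characters as one piece is an assumption; the text above does not say whether each character would be split off individually.

```python
def split_prefix_suffix(token, prefix_chars="", suffix_chars=""):
    """Hypothetical helper: peel leading/trailing split characters off a token."""
    start = 0
    end = len(token)
    # Advance past leading characters that are listed as prefix characters.
    while start < end and token[start] in prefix_chars:
        start += 1
    # Step back over trailing characters that are listed as suffix characters.
    while end > start and token[end - 1] in suffix_chars:
        end -= 1
    parts = [token[:start], token[start:end], token[end:]]
    # Drop empty pieces so an unaffected token comes back unchanged.
    return [p for p in parts if p]

example = split_prefix_suffix("$100,", prefix_chars="$", suffix_chars=",")
```

Characters from these lists that occur in the middle of the token are left alone here; that case is covered by the Split In Between settings.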

General Settings

The general settings can be accessed by clicking on

Enable - Enables the processor to be used in the pipeline.
Skip Flags (optional) - Lexical item flags to be ignored by this processor.
Boundary Flags (optional) - List of vertex flags that indicate the beginning and end of a text block.
Required Flags (optional) - Lexical item flags required by every token to be processed.
At Least One Flags (optional) - List of lexical item flags where at least one of them needs to be present for the token to be processed.
Don't Process Flags (optional) - List of lexical item flags that are not processed. The difference from "Skip Flags" is that this setting drops the path in the Saga graph, whereas "Skip Flags" just skips the token and continues in the same path.
Confidence Adjustment - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every match).
- 0.0 to < 1.0 decreases confidence value
- 1.0 confidence value remains the same
- > 1.0 to 2.0 increases confidence value
Debug - Enable debug logging.
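The three Confidence Adjustment ranges listed above are consistent with a simple multiplicative factor. The sketch below assumes exactly that, and additionally assumes the result is capped at 1.0; neither the formula nor the cap is stated in this page, and `adjust_confidence` is a hypothetical name.

```python
def adjust_confidence(confidence, adjustment=1.0):
    """Hypothetical helper: scale a match confidence by the adjustment factor."""
    if not 0.0 <= adjustment <= 2.0:
        raise ValueError("adjustment must be between 0.0 and 2.0")
    # Assumption: the factor multiplies the confidence, capped at 1.0.
    return min(1.0, confidence * adjustment)

lowered = adjust_confidence(0.5, 0.5)    # factor below 1.0 decreases the value
unchanged = adjust_confidence(0.5, 1.0)  # factor of 1.0 leaves it as-is
raised = adjust_confidence(0.5, 1.5)     # factor above 1.0 increases the value
```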
