Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits).
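
The single-split rule for runs of split characters can be illustrated with a minimal Python sketch. This is not Saga's implementation; `split_on_runs` is a hypothetical helper that only mimics the described behavior with a regular expression.

```python
import re

def split_on_runs(token, split_chars):
    """Split token on runs of split_chars; a run yields one split, not several."""
    # Character class matching one or more consecutive split characters.
    pattern = "[" + re.escape(split_chars) + "]+"
    # Drop empty pieces caused by leading or trailing split characters.
    return [piece for piece in re.split(pattern, token) if piece]

print(split_on_runs("Abe--Lincoln", "-"))  # ['Abe', 'Lincoln']
```

Two dashes in a row produce the same two tokens as a single dash would.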

Operates On: Lexical Items with TOKEN and possibly other flags as specified below.

Recognizer: false

Configuration Parameters

splitChars (string, optional) - List of characters which should be used to split tokens.
- If not present, then tokens are split on any sequence of punctuation.
dontSplitChars (string, optional) - List of characters which will NOT be used to split tokens.
- This is typically used to identify exceptions (characters which are not used to split tokens) when splitChars is missing.
- These characters are included in the produced tokens.
splitBeforeChars - If any character in this list occurs inside a token, that token will be split just before that character.
splitAfterChars - If any character in this list occurs inside a token, that token will be split just after that character.
splitPrefixChars - true/false whether to split on all punctuation (default: true).
splitSuffixChars - true/false whether to split on all punctuation (default: true).
splitFlag (string, optional) - The flag to be put on the vertex between the two tokens.
- If missing, defaults to ALL_PUNCTUATION.
skipFlags (string array, optional) - Flags to be skipped by this stage.
- Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.
requiredFlags (string array, optional)
- Tokens need to have all the specified flags in order to be processed.
debug (boolean, optional)
- Enables all debug log functionality of the stage, if any.
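
The difference between splitting just before and just after a character can be sketched in Python. The helpers `split_before` and `split_after` are hypothetical names used for illustration only, and the sketch assumes the matched character stays in the token it is attached to; Saga's actual behavior may differ in detail.

```python
def split_before(token, chars):
    """Split token just before each character found in chars."""
    pieces, current = [], ""
    for ch in token:
        # Close the current piece before a matching character.
        if ch in chars and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def split_after(token, chars):
    """Split token just after each character found in chars."""
    pieces, current = [], ""
    for ch in token:
        current += ch
        # Close the current piece right after a matching character.
        if ch in chars:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

print(split_before("price$100", "$"))  # ['price', '$100']
print(split_after("50%off", "%"))      # ['50%', 'off']
```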

Examples

{
  "type":"CharacterSplitter",
  "dontSplitChars": ".",
  "splitChars":"-",
  "splitFlag":"DASH_SPLIT"
}

Example Output

Example Configuration 1

{
 "type":"CharacterSplitter",
 "dontSplitChars":"."
}

Splits on all punctuation except periods.

For example, the token "SagaToolkit-1.0" will produce the following graph:

V-------[SagaToolkit-1.0]-------V
^----[SagaToolkit]--V--[1.0]----^

Example Configuration 2

{
  "type":"CharacterSplitter",
  "splitChars":"-",
  "splitFlag":"DASH_SPLIT"
}

Splits tokens on dashes.
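
How a configuration like the one above might translate into tokens and vertex flags can be sketched in Python. This is an illustration under stated assumptions, not the stage's real code: `split_with_vertex_flags` is a hypothetical function, it only handles the `splitChars` and `splitFlag` parameters, and it models the graph as a plain list of tokens plus a list of flags for the vertices between them.

```python
import re

DEFAULT_SPLIT_FLAG = "ALL_PUNCTUATION"  # default named on this page

def split_with_vertex_flags(token, config):
    """Return (tokens, vertex_flags) for a config that sets splitChars."""
    split_chars = config["splitChars"]  # assumption: splitChars is present
    flag = config.get("splitFlag", DEFAULT_SPLIT_FLAG)
    pattern = "[" + re.escape(split_chars) + "]+"
    tokens = [piece for piece in re.split(pattern, token) if piece]
    # One flagged vertex sits between each pair of adjacent tokens.
    vertex_flags = [flag] * (len(tokens) - 1)
    return tokens, vertex_flags

config = {"type": "CharacterSplitter", "splitChars": "-", "splitFlag": "DASH_SPLIT"}
print(split_with_vertex_flags("Abe-Lincoln", config))
# (['Abe', 'Lincoln'], ['DASH_SPLIT'])
```

Leaving out `splitFlag` in this sketch falls back to ALL_PUNCTUATION, matching the default described for the parameter.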

Output Flags

Lex-Item Flags:

TOKEN - All tokens produced are tagged as TOKEN

Vertex Flags:

ALL_PUNCTUATION - Identifies the vertex as all punctuation.
- The default flag if no "splitFlag" is present.
<splitFlag> - Defines an alternative flag to ALL_PUNCTUATION, if desired (see above)

Example

...

V-----[Abe-Lincoln]-----V--[likes]--V--[the]--V-----[iPhone-*&@#*&7.0]-----V
^--[Abe]--V--[Lincoln]--^                     ^--[iPhone]--V--[7]--V--[0]--^

With Don't Split Param

{
	"type":"CharacterSplitter",
	"dontSplitChars": "."
}

V-----[Abe-Lincoln]-----V--[likes]--V--[the]--V--[iPhone-*&@#*&7.0]--V
^--[Abe]--V--[Lincoln]--^                     ^--[iPhone]--V--[7.0]--^

With Split Chars Param

{
	"type":"CharacterSplitter",
	"splitChars": "-#.",
	"dontSplitChars": "."
}

V-----[Abe-Lincoln]-----V--[likes]--V--[the]--V--------[iPhone-*&@#*&7.0]--------V
^--[Abe]--V--[Lincoln]--^                     ^--[iPhone]--V--[*&@]--V--[*&7.0]--^
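
The three outputs shown for the token "iPhone-*&@#*&7.0" can be reproduced with a small Python sketch. It is a simplified model rather than Saga's implementation: `split_token` is a hypothetical function, "punctuation" is assumed to mean ASCII punctuation (`string.punctuation`), and `dontSplitChars` is assumed to take precedence over `splitChars`, which is what the last graph suggests.

```python
import string

def split_token(token, split_chars=None, dont_split_chars=""):
    """Split token on runs of separator characters."""
    # Without splitChars, every (ASCII) punctuation character is a candidate.
    candidates = set(string.punctuation if split_chars is None else split_chars)
    # Characters listed in dontSplitChars never split and stay in the tokens.
    separators = candidates - set(dont_split_chars)
    pieces, current = [], ""
    for ch in token:
        if ch in separators:
            # A run of separators closes the current piece only once.
            if current:
                pieces.append(current)
            current = ""
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces

token = "iPhone-*&@#*&7.0"
print(split_token(token))                        # ['iPhone', '7', '0']
print(split_token(token, dont_split_chars="."))  # ['iPhone', '7.0']
print(split_token(token, split_chars="-#.", dont_split_chars="."))
# ['iPhone', '*&@', '*&7.0']
```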

Output Flags

Lex-Item Flags:

TOKEN - All tokens produced are tagged as TOKEN
ALL_PUNCTUATION - Tokens processed or produced composed only of punctuation characters are tagged as ALL_PUNCTUATION.
HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT.
HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION. (Tokens tagged as ALL_PUNCTUATION are not also tagged as HAS_PUNCTUATION.)
ALL_DIGITS - All characters in the token are digits.
HAS_LETTER - At least one character is a letter.
ALL_LETTERS - All characters in the token are letters.
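
The lex-item flags listed above can be approximated with a short Python sketch. `lex_item_flags` is a hypothetical helper, punctuation is assumed to mean ASCII punctuation, and the sketch assumes that all-digit and all-letter tokens also receive HAS_DIGIT and HAS_LETTER; the page only states the exclusion rule for HAS_PUNCTUATION.

```python
import string

def lex_item_flags(token):
    """Return the set of lex-item flags this sketch would assign to a token."""
    flags = {"TOKEN"}  # every produced token is tagged TOKEN
    digits = [ch.isdigit() for ch in token]
    letters = [ch.isalpha() for ch in token]
    # Assumption: punctuation means ASCII punctuation.
    puncts = [ch in string.punctuation for ch in token]
    if any(digits):
        flags.add("HAS_DIGIT")
    if any(letters):
        flags.add("HAS_LETTER")
    if token and all(digits):
        flags.add("ALL_DIGITS")
    if token and all(letters):
        flags.add("ALL_LETTERS")
    if token and all(puncts):
        flags.add("ALL_PUNCTUATION")
    elif any(puncts):
        # Only reached when the token is not entirely punctuation.
        flags.add("HAS_PUNCTUATION")
    return flags

print(sorted(lex_item_flags("7.0")))     # ['HAS_DIGIT', 'HAS_PUNCTUATION', 'TOKEN']
print(sorted(lex_item_flags("*&@")))     # ['ALL_PUNCTUATION', 'TOKEN']
print(sorted(lex_item_flags("iPhone")))  # ['ALL_LETTERS', 'HAS_LETTER', 'TOKEN']
```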

Vertex Flags:

If no flag is set on the "splitFlag" parameter:
- ALL_PUNCTUATION - Tokens processed or produced composed only of punctuation characters are tagged as ALL_PUNCTUATION.
- HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT.
- HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION. (Tokens tagged as ALL_PUNCTUATION are not also tagged as HAS_PUNCTUATION.)
- ALL_DIGITS - All characters in the token are digits.
- HAS_LETTER - At least one character is a letter.
- ALL_LETTERS - All characters in the token are letters.
Otherwise, no extra tag is added.
