Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits).

Operates On:  Lexical Items with TOKEN

...

  • splitChars (optional) - List of characters which should be used to split tokens. Note that one or more of these characters in sequence will produce a single split.
    • If not present, then tokens are split on any sequence of punctuation. 
  • dontSplitChars (optional) - List of characters which will NOT be used to split tokens.
    • This is typically used to identify exceptions (characters which are not used to split tokens) when splitChars is not present.
    • These characters are treated as token characters: they are kept in the token to which they belong and are included in the produced tokens.
  • splitFlag (optional) - The flag to be put on the vertex between the two tokens.
    • If missing, defaults to ALL_PUNCTUATION.

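The parameter semantics above can be sketched in Python. This is only an illustration under stated assumptions, not the stage's actual implementation: the function name `split_token` is hypothetical, "punctuation" is approximated by Python's ASCII-only `string.punctuation` (the stage's own definition may differ), and applying dontSplitChars even when splitChars is given is an assumption.

```python
import re
import string


def split_token(token, split_chars=None, dont_split_chars=""):
    """Split one token on runs of split characters (illustrative sketch)."""
    # Assumption: with no splitChars, any ASCII punctuation character splits.
    chars = split_chars if split_chars is not None else string.punctuation
    # dontSplitChars are exceptions: they remain inside the produced tokens.
    chars = "".join(c for c in chars if c not in dont_split_chars)
    if not chars:
        return [token]
    # The "+" makes a run of split characters act as a single split.
    pattern = "[" + re.escape(chars) + "]+"
    return [piece for piece in re.split(pattern, token) if piece]


print(split_token("a--b"))                                   # ['a', 'b']
print(split_token("3.14-2.71", dont_split_chars="."))        # ['3.14', '2.71']
print(split_token("well-known.example", split_chars="-"))    # ['well', 'known.example']
```

The first call shows that two dashes in a row yield one split rather than an empty token between them.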

Example Configuration 1
{
  "dontSplitChars":"."
}

Splits on all punctuation, except periods.

...

Example Configuration 2
{
  "splitChars":"-",
  "splitFlag":"DASH_SPLIT"
}

Splits tokens on dashes.

Flags

Lex-Item Flags:

  • TOKEN - All tokens produced are tagged as TOKEN.

Vertex Flags:

  • ALL_PUNCTUATION - Identifies the vertex as a split between tokens.
    • The default flag if no "splitFlag" is present.
  • <splitFlag> - Defines an alternative flag to ALL_PUNCTUATION, if desired (see above)
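How the flags attach to the output can be pictured with a small Python sketch. It is a simplified model, not the stage's real data structure: the function name `tokens_with_vertices` and the `("TOKEN", text)` / `("VERTEX", flag)` tuples are hypothetical, chosen only to show a flagged vertex sitting between each pair of produced tokens.

```python
import re

DEFAULT_FLAG = "ALL_PUNCTUATION"  # default vertex flag named in the text above


def tokens_with_vertices(token, split_chars, split_flag=None):
    """Return tokens interleaved with flagged vertices (illustrative sketch).

    split_chars must be a non-empty string of characters to split on.
    """
    flag = split_flag or DEFAULT_FLAG
    # A run of split characters counts as one split, hence the "+".
    pattern = "[" + re.escape(split_chars) + "]+"
    pieces = [p for p in re.split(pattern, token) if p]
    path = []
    for i, piece in enumerate(pieces):
        if i > 0:
            # The vertex between two tokens carries the split flag.
            path.append(("VERTEX", flag))
        path.append(("TOKEN", piece))
    return path


print(tokens_with_vertices("well-known", "-", "DASH_SPLIT"))
# [('TOKEN', 'well'), ('VERTEX', 'DASH_SPLIT'), ('TOKEN', 'known')]
```

Omitting the third argument marks the vertex with ALL_PUNCTUATION instead.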