Splits tokens on specified characters, typically punctuation.

Operates On:  Lexical Items with TOKEN

Configuration Parameters

  • splitChars (optional) - List of characters which should be used to split tokens.
    • Note: one or more of these characters in sequence are treated as a single split.
    • If not present, then tokens are split on any sequence of punctuation.
  • dontSplitChars (optional) - List of characters which will NOT be used to split tokens.
    • This is typically used to identify exceptions (characters which are not used to split tokens) when splitChars is not present.
    • Characters listed in dontSplitChars stay part of the token they belong to (they are treated as token characters).
  • splitFlag (optional) - The flag to be put on the vertex between the two tokens.
    • Defaults to ALL_PUNCTUATION.
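
The interaction between splitChars and dontSplitChars can be sketched in a few lines of Python. This is only an illustration of the rules listed above, not the stage's actual implementation: the function name split_token is hypothetical, "punctuation" is assumed to mean ASCII punctuation, and it is assumed that dontSplitChars also overrides splitChars when both are given.

```python
import re
import string


def split_token(token, split_chars=None, dont_split_chars=""):
    """Split a token on runs of split characters (illustrative sketch only)."""
    # Assumption: when splitChars is absent, "punctuation" means ASCII punctuation.
    candidates = split_chars if split_chars is not None else string.punctuation
    # Assumption: characters in dontSplitChars are never used as split points.
    active = [c for c in candidates if c not in dont_split_chars]
    if not active:
        return [token]
    # One or more split characters in sequence count as a single split.
    pattern = "[" + "".join(re.escape(c) for c in active) + "]+"
    return [part for part in re.split(pattern, token) if part]


print(split_token("SagaToolkit-1.0", dont_split_chars="."))
print(split_token("a--b", split_chars="-"))
```

With dontSplitChars set to ".", the period stays inside the "1.0" piece instead of producing a further split.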


Example Configuration 1
{
 "dontSplitChars":"."
}

Splits on all punctuation, except periods.

For example, the token "SagaToolkit-1.0" will produce the following graph:

V-------[SagaToolkit-1.0]-------V
 ^---[SagaToolkit]--V--[1.0]----^
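
The graph above keeps the original token as one edge and adds the split pieces as a parallel path through a new vertex. A minimal sketch of that shape, assuming vertices are numbered left to right and edges are plain (start, text, end) tuples; build_split_graph and this data layout are hypothetical and chosen only for illustration:

```python
def build_split_graph(token, parts, split_flag="ALL_PUNCTUATION"):
    """Return (edges, vertex_flags) for a token and its split pieces (sketch)."""
    last = max(len(parts), 1)  # index of the final vertex
    # The original token still spans the whole range.
    edges = [(0, token, last)]
    if len(parts) > 1:
        # Each piece becomes an edge between consecutive vertices.
        edges += [(i, part, i + 1) for i, part in enumerate(parts)]
    # Every interior vertex sits between two pieces and carries the split flag.
    vertex_flags = {i: split_flag for i in range(1, last)}
    return edges, vertex_flags


edges, flags = build_split_graph("SagaToolkit-1.0", ["SagaToolkit", "1.0"])
for edge in edges:
    print(edge)
print(flags)
```

Keeping the unsplit token as its own edge means later stages can match either the whole token or the individual pieces.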



(split blocks on two new-lines)

Flags

Lex-Item Flags:

  • TEXT_BLOCK - Flags all text blocks produced by the SimpleReader

Vertex Flags:

  • TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
  • OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
    • The current maximum size of a text block is 64K characters.
    • Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
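
The difference between the two vertex flags can be illustrated with a short Python sketch. It assumes that text blocks are separated by two newlines, that "64K characters" means 64 * 1024, and that an oversized block is cut exactly at the size limit; split_text_blocks is a hypothetical name, not part of the product.

```python
def split_text_blocks(text, max_chars=64 * 1024, separator="\n\n"):
    """Return (block, flag_after_block) pairs for a text (illustrative sketch)."""
    blocks = []
    pos = 0
    while pos < len(text):
        window_end = pos + max_chars
        # Look for a block separator inside the current buffer window.
        cut = text.find(separator, pos, window_end)
        if cut != -1:
            blocks.append((text[pos:cut], "TEXT_BLOCK_SPLIT"))
            pos = cut + len(separator)
        elif window_end >= len(text):
            # Last block: the text ends before the buffer fills up.
            blocks.append((text[pos:], None))
            pos = len(text)
        else:
            # Whole buffer read without finding a separator: forced split.
            blocks.append((text[pos:window_end], "OVERFLOW_SPLIT"))
            pos = window_end
    return blocks


print(split_text_blocks("first block\n\nsecond block"))
print(split_text_blocks("abcdefgh", max_chars=3))
```

The second call uses a tiny max_chars only to make the overflow case visible without a 64K input.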

