Splits tokens on specified characters, typically punctuation. Multiple split characters in a row will create a single split (not multiple splits) when using the "Split as Vertex" setting.

Uses Character Splitter Stage

Configuration

  • Split Flag - The flag to be put on the vertex between the two tokens.
    • If missing, defaults to ALL_PUNCTUATION.

Tokenize Tokens

Split as Vertex

  • Split Characters (Token Delimiters) - List of characters which should be used to split tokens.
    • If not present, tokens are split on any sequence of punctuation.
    • A new vertex will be created covering the split characters.
  • Don't Split Characters (Allowed Token Characters) - List of characters which will NOT be used to split tokens.
    • This is typically used to identify exceptions (characters which are not used to split tokens) when Split Characters (Token Delimiters) is missing.
    • These characters are included in the produced tokens.
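
The behavior described above can be illustrated with a minimal Python sketch. It is not the stage's implementation: the function name `split_as_vertex`, its parameters, and the `(kind, text)` output shape are hypothetical, "punctuation" is approximated with ASCII `string.punctuation`, and giving the exception list priority over the split list is an assumption.

```python
import string


def split_as_vertex(token, split_chars=None, dont_split=""):
    """Split `token` on runs of split characters (illustrative only).

    Returns a list of (kind, text) pairs, where kind is "token" for
    ordinary text and "split" for a run of split characters.
    """
    def is_split(ch):
        # Assumption: exception characters are never split on.
        if ch in dont_split:
            return False
        # No explicit list given: fall back to ASCII punctuation.
        if split_chars is None:
            return ch in string.punctuation
        return ch in split_chars

    pieces = []
    for ch in token:
        kind = "split" if is_split(ch) else "token"
        if pieces and pieces[-1][0] == kind:
            # Extend the current run, so consecutive split characters
            # produce a single split rather than several.
            pieces[-1] = (kind, pieces[-1][1] + ch)
        else:
            pieces.append((kind, ch))
    return pieces


print(split_as_vertex("foo--bar"))
# [('token', 'foo'), ('split', '--'), ('token', 'bar')]
print(split_as_vertex("well-known,word", dont_split="-"))
# [('token', 'well-known'), ('split', ','), ('token', 'word')]
```

In the second call the hyphen is listed as an exception, so it stays inside the produced token while the comma still causes a split.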

Split Characters In Between (Before/After) - Split occurs when the split characters are located in the middle of the token text.

  • Split Before character - If any character in this list occurs inside a token, that token will be split just before that character.
  • Split After character - If any character in this list occurs inside a token, that token will be split just after that character.
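
As a rough illustration of the before/after rules, the following Python sketch splits a token just before or just after the listed characters. The function name `split_before_after` and its parameters are hypothetical, and no empty pieces are produced when a listed character sits at the very start or end of the token (an assumption about edge cases the text does not cover).

```python
def split_before_after(token, before="", after=""):
    """Split `token` before/after the given characters (illustrative only)."""
    pieces = []
    current = ""
    for ch in token:
        # Split just before this character, unless nothing precedes it.
        if ch in before and current:
            pieces.append(current)
            current = ""
        current += ch
        # Split just after this character.
        if ch in after:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces


print(split_before_after("price$100", before="$"))  # ['price', '$100']
print(split_before_after("key:value", after=":"))   # ['key:', 'value']
```

Unlike the vertex-style split, the split character here stays attached to one of the resulting tokens.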

Split Characters As Prefix/Suffix - Split occurs if the split characters are located at the beginning (prefix) or the end (suffix) of the token text.

  • Split Prefix Characters - List of characters to split on when they appear at the beginning of the token.
  • Split Suffix Characters - List of characters to split on when they appear at the end of the token.
  • At start of token - true/false: whether to split on all punctuation at the start of the token (default: true).
  • At end of token - true/false: whether to split on all punctuation at the end of the token (default: true).
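
Prefix/suffix splitting can be sketched in Python as follows. The function name `split_prefix_suffix` and its parameters are hypothetical, and treating a run of leading or trailing split characters as one piece is an assumption made for illustration; the actual stage may emit them differently.

```python
def split_prefix_suffix(token, prefix_chars="", suffix_chars=""):
    """Peel split characters off the ends of `token` (illustrative only)."""
    # Advance past the leading run of prefix characters.
    start = 0
    while start < len(token) and token[start] in prefix_chars:
        start += 1
    # Step back over the trailing run of suffix characters.
    end = len(token)
    while end > start and token[end - 1] in suffix_chars:
        end -= 1
    pieces = [token[:start], token[start:end], token[end:]]
    # Drop empty pieces so tokens without a prefix or suffix are unchanged.
    return [p for p in pieces if p]


print(split_prefix_suffix('"hello!"', prefix_chars='"', suffix_chars='"!'))
# ['"', 'hello', '!"']
```

Characters in the middle of the token are left alone here; only the two ends are examined.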


General Settings

Generic Processor Config