This stage uses Apache Lucene™ to create custom pipelines beyond the default selection of pipelines. It offers a wide range of customization options and filters to adapt to the user's needs.

Configuration

  • Tokenizer ( type=string | default=None | required ) - Tokenizer to use for the pipeline (only one can be used at a time).
    • Each tokenizer changes the UI and has its own set of configuration options.
  • Filter ( type=string | default=None | optional ) - Filter to use for the pipeline (filters can be stacked); a minimal pipeline sketch follows this list.
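The configuration mirrors Lucene's own analysis chain: one tokenizer followed by zero or more stacked filters. Below is a minimal sketch of an equivalent chain built directly with Lucene's CustomAnalyzer; the field name, the sample text, and the particular tokenizer/filter choices ("standard", "lowercase", "asciifolding") are illustrative, not the stage's defaults.

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.custom.CustomAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class CustomPipelineSketch {
      public static void main(String[] args) throws Exception {
          // One tokenizer, followed by a stack of filters applied in order.
          Analyzer analyzer = CustomAnalyzer.builder()
                  .withTokenizer("standard", "maxTokenLength", "255")
                  .addTokenFilter("lowercase")
                  .addTokenFilter("asciifolding")
                  .build();

          // Run the pipeline over a sample text and print the resulting tokens.
          try (TokenStream ts = analyzer.tokenStream("body", "Café déjà-vu, naïve résumé")) {
              CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
              ts.reset();
              while (ts.incrementToken()) {
                  System.out.println(term);   // cafe, deja, vu, naive, resume
              }
              ts.end();
          }
      }
  }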


Tokenizers Available

  • Chinese - Tokenizer for Chinese.
  • Classic - The same as the Standard tokenizer, but it does not use the Unicode standard for word boundaries.
    • It has a maxTokenLength (integer) limit that can be customized.
  • EdgeNGram - Same as Ngram but from the beginning (front) of the text or from the end (back).
    • Uses min and max gram size.
    • It has a side ("front", "back") configuration that can be customized.
  • Japanese - Tokenizer for Japanese that uses morphological analysis.
  • JapaneseSen - Tokenizer for Japanese with Kanji support.
  • Keyword - This tokenizer treats the entire text field as a single token.
  • Korean - Tokenizer for Korean; it has different options to adjust how bigrams are split.
  • Letter - This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
  • Ngram - Reads the field text and generates n-gram tokens (see the sketch after this list).
    • It has a minGramSize (integer) configuration that can be customized.
    • It has a maxGramSize (integer) configuration that can be customized.
  • PathHierarchy - This tokenizer creates synonyms from file path hierarchies.
    • It has a delimiter (character) configuration that can be customized.
    • It has a replace (character) configuration that can be customized.
  • Standard - This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
    • It has a maxTokenLength (integer) limit that can be customized.
  • Thai - Tokenizer that uses a BreakIterator to tokenize Thai text.
  • Uax29UrlEmail - This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Supports Unicode standard annex UAX#29 word boundaries.
    • It has a maxTokenLength (integer) limit that can be customized.
  • Whitespace - Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokens.
    • It has a rule ("java" or "unicode") configuration that can be customized.
  • Wikipedia - Extension of Standard Tokenizer that is aware of Wikipedia syntax.
    • Refer to the tokenizer's documentation to see all of its configurations.
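As a concrete illustration of the Ngram settings above, here is a minimal sketch using Lucene's NGramTokenizer directly with minGramSize=2 and maxGramSize=3; the sample text is illustrative.

  import java.io.StringReader;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.ngram.NGramTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class NGramTokenizerSketch {
      public static void main(String[] args) throws Exception {
          // minGramSize = 2, maxGramSize = 3
          Tokenizer tokenizer = new NGramTokenizer(2, 3);
          tokenizer.setReader(new StringReader("saga"));

          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
          tokenizer.reset();
          while (tokenizer.incrementToken()) {
              System.out.print(term + " ");   // sa sag ag aga ga
          }
          tokenizer.end();
          tokenizer.close();
      }
  }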

Filters Available

  • Normalization and/or stemming filters are available for these languages:
    • Arabic
    • Bengali
    • Brazilian
    • Bulgarian
    • Czech
    • Chinese 
    • English
    • Finnish
    • French
    • Galician
    • German
    • Greek
    • Hindi
    • Hungarian
    • Indic
    • Indonesian
    • Italian
    • Korean 
    • Latvian
    • Norwegian
    • Persian
    • Portuguese
    • Russian
    • Scandinavian
    • Serbian
    • Sorani
    • Spanish
    • Swedish
    • Turkish
  • Other filters include:
    • Apostrophe - Strips all characters after an apostrophe, including the apostrophe itself.
    • ASCII Folding - This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.
    • Classic - This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
    • Keep Word - This filter discards all tokens except those that are listed in the given word list.
    • Length Filter - This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
    • And many more filters; a minimal filter-stacking sketch follows this list.
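Here is a minimal sketch of filter stacking, combining the ASCII Folding and Length filters on top of a Whitespace tokenizer using Lucene's classes directly; the sample text and the 3-10 length limits are illustrative.

  import java.io.StringReader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
  import org.apache.lucene.analysis.miscellaneous.LengthFilter;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class FilterStackSketch {
      public static void main(String[] args) throws Exception {
          Tokenizer source = new WhitespaceTokenizer();
          source.setReader(new StringReader("résumé a naïve touché ok"));

          // Stack filters: fold accented characters to ASCII, then keep only tokens of length 3-10.
          TokenStream stream = new ASCIIFoldingFilter(source);
          stream = new LengthFilter(stream, 3, 10);

          CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
          stream.reset();
          while (stream.incrementToken()) {
              System.out.println(term);   // resume, naive, touche
          }
          stream.end();
          stream.close();
      }
  }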

General Settings

The general settings can be accessed by clicking on the processor's settings icon.


  • Enable - Enables the processor to be used in the pipeline.
  • Skip Flags ( optional ) - Lexical item flags to be ignored by this processor.
  • Boundary Flags ( optional ) - List of vertex flags that indicate the beginning and end of a text block.
  • Required Flags ( optional ) - Lexical item flags that every token must have in order to be processed.
  • At Least One Flags ( optional ) - List of lexical item flags where at least one of them must be present for a token to be processed.
  • Don't Process Flags ( optional ) - List of lexical item flags that are not processed. The difference from "Skip Flags" is that this drops the path in the Saga graph, whereas skip only skips the token and continues along the same path.
  • Confidence Adjustment - Adjustment factor, from 0.0 to 2.0, applied to the confidence value of every match (a small arithmetic sketch follows this list).
    • 0.0 to < 1.0 decreases the confidence value
    • 1.0 leaves the confidence value unchanged
    • > 1.0 to 2.0 increases the confidence value
  • Debug - Enable debug logging.
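Below is a small arithmetic sketch of the assumed behavior of Confidence Adjustment: the factor simply multiplies each match's confidence. The helper method and any capping of values above 1.0 are assumptions for illustration, not taken from the product.

  public class ConfidenceAdjustmentSketch {

      // Hypothetical helper: multiplies a match confidence by the adjustment factor (0.0 to 2.0).
      static double adjust(double confidence, double factor) {
          return confidence * factor;
      }

      public static void main(String[] args) {
          System.out.println(adjust(0.5, 0.5)); // 0.25 - factor < 1.0 decreases confidence
          System.out.println(adjust(0.5, 1.0)); // 0.5  - factor = 1.0 leaves it unchanged
          System.out.println(adjust(0.5, 1.5)); // 0.75 - factor > 1.0 increases it (results above 1.0 may be capped by the engine)
      }
  }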
