This stage uses Apache Lucene™ to create custom pipelines beyond the default selection of pipelines. It offers a wide range of customization options and filters to adapt to the user's needs.

Info

Uses Lucene Pipeline Stage

Configuration

  • Tokenizer (required) - Tokenizer to use for the pipeline (only one can be used at a time). Default: None.
    • Each Tokenizer can change the UI and has its own set of configurations.
  • Filter (optional) - Filter to use for the pipeline; filters can be stacked. Default: None. See the sketch after this list for how a tokenizer and stacked filters combine.
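
This page does not show the stage's own configuration syntax, so as a rough illustration of how one tokenizer combines with a stack of filters, here is a minimal Java sketch using Lucene's CustomAnalyzer builder. The chosen factories ("standard", "lowercase", "asciiFolding"), the maxTokenLength value, and the sample text are illustrative assumptions, not settings taken from this page.

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PipelineSketch {
        public static void main(String[] args) throws IOException {
            // One tokenizer (single, required), followed by a stack of filters.
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withTokenizer("standard", "maxTokenLength", "255")
                    .addTokenFilter("lowercase")
                    .addTokenFilter("asciiFolding")
                    .build();

            try (TokenStream stream = analyzer.tokenStream("body", "Café au lait, s'il vous plaît!")) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term); // cafe, au, lait, s'il, vous, plait
                }
                stream.end();
            }
        }
    }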


Tokenizers Available

  • Chinese - Tokenizer for Chinese.
  • Classic - The same as the Standard Tokenizer, but does not use the Unicode standard annex UAX#29 word boundary rules.
    • It has a maxTokenLength (integer) limit that can be customized.
  • EdgeNGram - Same as Ngram, but generates grams only from the beginning (front) or the end (back) of the text.
    • Uses min and max gram size.
    • It has a side ("front", "back") configuration that can be customized.
  • Japanese - Tokenizer for Japanese that uses morphological analysis.
  • JapaneseSen - Tokenizer for Japanese that handles Kanji.
  • Keyword - This tokenizer treats the entire text field as a single token.
  • Korean - Tokenizer for Korean; it offers several options to adjust how bigrams are split.
  • Letter - This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
  • Ngram - Reads the field text and generates n-gram tokens (see the sketch after this list).
    • It has a minGramSize (integer) configuration that can be customized.
    • It has a maxGramSize (integer) configuration that can be customized.
  • PathHierarchy - This tokenizer creates synonyms from file path hierarchies.
    • It has a delimiter (character) configuration that can be customized.
    • It has a replace (character) configuration that can be customized.
  • Standard - This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
    • It has a maxTokenLength (integer) limit that can be customized.
  • Thai - Tokenizer that uses a BreakIterator to tokenize Thai text.
  • Uax29UrlEmail - Same as the Standard Tokenizer, but keeps URLs and e-mail addresses as single tokens. Supports Unicode standard annex UAX#29 word boundaries.
    • It has a maxTokenLength (integer) limit that can be customized.
  • Whitespace - Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokens.
    • It has a rule ("java" or "unicode") configuration that can be customized.
  • Wikipedia - Extension of Standard Tokenizer that is aware of Wikipedia syntax.
    • Refer to this page to see all configurations.
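
As an example of the gram-size settings named above, here is a minimal Java sketch assuming Lucene's NGramTokenizer with minGramSize=2 and maxGramSize=3; the input string is an arbitrary choice.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class NGramSketch {
        public static void main(String[] args) throws IOException {
            // minGramSize = 2, maxGramSize = 3
            try (NGramTokenizer tokenizer = new NGramTokenizer(2, 3)) {
                tokenizer.setReader(new StringReader("hey"));
                CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    System.out.println(term); // he, hey, ey
                }
                tokenizer.end();
            }
        }
    }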

Filters Available

  • Normalization and/or stemming filters are available for these languages:
    • Arabic
    • Bengali
    • Brazilian
    • Bulgarian
    • Czech
    • Chinese 
    • English
    • Finnish
    • French
    • Galician
    • German
    • Greek
    • Hindi
    • Hungarian
    • Indic
    • Indonesian
    • Italian
    • Korean 
    • Latvian
    • Norwegian
    • Persian
    • Portuguese
    • Russian
    • Scandinavian
    • Serbian
    • Sorani
    • Spanish
    • Swedish
    • Turkish
  • Other filters include:
    • Apostrophe - Strips all characters after an apostrophe, including the apostrophe itself.
    • ASCII Folding - This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 128 ASCII characters) to their ASCII equivalents, if one exists.
    • Classic - This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
    • Keep Word - This filter discards all tokens except those that are listed in the given word list.
    • Length Filter - This filter passes tokens whose length falls within the specified min/max limits and discards all others (see the sketch after this list).
    • Many more filters are available.
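
To make the filter behavior concrete, here is a minimal Java sketch of the Length Filter, assuming Lucene's LengthFilter wrapped around a WhitespaceTokenizer; the bounds (4 and 10) and the input text are arbitrary examples.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LengthFilterSketch {
        public static void main(String[] args) throws IOException {
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader("a quick brown fox"));
            // Pass tokens of 4 to 10 characters; discard the rest.
            try (TokenStream stream = new LengthFilter(tokenizer, 4, 10)) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term); // quick, brown
                }
                stream.end();
            }
        }
    }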

General Settings

The general settings for this stage are described in Generic Processor Config.