Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc.
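
As a rough illustration of the lookup described above, the following Python sketch scans a token list for dictionary patterns and emits a tag for each match. The dictionary contents, the `tag_sequences` function, and the `(start, end, tag)` result shape are assumptions made for this example; they are not the recognizer's actual data model or API.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dictionary: token-sequence pattern -> semantic tags.
DICTIONARY: Dict[Tuple[str, ...], Set[str]] = {
    ("new", "york"): {"place"},
    ("abraham", "lincoln"): {"person"},
    ("lincoln",): {"person", "place"},
}


def tag_sequences(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for every dictionary pattern found in tokens.

    `end` is exclusive. Matching is case-insensitive (an assumption).
    """
    lowered = [t.lower() for t in tokens]
    max_len = max(len(pattern) for pattern in DICTIONARY)
    matches = []
    for start in range(len(lowered)):
        # Try every candidate sequence starting here, shortest first.
        for end in range(start + 1, min(start + max_len, len(lowered)) + 1):
            tags = DICTIONARY.get(tuple(lowered[start:end]), set())
            for tag in sorted(tags):
                matches.append((start, end, tag))
    return matches


matches = tag_sequences(["Abraham", "Lincoln", "visited", "New", "York"])
for start, end, tag in matches:
    print(start, end, "{" + tag + "}")
```

Overlapping matches are all kept, which mirrors the idea of tags being alternative representations of the same tokens rather than replacements for them.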

Configuration

Note

If "Use Spellchecking" is checked, additional configuration options appear.

  • Ignore Tags ( type=string array | optional ) - List of tags to be ignored by the recognizer
  • Remove Accents & Diacritics ( type=boolean | default=unchecked | optional ) - Replace characters such as "á, ö, ç" with their normalized forms "a, o, c"
  • Remove Characters ( type=boolean | default=unchecked | optional ) - Replace the characters specified with white space
    • Characters to remove ( type=string | default=_-‿⁀⁔︳︴﹍﹎﹏_ | optional ) - Will only be used if Remove Characters is checked
  • Tag Only If Total Match ( type=boolean | default=false | optional ) - Tag a match only if it meets the Total Match Threshold.
    • Total Match Threshold ( type=double | default=1.0 | optional ) - Threshold on how much of the text the match must cover.
  • Use Spellchecking ( type=boolean | default=unchecked | optional ) - Activates spellchecking to recognize misspelled entities.
    • Spellchecking score threshold ( type=double | default=0.7 | required ) - Similarity score for misspelled texts.
    • Max of Suggestions ( type=integer | default=5 | required ) - Maximum number of suggestions for the misspelled element.
    • Spellchecking Algorithm ( type=string | default=Levenshtein | required ) - The spellchecking algorithm to use.
      • The available algorithms are:
        • Levenshtein - The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
        • JaroWinkler - The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. The lower the Jaro–Winkler distance for two strings is, the more similar the strings are.
        • LuceneLevenshtein - The same Levenshtein algorithm but using Lucene.
        • nGram - An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application.
        • Soundex - Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
  • Vector Cosine Similarity Acceptance ( type=double | default=0.7 | optional ) - Cosine factor for accepting entities.
  • Language Tokenizer ( type=string | default=Latin-script Alphabet | optional ) - Internal tokenizer of the recognizer
    • The available tokenizers are:
      • Latin-script alphabet
      • Korean
      • Japanese
      • Chinese
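
To make the normalization and spellchecking options above more concrete, here is a minimal Python sketch. It assumes that accent removal works by Unicode decomposition, and that the spellchecking score is a Levenshtein distance normalized to the 0.0 to 1.0 range; the recognizer's real scoring may differ. All function names (`normalize`, `levenshtein`, `similarity`, `suggest`) are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def normalize(text: str, remove_accents: bool = False,
              chars_to_remove: str = "") -> str:
    """Mimic "Remove Accents & Diacritics" and "Remove Characters"."""
    if remove_accents:
        # Decompose characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    for ch in chars_to_remove:
        text = text.replace(ch, " ")  # replaced with white space, not deleted
    return text


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 1.0 for identical strings, lower as edits increase."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(word: str, vocabulary: List[str], threshold: float = 0.7,
            max_suggestions: int = 5) -> List[Tuple[str, float]]:
    """Keep candidates scoring at or above threshold, best first."""
    scored = [(similarity(word, candidate), candidate) for candidate in vocabulary]
    kept = [(score, candidate) for score, candidate in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(candidate, score) for score, candidate in kept[:max_suggestions]]


print(normalize("São_Paulo", remove_accents=True, chars_to_remove="_-"))
print(suggest("microsft", ["microsoft", "minecraft", "apple"]))
```

The `threshold` and `max_suggestions` defaults mirror the "Spellchecking score threshold" and "Max of Suggestions" defaults listed above.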

Adding an Entity

Click the button to open the "Add new Entity" dialog.


  • Patterns ( type=string array | required ) - Patterns to look for
  • ID ( type=string | default=autogenerated | required ) - ID assigned to the set of patterns
    • Normally used to match an ID in a database or a key for an API
  • Display ( type=string | required ) - Display text used for normalization
  • Confidence Adjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
    • 0.0 to < 1.0 decreases the confidence value
    • 1.0 leaves the confidence value unchanged
    • > 1.0 to 2.0 increases the confidence value
  • The final field is for additional custom configurations.
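
The entity fields above can be pictured as a small record. The Python sketch below is only an illustration: the `Entity` class, its field names, and the `adjust_confidence` function are hypothetical, and the way the adjustment is applied (multiplying the confidence and clamping the result to the 0.0 to 1.0 range) is an assumption, since this page does not state the formula.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Hypothetical record mirroring the "Add new Entity" dialog fields."""
    patterns: List[str]                 # required: patterns to look for
    display: str                        # required: text used for normalization
    id: str = field(default_factory=lambda: uuid.uuid4().hex)  # autogenerated
    confidence_adjustment: float = 1.0  # allowed range: 0.0 to 2.0

    def __post_init__(self) -> None:
        if not self.patterns:
            raise ValueError("at least one pattern is required")
        if not 0.0 <= self.confidence_adjustment <= 2.0:
            raise ValueError("confidence_adjustment must be between 0.0 and 2.0")


def adjust_confidence(confidence: float, adjustment: float) -> float:
    """Assumed rule: scale the confidence, then clamp it to [0.0, 1.0]."""
    return max(0.0, min(1.0, confidence * adjustment))


entity = Entity(patterns=["NYC", "New York City"], display="New York",
                id="city-001", confidence_adjustment=0.5)
print(entity.display, adjust_confidence(0.8, entity.confidence_adjustment))
```

Under this assumed rule, a factor below 1.0 lowers the confidence, 1.0 leaves it unchanged, and a factor above 1.0 raises it, matching the three ranges listed above.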

General Settings

The general settings can be accessed by clicking on

More settings may be displayed in the same dialog, depending on the recognizer.


  • Enable - Enable the processor to be used in pipelines.
  • Base Pipeline - Indicates the last pipeline stage needed by the recognizer.
  • Skip Flags ( optional ) - Lexical item flags to be ignored by this processor.
  • Boundary Flags ( optional ) - List of vertex flags that indicate the beginning and end of a text block.
  • Required Flags ( optional ) - Lexical item flags that every token must have to be processed.
  • At Least One Flag ( optional ) - Lexical item flags of which every token must have at least one to be processed.
  • Don't Process Flags ( optional ) - List of lexical item flags that are not processed. Unlike "Skip Flags", which skips the token and continues along the same path, this drops the path in the Saga graph.
  • Confidence Adjustment - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every match).
    • 0.0 to < 1.0 decreases the confidence value
    • 1.0 leaves the confidence value unchanged
    • > 1.0 to 2.0 increases the confidence value
  • Debug - Enable debug logging.
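
One way to read the flag settings above is as a per-token decision. The Python sketch below is an interpretation, not Saga's implementation: the `FlagSettings` class, the `decide` function, the flag names, the order in which the checks run, and the "not_eligible" outcome for tokens that fail the required-flag checks are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FlagSettings:
    """Hypothetical container for the flag lists in General Settings."""
    skip_flags: Set[str] = field(default_factory=set)
    required_flags: Set[str] = field(default_factory=set)
    at_least_one_flag: Set[str] = field(default_factory=set)
    dont_process_flags: Set[str] = field(default_factory=set)


def decide(token_flags: Set[str], settings: FlagSettings) -> str:
    """Return what to do with a token, given the flags it carries."""
    if token_flags & settings.dont_process_flags:
        return "drop_path"      # the whole path in the graph is abandoned
    if token_flags & settings.skip_flags:
        return "skip_token"     # token is ignored, the path continues
    if not settings.required_flags <= token_flags:
        return "not_eligible"   # a required flag is missing (assumed outcome)
    if settings.at_least_one_flag and not (token_flags & settings.at_least_one_flag):
        return "not_eligible"   # none of the alternative flags is present
    return "process"


# Flag names here are made up for the example.
settings = FlagSettings(
    skip_flags={"PUNCTUATION"},
    required_flags={"TOKEN"},
    at_least_one_flag={"ALL_UPPER_CASE", "TITLE_CASE"},
    dont_process_flags={"HTML"},
)
print(decide({"TOKEN", "TITLE_CASE"}, settings))
print(decide({"TOKEN", "HTML"}, settings))
```

The sketch keeps "Skip Flags" and "Don't Process Flags" as separate outcomes because the settings above distinguish them: skipping continues along the same path, while not processing drops the path.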
