
Looks up sequences of tokens in a dictionary and tags each matching sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, or {company}.
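As a rough illustration of the lookup described above, the following Python sketch matches token sequences against a small in-memory dictionary. The dictionary contents, the `tag_tokens` function, and the lowercase matching are assumptions made for this example; they are not the recognizer's actual implementation.

```python
# Hypothetical dictionary: token-sequence patterns mapped to semantic tags.
DICTIONARY = {
    ("new", "york"): ["place"],
    ("apple",): ["company"],
    ("john", "smith"): ["person"],
}
MAX_PATTERN_LEN = max(len(pattern) for pattern in DICTIONARY)


def tag_tokens(tokens):
    """Return (start, end, tags) for every token sequence found in the dictionary."""
    lowered = [token.lower() for token in tokens]  # assumed case-insensitive match
    matches = []
    for start in range(len(lowered)):
        for length in range(1, MAX_PATTERN_LEN + 1):
            end = start + length
            if end > len(lowered):
                break
            tags = DICTIONARY.get(tuple(lowered[start:end]))
            if tags:
                matches.append((start, end, tags))
    return matches


matches = tag_tokens("John Smith moved to New York".split())
```

Each match records the token span and its tags, so the original tokens stay available alongside the tagged alternative.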

Uses the Dictionary Tagger Stage.

Configuration


Note: If "Use Spellchecking" is checked, additional configuration options appear.

  • Parameter
    summary: List of tags to be ignored by the recognizer
    name: Ignore Tags
    type: string array
  • Parameter
    summary: Replace characters such as "á, ö, ç" with their normalized forms "a, o, c"
    default: unchecked
    name: Remove Accents & Diacritics
    type: boolean
  • Parameter
    summary: Replace the specified characters with white space
    default: unchecked
    name: Remove Characters
    type: boolean
    • Parameter
      summary: Only used if Remove Characters is checked
      default: _-‿⁀⁔︳︴﹍﹎﹏_
      name: Characters to remove
  • Parameter
    summary: Activates spellchecking to recognize misspelled entities.
    default: unchecked
    name: Use Spellchecking
    type: boolean
    • Parameter
      summary: Similarity score threshold for misspelled text.
      default: 0.7
      name: Spellchecking score threshold
      type: double
      required: true
    • Parameter
      summary: Maximum number of suggestions for the misspelled element.
      default: 5
      name: Max of Suggestions
      type: integer
      required: true
    • Parameter
      summary: The spellchecking algorithm to use.
      default: Levenshtein
      name: Spellchecking Algorithm
      required: true
      • The available algorithms are:
        • Levenshtein - The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
        • JaroWinkler - The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. The lower the Jaro–Winkler distance between two strings, the more similar they are.
        • LuceneLevenshtein - The same Levenshtein algorithm, but using Lucene.
        • nGram - An n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words or base pairs.
        • Soundex - Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
  • Parameter
    summary: Cosine similarity threshold for accepting entities.
    default: 0.7
    name: Vector Cosine Similarity Acceptance
    type: double
  • Parameter
    summary: Internal tokenizer of the recognizer
    default: Latin-script alphabet
    name: Language Tokenizer
    • The available tokenizers are:
      • Latin-script alphabet
      • Korean
      • Japanese
      • Chinese
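The text clean-up and spellchecking options above can be sketched in Python as follows. This is an illustrative sketch, not the recognizer's own code: the function names, the shortened character set `_-`, and the way the edit distance is turned into a 0.0 to 1.0 similarity score are all assumptions made for the example.

```python
import unicodedata


def remove_accents(text):
    """Replace accented characters with their base forms (e.g. "á" -> "a")."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def replace_characters(text, characters="_-"):
    """Replace each listed character with a single space (assumed subset of the default)."""
    return text.translate({ord(ch): " " for ch in characters})


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization of the edit distance into a 0.0-1.0 score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(word, patterns, threshold=0.7, max_suggestions=5):
    """Return the best-scoring patterns at or above the threshold."""
    scored = [(similarity(word, pattern), pattern) for pattern in patterns]
    accepted = sorted((item for item in scored if item[0] >= threshold), reverse=True)
    return [pattern for _, pattern in accepted[:max_suggestions]]


cleaned = replace_characters(remove_accents("São_Paulo"))
suggestions = suggest("microsft", ["microsoft", "minecraft", "apple"])
```

The `threshold` and `max_suggestions` defaults mirror the documented defaults of 0.7 and 5. A different algorithm choice, such as JaroWinkler or Soundex, would replace the `similarity` function.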

Adding an Entity

Click the button to open the "Add new Entity" dialog.


  • Parameter
    summary: Patterns to look for
    name: Patterns
    type: string array
    required: true
  • Parameter
    summary: ID assigned to the set of patterns
    default: autogenerated
    name: ID
    required: true
    • Normally used to match an ID in a database or a key for an API
  • Parameter
    summary: Display value used for normalization
    name: Display
    required: true
  • Parameter
    summary: Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
    default: 1
    name: Confidence Adjustment
    type: double
    required: true
    • 0.0 to < 1.0 decreases the confidence value
    • 1.0 leaves the confidence value unchanged
    • > 1.0 to 2.0 increases the confidence value
  • The final field is for additional custom configurations.
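The entity fields above, and the effect of the Confidence Adjustment factor, can be illustrated with a small Python sketch. The `Entity` class, its field names, and the assumption that the factor is multiplied with the match confidence are hypothetical; the source only states that values below 1.0 decrease the confidence, 1.0 leaves it unchanged, and values above 1.0 increase it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Hypothetical in-memory form of one entry from the "Add new Entity" dialog."""
    patterns: List[str]
    id: str
    display: str
    confidence_adjustment: float = 1.0

    def __post_init__(self):
        # The documented range for the adjustment factor is 0.0 to 2.0.
        if not 0.0 <= self.confidence_adjustment <= 2.0:
            raise ValueError("confidence_adjustment must be between 0.0 and 2.0")


def adjust_confidence(confidence, entity):
    """Scale a match confidence by the entity's factor (assumed to be a simple product)."""
    return confidence * entity.confidence_adjustment


entity = Entity(
    patterns=["New York", "NYC"],
    id="place-001",
    display="New York",
    confidence_adjustment=0.5,
)
adjusted = adjust_confidence(0.8, entity)
```

With a factor of 0.5, a match confidence of 0.8 is halved. Whether the real recognizer also caps the result is not stated in the source, so this sketch does not clamp it.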

General Settings

See the Generic Recognizer Config page.