Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc.
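
The lookup described above can be sketched as follows. This is a minimal illustration, not the recognizer's actual implementation: the `DICTIONARY` contents, the `tag_tokens` function, and the choice to lowercase tokens before lookup are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary: token sequences (lowercased) mapped to semantic tags.
DICTIONARY: Dict[Tuple[str, ...], List[str]] = {
    ("new", "york"): ["place"],
    ("acme", "corp"): ["company"],
    ("jordan",): ["person", "place"],
}


def tag_tokens(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], List[str]] = DICTIONARY,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for every dictionary entry found in tokens.

    `end` is exclusive. A sequence with several tags yields one match per tag.
    """
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive lookup
    max_len = max((len(key) for key in dictionary), default=0)
    matches = []
    for start in range(len(lowered)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(lowered):
                break
            for tag in dictionary.get(tuple(lowered[start:end]), []):
                matches.append((start, end, tag))
    return matches


if __name__ == "__main__":
    for start, end, tag in tag_tokens("Jordan flew to New York".split()):
        print(start, end, "{" + tag + "}")
```

Each match keeps the original token span, so the tag acts as an alternative representation of the text rather than a replacement for it.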

Info: Uses the Dictionary Tagger Stage.

Configuration

Note: If "Use Spellchecking" is checked, additional configuration options appear.

  • Parameter
    summary: List of tags to be ignored by the recognizer
    name: Ignore Tags
    type: string array
  • Parameter
    summary: Replace characters such as "á, ö, ç" with their normalized forms "a, o, c"
    default: unchecked
    name: Remove Accents & Diacritics
    type: boolean
  • Parameter
    summary: Replace the specified characters with whitespace
    default: unchecked
    name: Remove Characters
    type: boolean
    • Parameter
      summary: Only used if Remove Characters is checked
      default: _-‿⁀⁔︳︴﹍﹎﹏_
      name: Characters to remove
  • Parameter
    summary: Activates matching based on a total match.
    default: false
    name: Tag Only If Total Match
    type: boolean
    • Parameter
      summary: Optional threshold for matching, based on the coverage of the matched text.
      default: 1.0
      name: Total Match Threshold
      type: double
  • Parameter
    summary: Activates spellchecking to recognize misspelled entities.
    default: unchecked
    name: Use Spellchecking
    type: boolean
    • Parameter
      summary: Similarity score threshold for misspelled text.
      default: 0.7
      name: Spellchecking score threshold
      type: double
      required: true
    • Parameter
      summary: Maximum number of suggestions for the misspelled element.
      default: 5
      name: Max of Suggestions
      type: integer
      required: true
    • Parameter
      summary: The spellchecking algorithm to use.
      default: Levenshtein
      name: Spellchecking Algorithm
      required: true
      • The available algorithms are:
          • Levenshtein - The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
          • JaroWinkler - The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. The lower the Jaro–Winkler distance for two strings is, the more similar the strings are.
          • LuceneLevenshtein - The same Levenshtein algorithm but using Lucene.
          • nGram - An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
          • Soundex - Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
    • Parameter
      summary: Cosine factor for accepting entities.
      default: 0.7
      name: Vector Cosine Similarity Acceptance
      type: double
    • Parameter
      summary: Internal tokenizer of the recognizer
      default: Latin-script Alphabet
      name: Language Tokenizer
      • The available tokenizers are:
        • Latin-script alphabet
        • Korean
        • Japanese
        • Chinese
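
As an illustration of the normalization and spellchecking options listed above, here is a minimal sketch in Python. It is not the recognizer's code: the function names are hypothetical, and the way a Levenshtein distance is turned into a similarity score (1 minus distance divided by the longer length) is an assumption chosen so the result can be compared against a threshold such as 0.7.

```python
import unicodedata
from typing import List, Tuple

# Characters taken from the documented default of "Characters to remove".
DEFAULT_CHARS = "_-‿⁀⁔︳︴﹍﹎﹏"


def remove_accents(text: str) -> str:
    """Strip combining marks after canonical decomposition (á -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def replace_chars(text: str, chars: str = DEFAULT_CHARS) -> str:
    """Replace each listed character with a single space."""
    return text.translate({ord(ch): " " for ch in chars})


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(
    word: str,
    entries: List[str],
    threshold: float = 0.7,
    max_suggestions: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to max_suggestions entries scoring at or above threshold."""
    scored = [(similarity(word, entry), entry) for entry in entries]
    kept = sorted((p for p in scored if p[0] >= threshold), key=lambda p: (-p[0], p[1]))
    return [(entry, score) for score, entry in kept[:max_suggestions]]


if __name__ == "__main__":
    print(remove_accents("áöç"))
    print(replace_chars("state-of_the‿art"))
    print(suggest("microsft", ["microsoft", "apple", "micron"]))
```

The `threshold` and `max_suggestions` defaults mirror the documented defaults of "Spellchecking score threshold" and "Max of Suggestions"; the other algorithms (JaroWinkler, nGram, Soundex) would replace `similarity` with a different scoring function.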

    Adding an Entity

    Click the add button, which opens the "Add new Entity" dialog.


    • Parameter
      summary: Patterns to look for
      name: Patterns
      type: string array
      required: true
    • Parameter
      summary: ID assigned to the set of patterns
      default: autogenerated
      name: ID
      required: true
      • Normally used to match the ID in a database or a key for an API
    • Parameter
      summary: Display text used for normalization
      name: Display
      required: true
    • Parameter
      summary: Adjustment factor, from 0.0 to 2.0, applied to the confidence value (applies to every pattern match).
      default: 1
      name: Confidence Adjustment
      type: double
      required: true
      • 0.0 to < 1.0 decreases the confidence value
      • 1.0 leaves the confidence value unchanged
      • > 1.0 to 2.0 increases the confidence value
    • The final space is for additional custom configuration.
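
To make the entity fields and the Confidence Adjustment ranges concrete, here is a small sketch. The `Entity` class and `adjust_confidence` function are hypothetical, and treating the adjustment as a plain multiplication is an assumption based on the word "factor"; whether the recognizer caps the result is not stated in this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Hypothetical record mirroring the "Add new Entity" dialog fields."""
    patterns: List[str]
    id: str
    display: str
    confidence_adjustment: float = 1.0


def adjust_confidence(confidence: float, adjustment: float = 1.0) -> float:
    """Scale a confidence value by an adjustment factor in [0.0, 2.0].

    Assumption: the factor is applied by multiplication, with no capping.
    """
    if not 0.0 <= adjustment <= 2.0:
        raise ValueError("adjustment must be between 0.0 and 2.0")
    return confidence * adjustment


if __name__ == "__main__":
    entity = Entity(
        patterns=["Acme Corp", "Acme Corporation"],
        id="company-001",  # illustrative ID, e.g. a database key
        display="Acme Corporation",
        confidence_adjustment=0.5,
    )
    print(adjust_confidence(0.8, entity.confidence_adjustment))
```

Under this reading, a factor below 1.0 lowers the confidence of every match for the entity's patterns, 1.0 leaves it unchanged, and a factor above 1.0 raises it.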

    General Settings

    See the Generic Recognizer Config page.