Looks up sequences of tokens in a dictionary and then tags the sequence with one or more semantic tags as an alternative representation. Typically, these tags represent entities such as {person}, {place}, {company}, etc.

Info
Uses Dictionary Tagger Stage
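
The lookup described above can be pictured with a minimal sketch. This is not the stage's actual implementation: the function `tag_sequences`, the `max_len` limit, the lowercase matching, and the sample dictionary are all assumptions made for illustration.

```python
def tag_sequences(tokens, dictionary, max_len=3):
    """Return (start, end, tag) for every token sequence found in the dictionary."""
    matches = []
    for start in range(len(tokens)):
        # Try every sequence of up to max_len tokens beginning at `start`.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            key = " ".join(tokens[start:end]).lower()
            # A single pattern may carry more than one semantic tag.
            for tag in dictionary.get(key, []):
                matches.append((start, end, tag))
    return matches


# Hypothetical dictionary: pattern -> semantic tags
dictionary = {
    "abraham lincoln": ["{person}"],
    "lincoln": ["{person}", "{place}"],
}
tokens = ["Abraham", "Lincoln", "visited", "Lincoln"]
matches = tag_sequences(tokens, dictionary)
```

In this sketch the original tokens stay untouched and each match is recorded alongside them, which mirrors the idea of tags as an alternative representation of the same text.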

Configuration

Note

If "Use Spellchecking" is checked, additional configuration parameters appear.

Parameter
summary List of tags to be ignored by the recognizer
name Ignore Tags
type string array
Parameter
summary Replace characters such as "á, ö, ç" with their normalized forms "a, o, c"
default unchecked
name Remove Accents & Diacritics
type boolean
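
As an illustration of what this option does, the following sketch strips diacritics with Python's standard `unicodedata` module. It shows one common way to get this behavior, not necessarily how the recognizer implements it; the function name `remove_accents` is hypothetical.

```python
import unicodedata


def remove_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("áöç"))  # prints "aoc"
```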
Parameter
summary Replace the characters specified with white space
default unchecked
name Remove Characters
type boolean
- Parameter
  summary Will only be used if Remove Characters is checked
  default _-‿⁀⁔︳︴﹍﹎﹏＿
  name Characters to remove
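
A minimal sketch of the replacement, assuming each listed character is swapped for a single space. The function name `replace_with_space` is hypothetical; the default character list is copied from the parameter above.

```python
DEFAULT_CHARS = "_-‿⁀⁔︳︴﹍﹎﹏＿"


def replace_with_space(text, chars=DEFAULT_CHARS):
    """Replace every character listed in `chars` with a white space."""
    table = str.maketrans({ch: " " for ch in chars})
    return text.translate(table)


print(replace_with_space("new_york-city"))  # prints "new york city"
```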
Parameter
summary Activates spellchecking so that misspelled entities are recognized.
default unchecked
name Use Spellchecking
type boolean
- Parameter
  summary Similarity score for misspelled texts.
  default 0.7
  name Spellchecking score threshold
  type double
  required true
- Parameter
  summary Maximum number of suggestions for the misspelled element.
  default 5
  name Max of Suggestions
  type integer
  required true
- Parameter
  summary The spellchecking algorithm to use.
  default Levenshtein
  name Spellchecking Algorithm
  required true
  - The available algorithms are:
    - Levenshtein - The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
    - JaroWinkler - The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. The lower the Jaro–Winkler distance for two strings is, the more similar the strings are.
    - LuceneLevenshtein - The same Levenshtein algorithm but using Lucene.
    - nGram - n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
    - Soundex - Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
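
To show how the three spellchecking parameters could interact, here is a sketch built on the Levenshtein definition above. How the recognizer turns a distance into a similarity score is not documented here, so the normalization in `similarity` (1 minus distance divided by the longer length) is an assumption, as are the function names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(word, entries, threshold=0.7, max_suggestions=5):
    """Keep entries scoring at or above the threshold, best first, capped in number."""
    scored = [(similarity(word, entry), entry) for entry in entries]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in kept[:max_suggestions]]


print(suggest("microsft", ["microsoft", "apple", "micron"]))  # prints ['microsoft']
```

With the defaults, "microsft" is one edit away from "microsoft" (score about 0.89) and is accepted, while "micron" is three edits away (score 0.625) and is rejected.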
Parameter
summary Cosine similarity threshold for accepting entities.
default 0.7
name Vector Cosine Similarity Acceptance
type double
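
For reference, cosine similarity between two vectors can be computed as in the sketch below. How the recognizer builds its vectors is not described on this page, and treating the configured value as a minimum for acceptance is an assumption; `cosine_similarity` and `accept` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)


def accept(u, v, threshold=0.7):
    """Assumed rule: accept when the similarity reaches the configured value."""
    return cosine_similarity(u, v) >= threshold
```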
Parameter
summary Internal tokenizer of the recognizer
default Latin-script Alphabet
name Language Tokenizer
- The available tokenizers are:
  - Latin-script alphabet
  - Korean
  - Japanese
  - Chinese

Adding an Entity

Click the button to open the "Add new Entity" dialog.

Parameter
summary Patterns to look for
name Patterns
type string array
required true
Parameter
summary ID assigned to the set of patterns
default autogenerated
name ID
required true
- Normally used to match the ID in a database or a key for an API
Parameter
summary Display value used for normalization
name Display
required true
Parameter
summary Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
default 1
name Confidence Adjustment
type double
required true
- 0.0 to < 1.0 decreases confidence value
- 1.0 confidence value remains the same
- > 1.0 to 2.0 increases confidence value
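
The three ranges above are consistent with a simple multiplicative factor. The sketch below assumes that behavior; the page does not state the exact formula, and `adjust_confidence` is a hypothetical name.

```python
def adjust_confidence(confidence, adjustment=1.0):
    """Scale a match confidence by the entity's adjustment factor (assumed multiplicative)."""
    if not 0.0 <= adjustment <= 2.0:
        raise ValueError("adjustment must be between 0.0 and 2.0")
    return confidence * adjustment


print(adjust_confidence(0.8, 0.5))  # factor below 1.0: confidence decreases
print(adjust_confidence(0.8, 1.0))  # factor of 1.0: confidence unchanged
print(adjust_confidence(0.4, 1.5))  # factor above 1.0: confidence increases
```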
The last field is reserved for additional custom configuration.

General Settings

Include Page

	Generic Recognizer Config
