...
...
...
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
Configuration
...
Include Page |
---|
| Generic Configuration Parameters |
---|
|
Configuration Parameters
Parameter |
---|
summary | The selected spellchecking algorithm. |
---|
default | "Levenshtein" |
---|
name | algorithm |
---|
|
Parameter |
---|
summary | Extra parameters for the selected spellchecking algorithm. |
---|
name | algorithm_params |
---|
|
Parameter |
---|
summary | The dictionary resource that holds the names to be located in the text |
---|
|
...
| name | dictionary |
---|
required | true |
---|
|
- This is specified as "provider:name" in the standard resource format
...
- A JSON array of strings, such as ["TOKEN", "ALL_LOWER_CASE"]
...
language | js |
---|
title | Example Configuration |
---|
...
Parameter |
---|
summary | Ignore matches with tags specified in the dontProcessTags list |
---|
name | dontProcessTags |
---|
type | string array |
---|
|
Parameter |
---|
summary | Name of the tag which indicates what entities should be processed |
---|
name | entity |
---|
|
Parameter |
---|
summary | Removes accents and diacritics and generates a new pattern |
---|
default | false |
---|
name | normalizeAccents |
---|
|
Parameter |
---|
summary | Indicates whether the characters in charsList should be removed from the pattern, creating a new pattern |
---|
default | false |
---|
name | removeChars |
---|
type | boolean |
---|
|
Parameter |
---|
summary | List of characters to remove from the pattern |
---|
default | _-‿⁀⁔︳︴﹍﹎﹏_ |
---|
name | charsList |
---|
|
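The normalizeAccents, removeChars and charsList parameters each describe generating a new pattern from an existing one. The following Python sketch illustrates that idea only; it is not the stage's implementation. The function names (`strip_accents`, `remove_chars`, `pattern_variants`) are hypothetical, and it assumes that listed characters are dropped outright rather than replaced with a space.

```python
import unicodedata

# Default taken from the charsList parameter above.
DEFAULT_CHARS = "_-‿⁀⁔︳︴﹍﹎﹏_"


def strip_accents(text):
    """Remove accents and diacritics by dropping Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_chars(text, chars=DEFAULT_CHARS):
    """Drop every character that appears in chars (assumed behavior)."""
    return "".join(ch for ch in text if ch not in chars)


def pattern_variants(pattern, normalize_accents=False, remove=False,
                     chars=DEFAULT_CHARS):
    """Return the original pattern plus any new patterns generated from it."""
    variants = [pattern]
    if normalize_accents:
        candidate = strip_accents(pattern)
        if candidate not in variants:
            variants.append(candidate)
    if remove:
        # Iterate over a copy so new variants are not re-processed.
        for existing in list(variants):
            candidate = remove_chars(existing, chars)
            if candidate and candidate not in variants:
                variants.append(candidate)
    return variants


if __name__ == "__main__":
    print(pattern_variants("José_María", normalize_accents=True, remove=True))
```

With both options off, only the original pattern is returned, which matches the documented defaults of false.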
Parameter |
---|
summary | Threshold used to filter matches when doing vector similarity |
---|
default | 0.7 |
---|
name | cosineSimThreshold |
---|
type | double |
---|
|
Parameter |
---|
summary | Activate spellchecking to find misspelled entities. |
---|
default | false |
---|
name | spellchecking |
---|
type | boolean |
---|
|
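The default spellchecking algorithm is named "Levenshtein". As a rough illustration of how an edit-distance check can recover a misspelled entity, here is a minimal Python sketch. The `closest_patterns` helper and its `max_distance` argument are hypothetical and are not documented keys of algorithm_params; how the stage actually applies the distance is not described here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest_patterns(token, patterns, max_distance=1):
    """Return (distance, pattern) pairs within max_distance, nearest first."""
    scored = [(levenshtein(token, pattern), pattern) for pattern in patterns]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    print(closest_patterns("lincon", ["lincoln", "london", "macaroni"]))
```

A token that only matches through such a correction is the kind of item the MISSPELL flag is meant to mark.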
Parameter |
---|
summary | Activate matching based on Total Match. |
---|
default | false |
---|
name | matchAll |
---|
type | boolean |
---|
|
Parameter |
---|
summary | Optional threshold to match based on coverage of text matched. |
---|
default | 1.0 |
---|
name | matchAllThreshold |
---|
type | double |
---|
|
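The matchAllThreshold parameter is described as a threshold on the coverage of the text matched. One possible reading, sketched below in Python, is that coverage is the fraction of the tokens in a text block that a matched pattern spans. This interpretation, and the names `coverage` and `is_total_match`, are assumptions made for illustration and are not confirmed by the parameter descriptions.

```python
def coverage(total_tokens, matched_span):
    """Fraction of the text block covered by a match.

    matched_span is (start, end) in token positions, end exclusive.
    """
    start, end = matched_span
    if total_tokens == 0:
        return 0.0
    return (end - start) / total_tokens


def is_total_match(total_tokens, matched_span, threshold=1.0):
    """True when the match covers at least `threshold` of the text (assumed rule)."""
    return coverage(total_tokens, matched_span) >= threshold


if __name__ == "__main__":
    # "abraham lincoln" matched inside a two-token text: full coverage.
    print(is_total_match(2, (0, 2)))
    # The same match inside a six-token sentence covers one third of it.
    print(is_total_match(6, (0, 2)))
```

Under this reading, the default of 1.0 accepts only matches that cover the whole text, and lowering the threshold admits matches that cover most of it.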
Code Block |
---|
boundaryFlags | text block split |
---|
stage | DictionaryTagger |
---|
requiredFlags | token, semantic tag |
---|
language | js |
---|
skipFlags | skip |
---|
|
"algorithm": "Levenshtein",
"algorithm_params": {},
"dictionary": "dict-provider:people-lowercase",
"dontProcessTags": |
...
["color", "currency"],
"normalizeAccents": false,
"removeChars": false,
"charsList": "_-‿⁀⁔︳︴﹍﹎﹏_",
"spellchecking": false,
"cosineSimThreshold": 0.7,
"lowercase": true,
"matchAll": false,
"matchAllThreshold": 1.0 |
"people-lowercase" resource must be in the format |
...
Example Output
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
Code Block |
---|
|
V--------------[abraham lincoln likes macaroni and cheese]--------------------V |
...
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^ |
...
...
...
...
...
...
...
...
...
...
...
...
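The example above produces overlapping tags: "abraham lincoln" as a person alongside "lincoln" as a place, and "macaroni and cheese" alongside its component foods. The Python sketch below shows one simple way such overlapping matches can be enumerated over a token list. It is an illustration under stated assumptions, not the stage's algorithm: the `find_matches` function, the `max_len` limit, and the flat phrase-to-tags dictionary are all hypothetical.

```python
# Toy dictionary mirroring the example: phrase -> list of tags.
DICTIONARY = {
    "abraham lincoln": ["person"],
    "lincoln": ["place"],
    "macaroni": ["food"],
    "cheese": ["food"],
    "macaroni and cheese": ["food"],
}


def find_matches(tokens, dictionary, max_len=3):
    """Return (start, end, phrase, tag) for every dictionary phrase found.

    start and end are token positions, end exclusive. All overlapping
    matches are kept, so ambiguity is left for later stages to resolve.
    """
    matches = []
    for start in range(len(tokens)):
        longest_end = min(start + max_len, len(tokens))
        for end in range(start + 1, longest_end + 1):
            phrase = " ".join(tokens[start:end])
            for tag in dictionary.get(phrase, []):
                matches.append((start, end, phrase, tag))
    return matches


if __name__ == "__main__":
    tokens = "abraham lincoln likes macaroni and cheese".split()
    for match in find_matches(tokens, DICTIONARY):
        print(match)
```

Keeping every match, rather than only the longest one, mirrors the example, where both the multi-token and single-token interpretations appear.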
Output Flags
Lex-Item Flags
...
- SEMANTIC_TAG - Identifies all lexical items that are semantic tags.
- PARTIAL_MATCH - Identifies partial matches of patterns.
- TOTAL_MATCH - Identifies total matches of patterns.
- ENTITY - Identifies the token as an entity.
- MISSPELL - Identifies tokens with errors or misspellings.
Vertex Flags:
Info |
---|
No vertices are created in this stage |
Resource Data
The dictionary tagger must have an "entity dictionary" (a string to JSON map) which is a list of JSON records, indexed by entity ID. In addition, there may also be a pattern map and a token index.
The only file that is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.
Each JSON record represents an entity. The format is as follows:
Code Block |
---|
Title | Entity JSON Format |
---|
language | js |
---|
|
...
"_id" : "KGAAJGsBemSwA0nZTLXA",
"id":"Q28260",
|
...
...
...
...
"display": "Lincoln",
"patterns":[
|
...
"Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
|
...
...
"coord": [40.813639, -96.702611]
},
"confAdjust": 0.95
|
...
. . . additional fields as needed go here . . . |
...
Fields
- id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries) regardless of the type.
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
- tags (required, array of string) -
- patterns (required, array of string) -
- confidence (optional, float) -
Other, Optional Fields
- display (optional, string) -
- context (optional, object) -
Pattern Map
Note |
---|
- Multiple entries can have the same pattern. If the pattern is matched, then it will be tagged with multiple (ambiguous) entry IDs.
- Additional fielded data can be added to the record as needed by downstream processes.
|
Fields
Parameter |
---|
summary | An ID, normally referring to the ID of a database, a document, or an API key; not necessarily unique |
---|
name | id |
---|
required | true |
---|
|
Parameter |
---|
summary | Tag that identifies any match in the graph as an interpretation |
---|
name | tag |
---|
required | true |
---|
|
Parameter |
---|
summary | A list of patterns to match in the content |
---|
name | patterns |
---|
type | string array |
---|
required | true |
---|
|
Parameter |
---|
summary | What to show the user when browsing this entity |
---|
name | display |
---|
|
Parameter |
---|
summary | Free space to add extra data in any format supported by JSON |
---|
name | fields |
---|
type | json |
---|
|
Include Page |
---|
| Generic Resource Fields |
---|
|
Dictionary Index
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
This currently happens in RAM inside the DictionaryTagger stage. An off-line option for pre-inverting the dictionary will be provided in the future.
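To make "inverted" concrete: the entity dictionary maps entity IDs to patterns, and an inverted index maps each pattern back to the entity IDs that declare it. The Python sketch below shows that inversion on two records. It is a simplified illustration, not the in-RAM structure the stage builds; the `build_pattern_index` function, the lowercasing of keys, the "person-1" ID and the tag values are assumptions.

```python
from collections import defaultdict


def build_pattern_index(entities):
    """Invert entity records into a pattern -> [entity IDs] lookup."""
    index = defaultdict(list)
    for entity in entities:
        for pattern in entity["patterns"]:
            key = pattern.lower()  # assumed normalization, for illustration
            if entity["id"] not in index[key]:
                index[key].append(entity["id"])
    return dict(index)


if __name__ == "__main__":
    entities = [
        {"id": "Q28260", "tags": ["place"],
         "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]},
        # Hypothetical second record that shares the pattern "Lincoln".
        {"id": "person-1", "tags": ["person"],
         "patterns": ["Abraham Lincoln", "Lincoln"]},
    ]
    index = build_pattern_index(entities)
    print(index["lincoln"])
```

A pattern shared by several records resolves to several entity IDs, which is the ambiguous case described in the Pattern Map note.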
...