...
...
...
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
Configuration
...
Include Page |
---|
| Generic Configuration Parameters |
---|
|
Configuration Parameters
Parameter |
---|
summary | The selected spellchecking algorithm. |
---|
default | "Levenshtein" |
---|
name | algorithm |
---|
|
Parameter |
---|
summary | Extra parameters for the selected spellchecking algorithm. |
---|
name | algorithm_params |
---|
|
Parameter |
---|
summary | The dictionary resource that holds the names to be located in the text |
---|
|
...
| name | dictionary |
---|
required | true |
---|
|
- This is specified as "provider:name" in the standard resource format
...
- A JSON array of strings, such as ["TOKEN", "ALL_LOWER_CASE"]
...
language | js |
---|
title | Example Configuration |
---|
...
Parameter |
---|
summary | Ignore matches with tags specified in the dontProcessTags list |
---|
name | dontProcessTags |
---|
type | string array |
---|
|
Parameter |
---|
summary | Name of the tag which indicates what entities should be processed |
---|
name | entity |
---|
|
Parameter |
---|
summary | Removes accents and diacritics and generates a new pattern |
---|
default | false |
---|
name | normalizeAccents |
---|
|
Parameter |
---|
summary | Indicates whether the characters listed in charsList should be removed from the pattern, creating a new pattern |
---|
default | false |
---|
name | removeChars |
---|
type | boolean |
---|
|
Parameter |
---|
summary | List of characters to remove from the pattern |
---|
default | _-‿⁀⁔︳︴﹍﹎﹏_ |
---|
name | charsList |
---|
|
Parameter |
---|
summary | Threshold used to filter matches when doing vector similarity |
---|
default | 0.7 |
---|
name | cosineSimThreshold |
---|
type | double |
---|
|
Parameter |
---|
summary | Activate spellchecking to find misspelled entities. |
---|
default | false |
---|
name | spellchecking |
---|
type | boolean |
---|
|
Parameter |
---|
summary | Activate matching based on Total Match. |
---|
default | false |
---|
name | matchAll |
---|
type | boolean |
---|
|
Parameter |
---|
summary | Optional threshold to match based on coverage of text matched. |
---|
default | 1.0 |
---|
name | matchAllThreshold |
---|
type | double |
---|
|
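The normalizeAccents, removeChars and charsList parameters each describe generating a new pattern from an existing one. The following Python sketch illustrates one way such variants could be produced; it is not the stage's implementation. The function names are hypothetical, and it assumes that "removed" means deleted rather than replaced by a space, and that each option is applied independently to the original pattern.

```python
import unicodedata


def strip_accents(text):
    """Drop accents and diacritics: decompose to NFD, then discard combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_chars(text, chars_list):
    """Delete every character that appears in chars_list (assumed behaviour)."""
    return "".join(ch for ch in text if ch not in chars_list)


def pattern_variants(pattern, normalize_accents=False,
                     remove_chars_flag=False, chars_list=""):
    """Return the original pattern followed by any new patterns the options produce."""
    variants = [pattern]
    if normalize_accents:
        variants.append(strip_accents(pattern))
    if remove_chars_flag:
        variants.append(remove_chars(pattern, chars_list))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(variants))
```

With both options left at their default of false, only the original pattern is returned, so the dictionary is matched exactly as written.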
Code Block |
---|
boundaryFlags | text block split |
---|
stage | DictionaryTagger |
---|
requiredFlags | token, semantic tag |
---|
language | js |
---|
skipFlags | skip |
---|
|
"algorithm": "Levenshtein",
"algorithm_params": {},
"dictionary": "dict-provider:people-lowercase",
"dontProcessTags": ["color",
...
"currency"],
"normalizeAccents": false,
"removeChars": false,
"charsList": "_-‿⁀⁔︳︴﹍﹎﹏_",
"spellchecking": false,
"cosineSimThreshold": 0.7,
"lowercase": true,
"matchAll": false,
"matchAllThreshold": 1.0
"people-lowercase" resource must be in the format |
...
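The default spellchecking algorithm is "Levenshtein". For readers unfamiliar with it, the sketch below shows the standard Levenshtein edit distance in Python and how it could be used to pick the closest dictionary pattern for a misspelled token. The `closest_pattern` helper and its `max_distance` argument are hypothetical illustrations; the actual contents of algorithm_params are not documented here.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_pattern(token, patterns, max_distance=1):
    """Return the pattern nearest to token within max_distance, or None (hypothetical helper)."""
    best = None
    best_distance = max_distance + 1
    for pattern in patterns:
        distance = levenshtein(token, pattern)
        if distance < best_distance:
            best, best_distance = pattern, distance
    return best
```

For example, `closest_pattern("lincon", ["lincoln", "london"])` selects "lincoln", which is one insertion away, while "london" is two substitutions away and is rejected.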
Example Output
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
Code Block |
---|
|
V--------------[abraham lincoln likes macaroni and cheese]--------------------V |
...
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^ |
...
...
...
...
...
...
...
...
...
...
]--------^ ^---------------- |
...
...
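The example above can be approximated with a short Python sketch that looks up every token span in a pattern-to-tags map. This only illustrates how overlapping and ambiguous matches arise; it is not how the stage builds its interpretation graph. The tag names "person", "place" and "food" and the `find_matches` function are assumptions made for this example.

```python
def find_matches(text, dictionary):
    """Return (start, end, pattern, tag) for every token span found in the dictionary."""
    tokens = text.lower().split()
    longest = max((len(p.split()) for p in dictionary), default=0)
    matches = []
    for start in range(len(tokens)):
        for length in range(1, longest + 1):
            end = start + length
            if end > len(tokens):
                break
            candidate = " ".join(tokens[start:end])
            for tag in dictionary.get(candidate, []):
                matches.append((start, end, candidate, tag))
    return matches


# Patterns from the example; the tag names are assumed.
example_dictionary = {
    "abraham lincoln": ["person"],
    "lincoln": ["place"],
    "macaroni": ["food"],
    "cheese": ["food"],
    "macaroni and cheese": ["food"],
}

example_matches = find_matches("abraham lincoln likes macaroni and cheese",
                               example_dictionary)
```

Note that "lincoln" is reported both inside the two-token person match and on its own as a place, and "macaroni" and "cheese" are reported alongside the longer "macaroni and cheese" match. Choosing among these alternatives is left to later processing.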
Output Flags
Lex-Item Flags
...
- SEMANTIC_TAG - Identifies all lexical items that are semantic tags.
- PARTIAL_MATCH - Identifies partial matches of patterns.
- TOTAL_MATCH - Identifies total matches of patterns.
- ENTITY - Identifies the token as an entity.
- MISSPELL - Identifies tokens with errors or misspellings.
Vertex Flags:
Info |
---|
No vertices are created in this stage |
Resource Data
The dictionary tagger must have an "entity dictionary" (a string to JSON map) which is a list of JSON records, indexed by entity ID. In addition, there may also be a pattern map and a token index.
The only file that is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.
Each JSON record represents an entity. The format is as follows:
Code Block |
---|
Title | Entity JSON Format |
---|
language | js |
---|
|
...
"_id" : "KGAAJGsBemSwA0nZTLXA",
"id":"Q28260",
|
...
...
...
...
"display": "Lincoln",
"patterns":[
|
...
"Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
|
...
...
"coord": [40.813639, -96.702611]
},
"confAdjust": 0.95
|
...
. . . additional fields as needed go here . . . |
...
...
...
- Multiple entries can have the same pattern.
- If the pattern is matched, then it will be tagged with multiple (ambiguous) entry IDs.
- Additional fielded data can be added to the record as needed by downstream processes.
Fields
...
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
Parameter |
---|
summary | An ID, normally referring to the ID of a database, a document, or an API key; not necessarily unique |
---|
name | id |
---|
required | true |
---|
|
Parameter |
---|
summary | Tag that identifies any match in the graph as an interpretation |
---|
name | tag |
---|
required | true |
---|
|
...
...
...
Tip |
---|
Tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area} |
Parameter |
---|
summary | A list of patterns to match in the content |
---|
|
...
Other, Optional Fields
...
| name | patterns |
---|
type | string array |
---|
required | true |
---|
|
...
- This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
Note |
---|
Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase. |
What to show the user when browsing this entity |
|
...
Parameter |
---|
summary | Free space to add extra data in any format supported by JSON |
---|
name | fields |
---|
type | json |
---|
|
Include Page |
---|
| Generic Resource Fields |
---|
|
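Since id, tag and patterns are marked as required, it can be useful to check records before loading them. The Python sketch below is a hypothetical pre-load check, not part of the toolkit. The sample record reuses values from the format example; its "tag" value of "city" is an assumption borrowed from the tag hierarchy tip.

```python
REQUIRED_FIELDS = ("id", "tag", "patterns")


def validate_entity(record):
    """Return a list of problems; an empty list means the required fields look usable."""
    problems = ["missing required field: " + name
                for name in REQUIRED_FIELDS if name not in record]
    patterns = record.get("patterns")
    if patterns is not None:
        is_string_array = (isinstance(patterns, list)
                           and all(isinstance(p, str) for p in patterns))
        if not is_string_array:
            problems.append("patterns must be an array of strings")
    return problems


# Sample record; "tag" is an assumed value, the rest comes from the format example.
sample_entity = {
    "id": "Q28260",
    "tag": "city",
    "display": "Lincoln",
    "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
    "confAdjust": 0.95,
}

sample_problems = validate_entity(sample_entity)
```

Optional fields such as display, confAdjust and fields are deliberately not checked, because the record format allows additional data as needed.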
...
Dictionary Index
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
This currently happens in RAM inside the DictionaryTagger stage. An off-line option for pre-inverting the dictionary will be provided in the future.
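As a rough illustration of what inverting the entity dictionary means, the Python sketch below builds an in-memory map from each pattern token to the IDs of the entities whose patterns contain it. It follows the tokenization described in the note above (split on white-space and punctuation, then lowercase). The function names and the second entity ID are hypothetical, and the real index structure inside the DictionaryTagger stage may differ.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split on white-space and punctuation, then reduce to lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_token_index(entities):
    """Map each token to the set of entity IDs whose patterns contain that token."""
    index = defaultdict(set)
    for entity in entities:
        for pattern in entity["patterns"]:
            for token in tokenize(pattern):
                index[token].add(entity["id"])
    return index


# "E2" is a made-up ID for illustration.
example_entities = [
    {"id": "Q28260", "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]},
    {"id": "E2", "patterns": ["Abraham Lincoln"]},
]

token_index = build_token_index(example_entities)
```

Looking up a token from the input text then yields a small set of candidate entities to verify, instead of scanning every pattern of every entity.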