...
...
...
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
Configuration
...
Include Page |
---|
| Generic Configuration Parameters |
---|
| Generic Configuration Parameters |
---|
|
Configuration Parameters
Parameter |
---|
summary | The selected spellchecking algorithm. |
---|
default | "Levenshtein" |
---|
name | algorithm |
---|
|
Parameter |
---|
summary | Extra parameters for the selected spellchecking algorithm. |
---|
name | algorithm_params |
---|
|
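For orientation, the default "Levenshtein" algorithm refers to edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal sketch of that measure, not the stage's actual implementation. The `closest_patterns` helper and its `max_distance` argument are hypothetical and do not correspond to any documented `algorithm_params` key.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def closest_patterns(token, patterns, max_distance=1):
    """Hypothetical helper: keep patterns within max_distance edits of token."""
    return [p for p in patterns if levenshtein(token, p) <= max_distance]
```

Under these assumptions, a misspelled token such as "lincon" is one edit away from the pattern "lincoln", so it would be kept as a candidate with `max_distance=1`.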
Parameter |
---|
summary | The dictionary resource that holds the names to be located in the text. |
---|
|
...
| name | dictionary |
---|
required | true |
---|
|
- This is specified as "provider:name" in the standard resource format
...
- A JSON array of strings, such as ["TOKEN", "ALL_LOWER_CASE"]
...
- Tokens marked with these flags will be ignored by this stage, and no processing will be performed.
...
language | js |
---|
title | Example Configuration |
---|
Parameter |
---|
summary | Ignore matches with tags specified in the ignoreTags list |
---|
name | dontProcessTags |
---|
type | string array |
---|
|
Parameter |
---|
summary | Name of the tag that indicates which entities should be processed |
---|
name | entity |
---|
|
Parameter |
---|
summary | Removes accents and diacritics and generates a new pattern |
---|
default | false |
---|
name | normalizeAccents |
---|
|
Parameter |
---|
summary | Indicates whether the characters listed in charsList should be removed from the pattern, creating a new pattern |
---|
default | false |
---|
name | removeChars |
---|
type | boolean |
---|
|
Parameter |
---|
summary | List of characters to remove from the pattern |
---|
default | _-‿⁀⁔︳︴﹍﹎﹏_ |
---|
name | charsList |
---|
|
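The effect of normalizeAccents, removeChars, and charsList can be illustrated with a short sketch. This is one plausible reading of "generates a new pattern" (each option adds a variant next to the original), not the stage's actual code. All function names are hypothetical, and the sketch assumes the listed characters are deleted rather than replaced by whitespace.

```python
import unicodedata

# Default character list copied from the charsList parameter above.
DEFAULT_CHARS = "_-‿⁀⁔︳︴﹍﹎﹏_"


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_chars(text, chars=DEFAULT_CHARS):
    """Delete every character that appears in chars (assumed behavior)."""
    return "".join(ch for ch in text if ch not in chars)


def pattern_variants(pattern, normalize_accents=False, remove=False,
                     chars=DEFAULT_CHARS):
    """Return the original pattern plus any new patterns, without duplicates."""
    variants = [pattern]
    if normalize_accents:
        variants.append(strip_accents(pattern))
    if remove:
        variants.append(remove_chars(pattern, chars))
    unique = []
    for variant in variants:
        if variant not in unique:
            unique.append(variant)
    return unique
```

With both options left at their default of false, the sketch returns only the original pattern.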
Parameter |
---|
summary | Threshold used to filter matches when doing vector similarity |
---|
default | 0.7 |
---|
name | cosineSimThreshold |
---|
type | double |
---|
|
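As a reminder of what a cosine similarity threshold such as 0.7 means, the sketch below computes the cosine of the angle between two vectors and compares it with the threshold. How the stage obtains its vectors is not described here, so the inputs are plain lists of numbers. The function names are hypothetical, and the use of `>=` rather than `>` is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def passes_threshold(u, v, threshold=0.7):
    """Assumed filter: keep a candidate whose similarity reaches the threshold."""
    return cosine_similarity(u, v) >= threshold
```

For example, the vectors [1, 1] and [1, 0] have a similarity of about 0.707, which just passes the default threshold.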
Parameter |
---|
summary | Activate spellchecking to find misspelled entities. |
---|
default | false |
---|
name | spellchecking |
---|
type | boolean |
---|
|
Parameter |
---|
summary | Activate matching based on Total Match. |
---|
default | false |
---|
name | matchAll |
---|
type | boolean |
---|
|
Parameter |
---|
summary | Optional threshold to match based on coverage of text matched. |
---|
default | 1.0 |
---|
name | matchAllThreshold |
---|
type | double |
---|
|
Code Block |
---|
boundaryFlags | text block split |
---|
stage | DictionaryTagger |
---|
requiredFlags | token, semantic tag |
---|
language | js |
---|
skipFlags | skip |
---|
|
"algorithm": "Levenshtein",
"algorithm_params": {},
"dictionary": |
...
"dict-provider:people-lowercase",
|
...
...
...
...
currency"],
"normalizeAccents": false,
"removeChars": false,
"charsList": "_-‿⁀⁔︳︴﹍﹎﹏_",
"spellchecking": false,
"cosineSimThreshold": 0.7,
"lowercase": true,
"matchAll": false,
"matchAllThreshold": 1.0 |
"people-lowercase" resource must be in the format |
...
Example Output
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
Code Block |
---|
|
V--------------[abraham lincoln likes macaroni and cheese]--------------------V |
...
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^ |
...
...
...
...
...
...
...
...
...
...
]--------^ ^---------------- |
...
...
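The lookup behind the example can be approximated in a few lines. The sketch below is a simplified illustration, not the stage's implementation: it splits on whitespace, tries every token span up to a hypothetical `max_len`, and reports each span found in a small in-memory dictionary. The tag names `person`, `place`, and `food` are assumed from the description of the example.

```python
# Hypothetical in-memory dictionary: pattern -> list of tags.
DICTIONARY = {
    "abraham lincoln": ["person"],
    "lincoln": ["place"],
    "macaroni": ["food"],
    "cheese": ["food"],
    "macaroni and cheese": ["food"],
}


def tag_spans(text, dictionary, max_len=3):
    """Return (start, end, phrase, tag) for every token span in the dictionary."""
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end])
            for tag in dictionary.get(phrase, []):
                matches.append((start, end, phrase, tag))
    return matches


matches = tag_spans("abraham lincoln likes macaroni and cheese", DICTIONARY)
```

Note that overlapping interpretations are all kept: "lincoln" is reported as a place even though "abraham lincoln" is also reported as a person, and "macaroni and cheese" is reported alongside "macaroni" and "cheese".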
Output Flags
Lex-Item Flags
...
- SEMANTIC_TAG - Identifies all lexical items that are semantic tags.
- PARTIAL_MATCH - Identifies partial matches of patterns.
- TOTAL_MATCH - Identifies total matches of patterns.
- ENTITY - Identifies the token as an entity.
- MISSPELL - Identifies tokens with errors or misspellings.
Vertex Flags:
Info |
---|
No vertices are created in this stage |
Resource Data
The dictionary tagger must have an "entity dictionary" (a string to JSON map) which is a list of JSON records, indexed by entity ID. In addition, there may also be a pattern map and a token index.
The only file that is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.
Each JSON record represents an entity. The format is as follows:
Code Block |
---|
Title | Entity Json Format |
---|
language | js |
---|
|
...
"_id" : "KGAAJGsBemSwA0nZTLXA",
"id":"Q28260",
|
...
...
...
...
"display": "Lincoln",
"patterns":[
|
...
...
"Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
|
...
...
"coord": [40.813639, -96.702611]
},
"confAdjust": 0.95 |
...
...
. . . additional fields as needed go here . . . |
...
...
...
- entries can have the same pattern.
|
...
- If the pattern is matched, then it will be tagged with multiple (ambiguous) entry IDs.
- Additional fielded data can be added to the record, as needed by downstream processes.
|
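The ambiguity described in the notes above can be made concrete with a small sketch that maps each pattern to the IDs of all entities that list it. This is only an illustration under stated assumptions: the second entity and its ID `example-person-1` are hypothetical, and lowercasing the pattern as the map key is an assumption about normalization.

```python
from collections import defaultdict

# Two sample records; only the first ID comes from the example above.
ENTITIES = [
    {"id": "Q28260", "tag": "city",
     "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]},
    {"id": "example-person-1", "tag": "person",
     "patterns": ["Abraham Lincoln", "Lincoln"]},
]


def build_pattern_map(entities):
    """Map each lowercased pattern to the IDs of every entity that lists it."""
    pattern_map = defaultdict(list)
    for entity in entities:
        for pattern in entity["patterns"]:
            ids = pattern_map[pattern.lower()]
            if entity["id"] not in ids:
                ids.append(entity["id"])
    return dict(pattern_map)


pattern_map = build_pattern_map(ENTITIES)
```

Here the pattern "lincoln" resolves to both IDs, so a match on it would carry both (ambiguous) entries, while "lincoln, ne" resolves to a single entity.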
Fields
...
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
...
Parameter |
---|
summary | An ID, typically referring to a database ID, a document, or an API key; it is not necessarily unique |
---|
name | id |
---|
required | true |
---|
|
Parameter |
---|
summary | Tag that identifies any match in the graph as an interpretation |
---|
name | tag |
---|
required | true |
---|
|
...
...
are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area} |
...
Parameter |
---|
summary | A list of patterns to match in the content |
---|
|
...
| name | patterns |
---|
type | string array |
---|
required | true |
---|
|
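Because id, tag, and patterns are marked as required, it can be useful to check records before loading them. The following is a minimal sketch of such a check, assuming a record is already parsed into a Python dict. The `validate_entity` function is hypothetical and not part of the toolkit.

```python
REQUIRED_FIELDS = ("id", "tag", "patterns")


def validate_entity(record):
    """Return a list of problems found in one entity record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing required field: {field}")
    patterns = record.get("patterns")
    if patterns is not None:
        # patterns is documented as a string array
        is_string_list = isinstance(patterns, list) and all(
            isinstance(p, str) for p in patterns)
        if not is_string_list:
            problems.append("patterns must be a list of strings")
    return problems
```

A record such as `{"id": "Q28260", "tag": "city", "patterns": ["Lincoln"]}` produces an empty list, while a record with only an `id` reports the two missing fields.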
...
Other, Optional Fields
...
Note |
---|
Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase. |
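The tokenization described in this note can be approximated as follows. This is a rough sketch, not the toolkit's tokenizer: it treats any run of characters matched by the regular expression `\w+` as a token, which is an assumption about what counts as punctuation (for instance, an underscore is kept inside a token here).

```python
import re


def tokenize(pattern):
    """Lowercase the pattern and split it on whitespace and punctuation."""
    return re.findall(r"\w+", pattern.lower())
```

Under this sketch, "Lincoln, Nebraska" yields the tokens `lincoln` and `nebraska`, and "Lincoln, NE" yields `lincoln` and `ne`.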
...
- This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
What to show the user when browsing this entity |
|
...
Parameter |
---|
summary | Free space to add extra data in any format supported by JSON |
---|
name | fields |
---|
type | json |
---|
|
Include Page |
---|
| Generic Resource Fields |
---|
| Generic Resource Fields |
---|
|
...
Dictionary Index
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
This currently happens in RAM inside the DictionaryTagger stage. An off-line option for pre-inverting the dictionary will be provided in the future.
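To illustrate what inverting the dictionary means, the sketch below builds an in-memory index from each token to the IDs of the entities whose patterns contain it, so that a token in the text leads directly to candidate entities instead of requiring a scan of every record. It is a simplified illustration under stated assumptions, not the stage's data structure: the function names, the `\w+` tokenization, and the second sample entity are all hypothetical.

```python
import re
from collections import defaultdict

# Sample records; the second entity and its ID are hypothetical.
ENTITIES = [
    {"id": "Q28260", "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]},
    {"id": "example-person-1", "patterns": ["Abraham Lincoln", "Lincoln"]},
]


def build_token_index(entities):
    """Invert the dictionary: token -> set of entity IDs whose patterns use it."""
    index = defaultdict(set)
    for entity in entities:
        for pattern in entity["patterns"]:
            for token in re.findall(r"\w+", pattern.lower()):
                index[token].add(entity["id"])
    return index


def candidates(index, token):
    """Look up the entity IDs for one token, sorted for stable output."""
    return sorted(index.get(token.lower(), ()))


token_index = build_token_index(ENTITIES)
```

With this index, looking up "nebraska" returns only the city entity, while "lincoln" returns both entities and an unknown token returns an empty list.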