Looks up sequences of tokens in a dictionary and tags each matching sequence with one or more semantic tags as alternative representations. Typically these tags represent entities such as {person}, {place}, {company}, etc.
Note that all possibilities are tagged, including overlaps and sub-patterns, with the expectation that later disambiguation stages will choose which tags are the correct interpretation.
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
Configuration
- dictionary (string, required) - The dictionary resource which holds the names to be located in the text.
- This is specified as "provider:name" in the standard resource format (INSERT LINK HERE).
- requiredFlags (string, optional) - Only process the tokens with the specified flags.
- A JSON array of strings, such as ["TOKEN", "ALL_LOWER_CASE"]
{
"type":"DictionaryTagger",
"dictionary":"dict-provider:people-lowercase",
"requiredFlags":["TOKEN", "ALL_LOWER_CASE"]
}
Note that the "people-lowercase" resource must be in the format specified below.
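As a rough illustration of how these settings might be consumed, the Python sketch below parses the configuration shown above, splits the "provider:name" reference, and filters lexical items by flag. The helper names, the TITLE_CASE flag, and the rule that an item must carry every listed flag are assumptions made for the example, not part of the toolkit's documented API.

```python
import json
from typing import Iterable, List, Tuple

CONFIG_JSON = """
{
  "type": "DictionaryTagger",
  "dictionary": "dict-provider:people-lowercase",
  "requiredFlags": ["TOKEN", "ALL_LOWER_CASE"]
}
"""


def split_resource(reference: str) -> Tuple[str, str]:
    """Split a "provider:name" reference into its two parts."""
    provider, _, name = reference.partition(":")
    return provider, name


def has_required_flags(item_flags: Iterable[str], required: List[str]) -> bool:
    # Assumption: an item qualifies only if it carries every required flag.
    return set(required).issubset(set(item_flags))


config = json.loads(CONFIG_JSON)
provider, name = split_resource(config["dictionary"])
required = config.get("requiredFlags", [])  # requiredFlags is optional

print(provider, name)
# TITLE_CASE is a hypothetical flag used only to show a non-matching item.
print(has_required_flags({"TOKEN", "ALL_LOWER_CASE"}, required))
print(has_required_flags({"TOKEN", "TITLE_CASE"}, required))
```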
Example Output
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^
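The behavior in this example, where every match is tagged including overlaps and sub-patterns, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the in-memory DICTIONARY mapping and the tag_all function are hypothetical, and the real tagger works on the interpretation graph rather than a flat token list.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory dictionary: token sequence -> semantic tags.
DICTIONARY: Dict[Tuple[str, ...], List[str]] = {
    ("abraham", "lincoln"): ["{person}"],
    ("lincoln",): ["{place}"],
    ("macaroni",): ["{food}"],
    ("cheese",): ["{food}"],
    ("macaroni", "and", "cheese"): ["{food}"],
}


def tag_all(tokens: List[str],
            dictionary: Dict[Tuple[str, ...], List[str]] = DICTIONARY
            ) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for every dictionary match; end is exclusive."""
    max_len = max(len(key) for key in dictionary)
    matches = []
    for start in range(len(tokens)):
        # Try every span beginning at `start`, up to the longest entry.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            for tag in dictionary.get(tuple(tokens[start:end]), []):
                matches.append((start, end, tag))
    return matches


tokens = "abraham lincoln likes macaroni and cheese".split()
for start, end, tag in tag_all(tokens):
    print(tag, tokens[start:end])
```

No match suppresses another here: "lincoln" is reported as a {place} even though it sits inside the {person} span, leaving the choice to later disambiguation stages.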
Output Flags
Lex-Item Flags:
- SEMANTIC_TAG - Identifies all lexical items which are semantic tags.
Resource Data
The dictionary tagger requires an "entity dictionary" (a string to JSON map): a series of JSON records, typically indexed by entity ID. This is the only file which is absolutely required. In addition, there may also be a pattern map and a token index.
Each JSON record represents an entity. The format is as follows:
{
"id":"Q28260",
"tags":["{city}", "{administrative-area}", "{geography}"],
"patterns":[
"Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
],
"confidence":0.95
. . . additional fields as needed go here . . .
}
Notes
- Multiple entities can have the same pattern.
- If the pattern is matched, then it will be tagged with multiple (ambiguous) entity IDs.
- Additional fielded data can be added to the record
- As needed by downstream processes.
Fields
- id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries) regardless of the type.
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
- tags (required, array of string) - The list of semantic tags which will be added to the interpretation graph whenever any of the patterns are matched.
- These will all be flagged as SEMANTIC_TAG.
- patterns (required, array of string) - A list of patterns to match in the content.
- Patterns will be tokenized and there may be multiple variations which can match.
- (details TBD)
- confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
- This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
Other, Optional Fields
- display (optional, string) - What to show the user when browsing this entity.
- context (optional, object) - A context vector which can help disambiguate this entity from others with the same pattern.
- Format TBD, but probably a list of weighted words, phrases and tags.
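The field requirements listed above could be checked with a small validator such as the Python sketch below. The function name and error messages are assumptions for illustration; the sketch only checks the types stated in this section and does not attempt to validate the fields whose format is still TBD.

```python
import json
from typing import Any, Dict, List


def validate_entity(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one entity record (empty if none)."""
    problems = []
    # id is required and must be a string.
    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")
    # tags and patterns are required arrays of strings.
    for field in ("tags", "patterns"):
        value = record.get(field)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{field} must be an array of strings")
    # confidence is optional, but must be numeric when present.
    confidence = record.get("confidence")
    if confidence is not None and (
        isinstance(confidence, bool) or not isinstance(confidence, (int, float))
    ):
        problems.append("confidence must be a number")
    return problems


record = json.loads("""
{
  "id": "Q28260",
  "tags": ["{city}", "{administrative-area}", "{geography}"],
  "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
  "confidence": 0.95
}
""")
print(validate_entity(record))
print(validate_entity({"id": "Q28260"}))
```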
Dictionary Index
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
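One way to picture the inversion is the Python sketch below, which maps each tokenized pattern to the IDs of the entities that list it. This is only an assumed shape for the index: the actual index format is not described here, the tokenize function (lowercased word tokens) stands in for the unspecified pattern tokenization, and the second entity record is invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tokenize(pattern: str) -> Tuple[str, ...]:
    # Assumption: lowercase word tokens; the real tokenizer is not specified.
    return tuple(re.findall(r"\w+", pattern.lower()))


def build_index(entities: Iterable[dict]) -> Dict[Tuple[str, ...], List[str]]:
    """Invert entity records into a map of token sequence -> entity IDs."""
    index = defaultdict(list)
    for entity in entities:
        for pattern in entity["patterns"]:
            key = tokenize(pattern)
            if entity["id"] not in index[key]:
                index[key].append(entity["id"])
    return dict(index)


entities = [
    {"id": "Q28260", "tags": ["{city}"],
     "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]},
    # Hypothetical second entity sharing the pattern "Lincoln".
    {"id": "example-person-1", "tags": ["{person}"],
     "patterns": ["Abraham Lincoln", "Lincoln"]},
]

index = build_index(entities)
print(index[("lincoln",)])
```

Because multiple entities can share a pattern, each key holds a list of IDs, so a lookup on "lincoln" returns both entities as ambiguous candidates.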