Looks up sequences of tokens in a dictionary and tags each matching sequence with one or more semantic tags as alternative representations. Typically these tags represent entities such as {person}, {place}, {company}, etc.
Note that all possibilities are tagged, including overlaps and sub-patterns, with the expectation that later disambiguation stages will choose which tags are the correct interpretation.
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
TO DO: CONFIGURATION TO ALLOW FOR A SIMPLE LIST OF PHRASES WITH A SINGLE SEMANTIC TAG FOR EVERYTHING (NO ENTITY IDs)
{
  "type":"DictionaryTagger",
  "dictionary":"dict-provider:people-lowercase",
  "required":["TOKEN", "ALL_LOWER_CASE"]
}
Note that the "people-lowercase" resource must be in the format specified below.
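As a rough illustration of the "required" setting, the sketch below assumes it lists flags that a lexical item must carry before the tagger will consider it. The `LexItem` class and `eligible` function are hypothetical names invented for this example and are not part of the tagger's API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class LexItem:
    """Hypothetical stand-in for a lexical item: its text plus a set of flags."""
    text: str
    flags: FrozenSet[str] = field(default_factory=frozenset)


def eligible(items: Iterable[LexItem], required: Iterable[str]) -> List[LexItem]:
    """Keep only the items that carry every required flag (assumed semantics)."""
    needed = frozenset(required)
    return [item for item in items if needed <= item.flags]


items = [
    LexItem("Abraham", frozenset({"TOKEN"})),
    LexItem("abraham", frozenset({"TOKEN", "ALL_LOWER_CASE"})),
]

# Mirrors "required":["TOKEN", "ALL_LOWER_CASE"] from the configuration above.
for item in eligible(items, ["TOKEN", "ALL_LOWER_CASE"]):
    print(item.text)
```

Under this assumption, only the lower-cased item would be looked up in the "people-lowercase" dictionary.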
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^
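The tagging behavior in this example can be sketched in a few lines of Python. This is a minimal illustration, not the tagger's actual implementation: it assumes the dictionary is a plain map from token sequences to tags, and the names `DICTIONARY` and `tag_all` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical in-memory dictionary: token sequence -> semantic tags.
DICTIONARY: Dict[Tuple[str, ...], List[str]] = {
    ("abraham", "lincoln"): ["{person}"],
    ("lincoln",): ["{place}"],
    ("macaroni",): ["{food}"],
    ("cheese",): ["{food}"],
    ("macaroni", "and", "cheese"): ["{food}"],
}


def tag_all(
    tokens: Sequence[str],
    dictionary: Dict[Tuple[str, ...], List[str]],
    max_len: Optional[int] = None,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for every dictionary match, overlaps included."""
    if max_len is None:
        # Longest pattern in the dictionary bounds the span length to try.
        max_len = max((len(pattern) for pattern in dictionary), default=0)
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            for tag in dictionary.get(tuple(tokens[start:end]), []):
                matches.append((start, end, tag))
    return matches


tokens = "abraham lincoln likes macaroni and cheese".split()
for start, end, tag in tag_all(tokens, DICTIONARY):
    print(tag, " ".join(tokens[start:end]))
```

Every match is kept, so "lincoln" is tagged as a place even though it also falls inside the "abraham lincoln" person match. Choosing between the two is left to a later disambiguation stage.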
The dictionary tagger must have an "entity dictionary" (a string-to-JSON map), which is a series of JSON records, typically indexed by entity ID. In addition, there may also be a pattern map and a token index.
The entity dictionary is the only file that is absolutely required.
Each JSON record represents an entity. The format is as follows:
{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[ "Lincoln", "Lincoln, Nebraska", "Lincoln, NE" ],
  "confidence":0.95
  . . . additional fields as needed go here . . .
}
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
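One way to picture the inversion is as a map from each pattern back to the IDs of the entities that list it. The sketch below is an assumption about the general idea only, not the tagger's real index format. The `ENTITIES` map and the `invert` function are hypothetical, and the record is the example shown above without its elided fields.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical entity dictionary: entity ID -> JSON-like record.
ENTITIES: Dict[str, dict] = {
    "Q28260": {
        "id": "Q28260",
        "tags": ["{city}", "{administrative-area}", "{geography}"],
        "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
        "confidence": 0.95,
    },
}


def invert(entities: Dict[str, dict]) -> Dict[str, List[str]]:
    """Build a pattern -> entity IDs map so lookups start from the text."""
    pattern_map = defaultdict(list)
    for entity_id, record in entities.items():
        for pattern in record["patterns"]:
            pattern_map[pattern].append(entity_id)
    return dict(pattern_map)


pattern_map = invert(ENTITIES)
print(pattern_map["Lincoln, NE"])
```

With the inverted map, finding the entities for a matched phrase is a single lookup instead of a scan over every record.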