...
...
...
Operates On: Lexical Items with TOKEN and possibly other flags as specified below.
...
...
...
```json
{
  "type":"DictionaryTagger",
  "dictionary":"dict-provider:people-lowercase"
}
```
Note that the "people-lowercase" resource must be in the format specified below.
In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:
```
V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^--[{place}]--^           ^---[{food}]---^         ^--[{food}]--^
^---------[{person}]--------^           ^----------------[{food}]-------------^
```
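The overlapping matches in the diagram above can be reproduced with a small Python sketch. This is only an illustration of the idea, not the DictionaryTagger implementation: the `tag_spans` function, the in-memory `DICTIONARY` mapping, and the whitespace tokenization are all assumptions made for the example.

```python
# Hypothetical in-memory dictionary: lowercase pattern -> tag.
DICTIONARY = {
    "abraham lincoln": "{person}",
    "lincoln": "{place}",
    "macaroni": "{food}",
    "cheese": "{food}",
    "macaroni and cheese": "{food}",
}


def tag_spans(text, dictionary=DICTIONARY):
    """Return (start, end, tag) for every token span found in the dictionary.

    start and end are token offsets (end is exclusive). Overlapping matches
    are all kept, mirroring the alternative paths in the diagram.
    """
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    max_len = max(len(pattern.split()) for pattern in dictionary)
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end])
            if phrase in dictionary:
                matches.append((start, end, dictionary[phrase]))
    return matches


matches = tag_spans("abraham lincoln likes macaroni and cheese")
for start, end, tag in matches:
    print(start, end, tag)
```

Note that "lincoln" is reported both on its own as a place and as part of the "abraham lincoln" person match. The sketch keeps every candidate and leaves the choice between them to whatever runs afterwards.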
...
...
The dictionary tagger requires an "entity dictionary" (a string-to-JSON map): a list of JSON records indexed by entity ID. A pattern map and a token index may also be present.
The only file that is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.
Each JSON record represents an entity. The format is as follows:
```
{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  . . . additional fields as needed go here . . .
}
```
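As a rough illustration of how such records might be read into a map keyed by entity ID, here is a minimal Python sketch. It assumes, purely for the example, that the records arrive one JSON object per line and that `id`, `tags` and `patterns` are mandatory while `confidence` is optional. The `load_entity_dictionary` helper and the second record are hypothetical.

```python
import json

# Assumption: one JSON record per line. The second record is made up.
RAW_RECORDS = """\
{"id": "Q28260", "tags": ["{city}", "{administrative-area}", "{geography}"], "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"], "confidence": 0.95}
{"id": "example-person", "tags": ["{person}"], "patterns": ["Abraham Lincoln"]}
"""

# Assumption: these fields are treated as required by this sketch only.
REQUIRED_FIELDS = ("id", "tags", "patterns")


def load_entity_dictionary(raw):
    """Parse line-delimited JSON records into a dict keyed by entity ID."""
    entities = {}
    for line_no, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record on line {line_no} is missing {missing}")
        entities[record["id"]] = record
    return entities


entities = load_entity_dictionary(RAW_RECORDS)
print(entities["Q28260"]["patterns"])
```

Any additional fields in a record are carried along untouched, which matches the "additional fields as needed" note in the format above.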
...
...
...
To improve performance, especially for very large databases of entities, the entity dictionary is inverted and indexed.
This currently happens in RAM inside the DictionaryTagger stage. An off-line option for pre-inverting the dictionary will be provided in the future.
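The page does not spell out how the inversion works, so the following Python sketch shows just one plausible reading: a pattern map from each pattern to the entity IDs that list it, and a token index from each token to the patterns that contain it. The `invert` function, the lowercasing, the comma handling, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical entity dictionary keyed by entity ID (second record is made up).
ENTITIES = {
    "Q28260": {
        "id": "Q28260",
        "tags": ["{city}"],
        "patterns": ["Lincoln", "Lincoln, Nebraska"],
    },
    "example-person": {
        "id": "example-person",
        "tags": ["{person}"],
        "patterns": ["Abraham Lincoln"],
    },
}


def invert(entities):
    """Build (pattern_map, token_index) from an entity dictionary.

    pattern_map: lowercase pattern -> set of entity IDs listing that pattern
    token_index: lowercase token   -> set of patterns containing that token
    """
    pattern_map = defaultdict(set)
    token_index = defaultdict(set)
    for entity_id, record in entities.items():
        for pattern in record["patterns"]:
            key = pattern.lower()  # assumption: case-insensitive matching
            pattern_map[key].add(entity_id)
            # Assumption: commas are separators, not tokens.
            for token in key.replace(",", " ").split():
                token_index[token].add(key)
    return pattern_map, token_index


pattern_map, token_index = invert(ENTITIES)
print(sorted(token_index["lincoln"]))
print(sorted(pattern_map["lincoln"]))
```

With an index like this, a tagger would only need to consider patterns that share at least one token with the input, rather than scanning every entity record.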