Looks up sequences of tokens in a dictionary and tags each matching sequence with one or more semantic tags as alternative representations. Typically these tags represent entities such as {person}, {place}, {company}, etc.

Note that all possibilities are tagged, including overlaps and sub-patterns, with the expectation that later disambiguation stages will choose which tags are the correct interpretation.

Operates On: Lexical Items with the TOKEN flag and possibly other flags as specified below.

Configuration

  • dictionary (string, required) - The dictionary resource which holds the names to be located in the text.
    • This is specified as "provider:name" in the standard resource format (INSERT LINK HERE).
  • required (array of strings, optional) - Only process tokens which have the specified flags.
    • A JSON array of strings, such as ["TOKEN", "ALL_LOWER_CASE"]
Example Configuration
{
 "type":"DictionaryTagger",
 "dictionary":"dict-provider:people-lowercase",
 "required":["TOKEN", "ALL_LOWER_CASE"]
}

Note that the "people-lowercase" resource must be in the format specified below.
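To illustrate the "required" setting, here is a minimal Python sketch of flag filtering. It is not the toolkit's implementation: the function name has_required_flags and the sample lexical items are hypothetical, and it assumes that a token must carry every listed flag (rather than any one of them) to be processed.

```python
from typing import Iterable, List, Set, Tuple


def has_required_flags(item_flags: Set[str], required: Iterable[str]) -> bool:
    """Assumption: an item qualifies only if it carries ALL required flags."""
    return set(required).issubset(item_flags)


# The "required" value from the example configuration.
REQUIRED: List[str] = ["TOKEN", "ALL_LOWER_CASE"]

# Hypothetical lexical items as (text, flags) pairs.
LEX_ITEMS: List[Tuple[str, Set[str]]] = [
    ("abraham", {"TOKEN", "ALL_LOWER_CASE"}),
    ("Lincoln", {"TOKEN"}),
]

# Only items carrying every required flag are passed on for lookup.
processed = [text for text, flags in LEX_ITEMS if has_required_flags(flags, REQUIRED)]
print(processed)
```

Under this assumption, an empty "required" list lets every item through.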

Example Output

In the following example, "abraham lincoln" is in the dictionary as a person, "lincoln" as a place, and "macaroni", "cheese" and "macaroni and cheese" are all specified as foods:


V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^
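The tagging shown in the diagram can be sketched in Python as follows. This is an illustration only, not the tagger's actual algorithm: the in-memory DICTIONARY and the function tag_all are hypothetical, and the sketch simply tries every token span up to the longest dictionary entry so that overlaps and sub-patterns are all reported.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory dictionary: token sequence -> semantic tags.
DICTIONARY: Dict[Tuple[str, ...], List[str]] = {
    ("abraham", "lincoln"): ["{person}"],
    ("lincoln",): ["{place}"],
    ("macaroni",): ["{food}"],
    ("cheese",): ["{food}"],
    ("macaroni", "and", "cheese"): ["{food}"],
}


def tag_all(
    tokens: List[str], dictionary: Dict[Tuple[str, ...], List[str]]
) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for every match; end is exclusive."""
    max_len = max((len(key) for key in dictionary), default=0)
    matches: List[Tuple[int, int, str]] = []
    for start in range(len(tokens)):
        # Try every span beginning at `start`, up to the longest entry.
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            for tag in dictionary.get(tuple(tokens[start:end]), []):
                matches.append((start, end, tag))
    return matches


tokens = "abraham lincoln likes macaroni and cheese".split()
for start, end, tag in tag_all(tokens, DICTIONARY):
    print(tag, " ".join(tokens[start:end]))
```

No match suppresses another: both "macaroni" and "macaroni and cheese" are reported, leaving the choice to a later disambiguation stage.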

Output Flags

Lex-Item Flags:

  • SEMANTIC_TAG - Identifies all lexical items which are semantic tags.

Resource Data

The dictionary tagger requires an "entity dictionary": a string-to-JSON map holding JSON records indexed by entity ID. A pattern map and a token index may also be present.

Entity Dictionary Format

The only file which is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Each JSON record represents an entity. The format is as follows:

Entity JSON Format
{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  
  . . . additional fields as needed go here . . . 
}
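As a rough illustration of how such a record might be consumed, the Python sketch below parses the entity above and builds a lookup from each pattern string to the IDs of the entities that list it. The function build_pattern_lookup is hypothetical, and this lookup is not the toolkit's own pattern map, whose format is not described here.

```python
import json
from collections import defaultdict
from typing import Dict, List

# The example entity record from above, minus the elided additional fields.
ENTITY_JSON = """
{
  "id": "Q28260",
  "tags": ["{city}", "{administrative-area}", "{geography}"],
  "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
  "confidence": 0.95
}
"""


def build_pattern_lookup(records: List[dict]) -> Dict[str, List[str]]:
    """Map each pattern string to the IDs of the entities that list it."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for pattern in record["patterns"]:
            lookup[pattern].append(record["id"])
    return dict(lookup)


entity = json.loads(ENTITY_JSON)
lookup = build_pattern_lookup([entity])
print(lookup["Lincoln, NE"])
```

Each pattern maps to a list of IDs because several entities can share the same pattern.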

Fields

  • id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries) regardless of the type.
    • Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
  • tags (required, array of string) - 
  • patterns (required, array of string) - 
  • confidence (optional, float) - 

Other, Optional Fields

  • display (optional, string) - 
  • context (optional, object) - 
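The field types listed above can be checked with a short Python sketch such as the following. The function validate_entity is hypothetical and covers only the declared types (required id, tags and patterns; optional confidence, display and context); it makes no assumption about what the fields mean.

```python
from typing import List


def is_string_array(value: object) -> bool:
    """True if value is a list whose elements are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_entity(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems: List[str] = []
    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")
    for field in ("tags", "patterns"):
        if not is_string_array(record.get(field)):
            problems.append(f"{field} must be an array of strings")
    confidence = record.get("confidence")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if confidence is not None and (
        isinstance(confidence, bool) or not isinstance(confidence, (int, float))
    ):
        problems.append("confidence must be a number")
    if "display" in record and not isinstance(record["display"], str):
        problems.append("display must be a string")
    if "context" in record and not isinstance(record["context"], dict):
        problems.append("context must be an object")
    return problems


good = {"id": "Q28260", "tags": ["{city}"], "patterns": ["Lincoln"], "confidence": 0.95}
print(validate_entity(good))
print(validate_entity({"id": "Q1"}))
```

Checking that IDs are unique across all dictionaries would need a separate pass over every record.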

Pattern Map

Token Index


