Lemmatizes tokens that match words in a dictionary.

Operates On:  Lexical Items with TOKEN

This stage lemmatizes by dictionary lookup only; it does not apply rules.

Configuration Parameters

  • dictionary (string, optional) - The resource containing the list of words and relationships

    • If no dictionary is provided, a default dictionary will be used.
  • include (list, optional) - A list of the relationships to include

  • exclude (list, optional) - A list of the relationships to exclude

  • skipFlags (string array, optional) - Flags to be skipped by this stage
    • Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.
  • requiredFlags (string array, optional)
    • Tokens must have all the specified flags in order to be processed.
  • debug (boolean, optional)
    • Enables all debug log functionality of the stage, if any.


The default dictionary is only available in English.


Example Configuration
{
  "type": "LemmatizeStage",
  "include" : ["pl", "vf"],
  "exclude" : ["ob"],
  "dictionary" : "lemmatize-provider:lemmatize_words"
}


Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}
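The lookup shown above can be sketched in Python as follows. This is only an illustration, not the stage's implementation: the function names `keep_entry` and `lemmatize` are hypothetical, tokens are split on whitespace for simplicity, and the filter semantics (keep an entry when it has at least one relationship listed in `include` and none listed in `exclude`) are an assumption that is consistent with the example output but is not confirmed by this page.

```python
import json

# Entries taken from the example output above, keyed by the original word.
DICTIONARY = {
    "am": {"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "be"},
    "liking": {"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "like"},
    "projects": {"confidence": 0.012, "rel": ["vf", "wnm", "pl"], "to": "project"},
}


def keep_entry(entry, include=None, exclude=None):
    """Assumed filter: at least one relationship in include, none in exclude."""
    rels = set(entry["rel"])
    if exclude and rels & set(exclude):
        return False
    if include and not rels & set(include):
        return False
    return True


def lemmatize(text, dictionary, include=None, exclude=None):
    """Return (token, entry) pairs for whitespace tokens found in the dictionary."""
    matches = []
    for token in text.split():
        entry = dictionary.get(token)
        if entry is not None and keep_entry(entry, include, exclude):
            matches.append((token, entry))
    return matches


if __name__ == "__main__":
    sentence = "I am liking this projects very much"
    for token, entry in lemmatize(sentence, DICTIONARY, ["pl", "vf"], ["ob"]):
        print(f"{token} - {json.dumps(entry, separators=(',', ':'))}")
```

Tokens without a dictionary entry (such as "I" or "very") are left untouched in this sketch, which mirrors the diagram above where only three tokens receive a lemma.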

Output Flags

Lex-Item Flags:

  • LEMMATIZE - All words retrieved will be marked as LEMMATIZE

Resource Data

The resource data is a JSON file containing an array of word entries in a field named words.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
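As a rough sketch of how such a resource could be read, the Python snippet below parses the JSON and indexes the entries by their `from` word. The helper name `index_words` is hypothetical, and keeping a list of entries per word is an assumption made here in case several entries share the same `from` value; the page does not say how the stage handles that.

```python
import json

# The sample resource from above, embedded as a string for illustration.
RESOURCE = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"],
     "from": "encyclopaedia", "to": "encyclopedia"},
    {"confidence": 0.0752, "rel": ["wnm", "sp"],
     "from": "word", "to": "worth"}
  ]
}
"""


def index_words(resource_text):
    """Parse the resource JSON and map each `from` word to its entries."""
    data = json.loads(resource_text)
    index = {}
    for entry in data["words"]:
        # Assumption: more than one entry may share the same `from` word.
        index.setdefault(entry["from"], []).append(entry)
    return index


index = index_words(RESOURCE)
print(index["encyclopaedia"][0]["to"])
```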

Relationships

The required fields for each entry are:

  • from - Original word to search for
    • this field is removed when the entry is added to the entities of the LexItem
  • to - Resulting word
    • it becomes a new LexItem of its own
  • rel - List of relationships between the original word and the resulting word
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)

Any other field will be included in the entities of the LexItem
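One possible reading of the field rules above, sketched in Python: check that `from`, `to` and `rel` are present, then carry every field except `from` over as the entity data. The function name `to_entity` and the error handling are hypothetical and only illustrate the rules; they do not describe how the stage itself validates entries.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def to_entity(entry):
    """Check the required fields and return (original word, entity data)."""
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"entry is missing required fields: {missing}")
    if not isinstance(entry["rel"], list):
        raise ValueError("rel must be a list of relationship names")
    # Everything except `from` is carried over, including extra fields
    # such as `confidence`.
    entity = {key: value for key, value in entry.items() if key != "from"}
    return entry["from"], entity


word, entity = to_entity({
    "confidence": 0.0049,
    "rel": ["wnm", "sp"],
    "from": "encyclopaedia",
    "to": "encyclopedia",
})
print(word, entity)
```

This matches the shape of the example output earlier on the page, where each matched token is listed with its `confidence`, `rel` and `to` values but without `from`.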

