Lemmatizes tokens that match words in a dictionary.

Operates On:  Lexical Items with TOKEN


This stage lemmatizes by dictionary lookup only; it does not apply rules.

Configuration Parameters

  • dictionary (string, optional) - The resource containing the list of words and relationships

    • If no dictionary is provided, a default dictionary will be used
  • include (list, optional) - A list of the relationships to include

  • exclude (list, optional) - A list of the relationships to exclude

  • skipFlags (string array, optional) - Flags to be skipped by this stage
    • Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.
  • requiredFlags (string array, optional)
    • Tokens need to have all of the specified flags in order to be processed
  • debug (boolean, optional)
    • Enable all debug log functionality of the stage, if any.
  • languageISO3 (string, optional) - The language the lemmatize stage should use. The value must be an ISO three-letter language code.
    • English is used by default unless configured otherwise. At the moment only English (ENG) and Spanish (SPA) are available.


A default dictionary is available for English.

Spanish is supported when the languageISO3 parameter is set to SPA.

Example Configuration
{
  "type": "LemmatizeStage",
  "include" : ["pl", "vf"],
  "exclude" : ["ob"],
  "dictionary" : "lemmatize-provider:lemmatize_words",
  "languageISO3":"SPA"
}
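
The page does not spell out how include and exclude combine, so the following Python sketch is only one plausible reading: an entry is kept when it carries at least one included relationship and none of the excluded ones. The function name passes_filters and the sample entries are hypothetical and are not part of the stage's API.

```python
def passes_filters(rel, include=None, exclude=None):
    """Hypothetical filter; the real stage's semantics are an assumption here."""
    rel_set = set(rel)
    # Assumption: any excluded relationship rejects the entry.
    if exclude and rel_set & set(exclude):
        return False
    # Assumption: when include is given, at least one relationship must match.
    if include and not (rel_set & set(include)):
        return False
    return True


# Sample entries modelled on the examples in this page.
entries = [
    {"from": "projects", "to": "project", "rel": ["vf", "wnm", "pl"]},
    {"from": "encyclopaedia", "to": "encyclopedia", "rel": ["wnm", "sp"]},
]

kept = [e["from"] for e in entries if passes_filters(e["rel"], ["pl", "vf"], ["ob"])]
print(kept)
```

Under this reading, the values from the example configuration keep "projects" (it has pl and vf) and drop "encyclopaedia" (it has neither).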


Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}

Output Flags

Lex-Item Flags:

  • LEMMATIZE - All words retrieved are marked as LEMMATIZE

Resource Data

The resource data is a JSON file with an array of entries in a field named words. This format applies when the 'dictionary' parameter is used.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
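
As an illustration of how a resource in this shape could be consumed, the Python sketch below loads the JSON and indexes the entries by their from field. The function name build_index and the in-memory lookup are assumptions made for the example; the page does not describe how the stage stores the dictionary internally.

```python
import json
from collections import defaultdict

# Resource content copied from the sample above.
RESOURCE_JSON = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"],
     "from": "encyclopaedia", "to": "encyclopedia"},
    {"confidence": 0.0752, "rel": ["wnm", "sp"],
     "from": "word", "to": "worth"}
  ]
}
"""


def build_index(resource_text):
    """Hypothetical helper: map each 'from' word to its list of entries."""
    data = json.loads(resource_text)
    index = defaultdict(list)
    for entry in data["words"]:
        index[entry["from"]].append(entry)
    return dict(index)


index = build_index(RESOURCE_JSON)
print(index["encyclopaedia"][0]["to"])
```

A list per key is used because the same word can map to several targets, as the Wiktionary sample on this page shows.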


When the 'dictionary' parameter is not used, an embedded Wiktionary file is used. This file contains one JSON entry per line:

Wiktionary file format
{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}

Relationships

The required fields for each entry are:

  • from - Original word to search for
    • This field is removed when the entry is added to the entities of the LexItem
  • to - Resulting word
    • It becomes a new LexItem of its own
  • rel - List of relationships between the original word and the resulting word
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)

Any other field will be included in the entities of the LexItem
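
The field rules above can be sketched in Python as follows: check that from, to and rel are present, then drop from and keep everything else as the entity data. This mirrors the Example Output, where the entity for "am" holds confidence, rel and to. The function name to_entity is hypothetical, and raising an error on a missing field is an assumption, since the page does not say how the stage handles invalid entries.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def to_entity(entry):
    """Hypothetical helper: validate an entry and build its entity data."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        # Assumption: invalid entries are rejected; the real behavior is undocumented.
        raise ValueError(f"missing required fields: {missing}")
    # 'from' is dropped; every other field is carried into the entity.
    return {key: value for key, value in entry.items() if key != "from"}


entity = to_entity(
    {"confidence": 0.0084, "rel": ["vf", "wnm"], "from": "am", "to": "be"}
)
print(entity)
```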
