Lemmatize tokens matched to words in a dictionary.


Operates On:  Lexical Items with TOKEN

Note

This stage does not apply lemmatization rules; it only looks tokens up in a dictionary.

Generic Configuration Parameters

Configuration Parameters

  • dictionary (string, optional) - The resource containing the list of words and relationships

    • If no dictionary is provided, a default dictionary will be used.
  • include (list, optional) - A list of the relationships to include

  • exclude (list, optional) - A list of the relationships to exclude

  • languageISO3 (string, optional) - The language the lemmatize stage should use. The value must be a three-letter ISO language code.
    • English is used by default. Currently only English (ENG) and Spanish (SPA) are available.
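
The page does not spell out how include and exclude combine, so the following Python sketch is only one plausible reading: an entry is rejected if it carries any excluded relationship, and, when include is set, it must carry at least one included relationship. The function name keep_entry is hypothetical and is not part of the stage.

```python
from typing import Iterable, List, Optional


def keep_entry(rels: Iterable[str],
               include: Optional[List[str]] = None,
               exclude: Optional[List[str]] = None) -> bool:
    """Return True if an entry with these relationships passes the filters.

    Assumed semantics (not confirmed by this page):
    - any relationship listed in `exclude` rejects the entry;
    - when `include` is given, at least one listed relationship is required.
    """
    rel_set = set(rels)
    if exclude and rel_set & set(exclude):
        return False
    if include and not (rel_set & set(include)):
        return False
    return True


# Filters taken from the example configuration on this page.
print(keep_entry(["vf", "wnm"], include=["pl", "vf"], exclude=["ob"]))
print(keep_entry(["ob", "vf"], include=["pl", "vf"], exclude=["ob"]))
```

Under this reading, an entry tagged both vf and ob is dropped because exclude takes precedence; the actual stage may resolve that conflict differently.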


Note

The default dictionary is available in English.

Spanish is supported when the languageISO3 parameter is set to SPA.

Example Configuration

{
  "type": "LemmatizeStage",
  "include": ["pl", "vf"],
  "exclude": ["ob"],
  "dictionary": "lemmatize-provider:lemmatize_words",
  "languageISO3": "SPA"
}


Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}
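
As a rough illustration of the lookup shown above, the following Python sketch matches each token against a dictionary keyed by the original word and reports the resulting lemma. The INDEX structure and the lemmatize function are hypothetical, exact-match lookup is an assumption, and the entries simply mirror the example output.

```python
# Hypothetical in-memory dictionary keyed by the original word ("from").
INDEX = {
    "am": [{"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "be"}],
    "liking": [{"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "like"}],
    "projects": [{"confidence": 0.012, "rel": ["vf", "wnm", "pl"], "to": "project"}],
}


def lemmatize(tokens, index):
    """Return (token, lemma) pairs for every token found in the index.

    Tokens without a dictionary entry produce no pair, which matches the
    note that no rules are applied as a fallback.
    """
    return [(token, entry["to"])
            for token in tokens
            for entry in index.get(token, [])]


tokens = "I am liking this projects very much".split()
for token, lemma in lemmatize(tokens, INDEX):
    print(f"{token} -> {lemma}")
```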

Output Flags

Lex-Item Flags:

  • LEMMATIZE - All words retrieved are marked as LEMMATIZE

Resource Data

When the 'dictionary' parameter is used, the resource data is a JSON file containing an array of word entries in a field named words.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
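
To make the resource layout concrete, here is a minimal Python sketch that parses a document of this shape and groups the entries by their from field. The function name index_words is hypothetical, and the stage may load and index its dictionary quite differently.

```python
import json
from collections import defaultdict

# Sample resource text in the format shown above (shortened to one entry).
RESOURCE_TEXT = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"],
     "from": "encyclopaedia", "to": "encyclopedia"}
  ]
}
"""


def index_words(resource_text):
    """Parse the resource and group its entries by the original word."""
    data = json.loads(resource_text)
    index = defaultdict(list)
    for entry in data["words"]:
        index[entry["from"]].append(entry)
    return dict(index)


index = index_words(RESOURCE_TEXT)
print(index["encyclopaedia"][0]["to"])
```

Grouping into lists rather than single values allows one original word to map to several results.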


When the 'dictionary' parameter is not used, an embedded Wiktionary file is used. This file contains one JSON entry per line:

Wiktionary file format

{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
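
Because each line is a separate JSON document, a file in this format has to be parsed line by line rather than with a single JSON parse. The Python sketch below shows one way to do that; parse_json_lines is a hypothetical helper, and skipping blank lines is an assumption.

```python
import json

# Three lines copied from the sample above.
SAMPLE = (
    '{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}\n'
    '{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}\n'
    '{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}\n'
)


def parse_json_lines(text):
    """Parse one JSON entry per line, ignoring blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            entries.append(json.loads(line))
    return entries


entries = parse_json_lines(SAMPLE)
# The same original word may appear on several lines, once per result.
print([e["to"] for e in entries if e["from"] == "alemán"])
```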

Relationships

The required fields for each entry are:

  • from - Original word to search for
    • This field is removed once the entry is added to the entities of the LexItem.
  • to - Resulting word
    • It becomes a new LexItem of its own.
  • rel - List of relationships between the original word and the resulting word
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)

Tip

Any other field in an entry is included in the entities of the LexItem.
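
The field rules above can be sketched in Python as follows: check that the required fields are present, drop from, and carry every remaining field over. The names REQUIRED_FIELDS and to_entity are hypothetical, and the real stage's validation behavior is not documented here.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def to_entity(entry):
    """Build the entity data for a dictionary entry.

    Raises ValueError if a required field is missing (an assumed behavior).
    The "from" field is removed; every other field, including extra ones
    such as "confidence", is kept.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {key: value for key, value in entry.items() if key != "from"}


entry = {"confidence": 0.012, "rel": ["vf", "wnm", "pl"],
         "from": "projects", "to": "project"}
print(to_entity(entry))
```

The result for the projects entry has the same fields as the corresponding line in the example output, which likewise carries no from field.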