Lemmatize Stage

Lemmatize tokens matched to words in a dictionary.

Operates On: Lexical Items with TOKEN

This stage lemmatizes by dictionary lookup only; it does not apply lemmatization rules.

Configuration Parameters

dictionary (string, optional) - The resource containing the list of words and relationships
- If no dictionary is provided, a default dictionary is used.
include (list, optional) - A list of the relationships to include
exclude (list, optional) - A list of the relationships to exclude
skipFlags (string array, optional) - Flags to be skipped by this stage
- Tokens marked with these flags are ignored by this stage, and no processing is performed on them.
requiredFlags (string array, optional)
- Tokens must have all the specified flags to be processed.
debug (boolean, optional)
- Enable all debug log functionality of the stage, if any.
languageISO3 (string, optional) - The language the lemmatize stage should use. The value must be a three-letter ISO language code.
- English is used by default. At the moment only English (ENG) and Spanish (SPA) are available.

Default dictionary available in English

Spanish is supported when parameter languageISO3 is properly configured

Example Configuration

{
  "type": "LemmatizeStage",
  "include" : ["pl", "vf"],
  "exclude" : ["ob"],
  "dictionary" : "lemmatize-provider:lemmatize_words",
  "languageISO3":"SPA"
}
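
The page does not spell out how include and exclude interact. The Python sketch below shows one plausible reading: an entry is kept when it carries at least one included relationship and none of the excluded ones. The function name keep_entry and the rule that exclude takes precedence are assumptions made for illustration, not the stage's documented behavior.

```python
def keep_entry(rel, include=None, exclude=None):
    """Return True if an entry with relationships `rel` passes the filters.

    Assumption: `include` keeps entries sharing at least one relationship,
    `exclude` drops entries sharing any relationship, and exclude wins.
    """
    rel = set(rel)
    if exclude and rel & set(exclude):
        return False
    if include and not rel & set(include):
        return False
    return True


# Filters taken from the example configuration above.
include, exclude = ["pl", "vf"], ["ob"]
print(keep_entry(["vf", "wnm"], include, exclude))  # True
print(keep_entry(["wnm", "sp"], include, exclude))  # False
print(keep_entry(["pl", "ob"], include, exclude))   # False
```

With no filters configured, this sketch keeps every entry.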

Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}

Output Flags

Lex-Item Flags:

LEMMATIZE - All words retrieved are marked as LEMMATIZE

Resource Data

The resource data is a JSON file with an array of words in a field named words. This format applies when the 'dictionary' parameter is used.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
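
As a rough illustration of how a resource in this format could be consumed, the Python sketch below parses the JSON and groups the entries by their from word so that a token can be looked up directly. The helper name build_index and the exact-match lookup are assumptions; the stage's real matching logic is not described on this page.

```python
import json
from collections import defaultdict

# A one-entry resource in the format shown above.
RESOURCE = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"],
     "from": "encyclopaedia", "to": "encyclopedia"}
  ]
}
"""


def build_index(resource_text):
    """Group dictionary entries by their `from` word (hypothetical helper)."""
    index = defaultdict(list)
    for entry in json.loads(resource_text)["words"]:
        index[entry["from"]].append(entry)
    return index


index = build_index(RESOURCE)
print(index["encyclopaedia"][0]["to"])  # encyclopedia
```

A list per word is used because the same from value can map to several to values.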

When the 'dictionary' parameter is not used, an embedded Wiktionary file is used. This file is formatted as one JSON entry per line:

Wiktionary file format

{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
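
A file laid out this way can be read line by line, parsing each line as its own JSON object. The Python sketch below does that for three of the sample lines above; the function name read_entries is hypothetical, and the embedded file itself is not accessed here.

```python
import json

# Three of the sample lines above, one JSON object per line.
LINES = '''{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
'''


def read_entries(text):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


entries = read_entries(LINES)
targets = [e["to"] for e in entries if e["from"] == "alemán"]
print(targets)  # ['germano', 'tedesco']
```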

Relationships

The required fields for each entry are:

from - Original word to search for
- this field will be eliminated once added to the entities of the LexItem
to - Resulting word
- it will be a new LexItem on its own
rel - List of relationships between the original word and the resulting word
- List of relationships in the default dictionary:
  - pl - pluralization
  - vf - verb form
  - ob - obsolete
  - syn - synonym
  - alt - alternative
  - wwm - word with meaning (more than one)
  - wnm - word no meaning (no additional meaning)

Any other field will be included in the entities of the LexItem
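
The field handling described above can be sketched in Python as follows: from is dropped, to names the new item, and every remaining field is carried along as metadata. The function name to_lemma is hypothetical, and the sample entry is reconstructed from the "am" line of the example output with its from field restored; this is an illustration of the described shape, not the stage's implementation.

```python
def to_lemma(entry):
    """Split an entry into the resulting word and the metadata kept with it.

    Assumption: `from` is dropped and every other field is carried over,
    which matches the shape of the example output on this page.
    """
    metadata = {k: v for k, v in entry.items() if k != "from"}
    return entry["to"], metadata


entry = {"confidence": 0.0084, "rel": ["vf", "wnm"], "from": "am", "to": "be"}
word, metadata = to_lemma(entry)
print(word)      # be
print(metadata)  # {'confidence': 0.0084, 'rel': ['vf', 'wnm'], 'to': 'be'}
```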
