Lemmatize Stage

Lemmatize tokens matched to words in a dictionary.

Operates On: Lexical Items with TOKEN

This lemmatization is dictionary-based; it does not use rules.

Generic Configuration Parameters

boundaryFlags ( type=string array | optional ) - List of vertex flags that indicate the beginning and end of a text block.
Tokens to process must be inside two vertices marked with this flag (e.g. ["TEXT_BLOCK_SPLIT"]).
skipFlags ( type=string array | optional ) - Flags to be skipped by this stage.
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
requiredFlags ( type=string array | optional ) - Lex item flags required by every token to be processed.
Tokens need to have all of the specified flags in order to be processed.
atLeastOneFlag ( type=string array | optional ) - Lex item flags needed by every token to be processed.
Tokens need to have at least one of the flags specified in this array.
confidenceAdjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
- 0.0 to < 1.0 decreases confidence value
- 1.0 confidence value remains the same
- > 1.0 to 2.0 increases confidence value
debug ( type=boolean | default=false | optional ) - Enable all debug log functionality for the stage, if any.
enable ( type=boolean | default=true | optional ) - Indicates whether the current stage should be considered by the Pipeline Manager.
- Only applies for automatic pipeline building
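
The three flag filters above can be read as a single eligibility check per token. The following is a minimal sketch of that reading, not the stage's actual implementation: the function name is_eligible, the flag names STOP and ALPHA, the order of the checks, and the assumption that any one skip flag is enough to ignore a token are all illustrative.

```python
from typing import Iterable, Optional, Set


def is_eligible(
    token_flags: Set[str],
    skip_flags: Optional[Iterable[str]] = None,
    required_flags: Optional[Iterable[str]] = None,
    at_least_one_flag: Optional[Iterable[str]] = None,
) -> bool:
    """Return True if a token carrying `token_flags` would be processed."""
    skip = set(skip_flags or [])
    required = set(required_flags or [])
    any_of = set(at_least_one_flag or [])

    # skipFlags: a token carrying one of these flags is ignored (assumed: any one suffices).
    if token_flags & skip:
        return False
    # requiredFlags: the token must carry all of these flags.
    if not required <= token_flags:
        return False
    # atLeastOneFlag: when configured, the token must carry at least one of these flags.
    if any_of and not (token_flags & any_of):
        return False
    return True


# Hypothetical flags, for illustration only.
print(is_eligible({"TOKEN", "ALPHA"}, required_flags=["TOKEN", "ALPHA"]))
print(is_eligible({"TOKEN", "STOP"}, skip_flags=["STOP"]))
```

Leaving a parameter unset means that filter places no constraint on the token, which matches all three parameters being optional.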

Configuration Parameters

dictionary (string, optional) - The resource containing the list of words and relationships.
- If no dictionary is provided, a default dictionary will be used.
include (list, optional) - A list of the relationships to include.
exclude (list, optional) - A list of the relationships to exclude.
languageISO3 (string, optional) - The language the lemmatize stage should use. The value must be one of the ISO three-letter language codes.
- English is used by default. At the moment only English (ENG) and Spanish (SPA) are available.

A default dictionary is available in English.

Spanish is supported when the languageISO3 parameter is set to SPA.

Example Configuration

{
  "type": "LemmatizeStage",
  "include" : ["pl", "vf"],
  "exclude" : ["ob"],
  "dictionary" : "lemmatize-provider:lemmatize_words",
  "languageISO3":"SPA"
}
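
The documentation does not spell out how include and exclude combine, so the sketch below makes one plausible assumption explicit: an entry is kept when it has at least one included relationship (if include is set) and no excluded relationship. The function name filter_by_relationship and the "thou" entry are hypothetical and used only for illustration.

```python
from typing import Dict, Iterable, List, Optional


def filter_by_relationship(
    entries: List[Dict],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[Dict]:
    """Keep entries whose "rel" list passes the include/exclude filters."""
    include_set = set(include or [])
    exclude_set = set(exclude or [])
    kept = []
    for entry in entries:
        rels = set(entry.get("rel", []))
        # Assumed: with "include" set, at least one relationship must match.
        if include_set and not (rels & include_set):
            continue
        # Assumed: any excluded relationship rejects the entry.
        if rels & exclude_set:
            continue
        kept.append(entry)
    return kept


entries = [
    {"from": "projects", "to": "project", "rel": ["vf", "wnm", "pl"]},
    {"from": "encyclopaedia", "to": "encyclopedia", "rel": ["wnm", "sp"]},
    {"from": "thou", "to": "you", "rel": ["ob", "pl"]},  # hypothetical entry
]
kept = filter_by_relationship(entries, include=["pl", "vf"], exclude=["ob"])
print([e["from"] for e in kept])
```

Under these assumptions, only the "projects" entry survives the include and exclude lists from the example configuration.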

Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}
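
As a rough illustration of the lookup behind this output, the sketch below maps each token to its dictionary entry and collects the resulting lemmas. It is a simplified model, not the stage's code: DICTIONARY and lemmatize are hypothetical names, matching is assumed to be exact (no case folding), and the real stage works on an interpretation graph rather than a flat token list.

```python
from typing import Dict, List, Tuple

# Entries taken from the example output above, keyed by the original word.
DICTIONARY: Dict[str, Dict] = {
    "am": {"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "be"},
    "liking": {"confidence": 0.0084, "rel": ["vf", "wnm"], "to": "like"},
    "projects": {"confidence": 0.012, "rel": ["vf", "wnm", "pl"], "to": "project"},
}


def lemmatize(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (token, lemma) pairs for every token found in the dictionary."""
    results = []
    for token in tokens:
        entry = DICTIONARY.get(token)  # assumed: exact match on the token text
        if entry is not None:
            results.append((token, entry["to"]))
    return results


lemmas = lemmatize("I am liking this projects very much".split())
print(lemmas)
```

Tokens without a dictionary entry ("I", "this", "very", "much") produce no additional item, as in the diagram.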

Output Flags

Lex-Item Flags:

LEMMATIZE - All words retrieved will be marked as LEMMATIZE.

Resource Data

When the 'dictionary' parameter is used, the resource data must be a JSON file with an array of words in a field named words.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
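
A resource in this shape can be checked outside the pipeline before it is uploaded. The sketch below loads the sample above with Python's standard json module and indexes the entries by their from field; index_words is a hypothetical helper, not part of the product.

```python
import json
from collections import defaultdict
from typing import Dict, List

# The sample resource shown above, inlined as a string for the example.
RESOURCE = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"], "from": "encyclopaedia", "to": "encyclopedia"},
    {"confidence": 0.0752, "rel": ["wnm", "sp"], "from": "word", "to": "worth"}
  ]
}
"""


def index_words(resource_text: str) -> Dict[str, List[dict]]:
    """Parse the resource and group its entries by the original word."""
    data = json.loads(resource_text)
    index = defaultdict(list)
    for entry in data["words"]:
        index[entry["from"]].append(entry)
    return dict(index)


index = index_words(RESOURCE)
print(sorted(index))
```

Grouping into lists allows one original word to carry several entries.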

When the 'dictionary' parameter is not used, an embedded Wiktionary file is used instead. This file is formatted as one JSON entry per line:

Wiktionary file format

{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
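
Because each line is a complete JSON object, a file in this format is parsed line by line rather than as a single document. The sketch below shows that with three of the lines above; parse_entries and targets_by_source are hypothetical helper names.

```python
import json
from typing import Dict, List

# Three lines copied from the sample above.
LINES = """\
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
"""


def parse_entries(text: str) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entries.append(json.loads(line))
    return entries


def targets_by_source(entries: List[dict]) -> Dict[str, List[str]]:
    """Group the resulting words ("to") under each original word ("from")."""
    grouped: Dict[str, List[str]] = {}
    for entry in entries:
        grouped.setdefault(entry["from"], []).append(entry["to"])
    return grouped


grouped = targets_by_source(parse_entries(LINES))
print(grouped)
```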

Relationships

The required fields for each entry are:

from - Original word to search for
- this field is removed once the entry is added to the entities of the LexItem
to - Resulting word
- it will be a new LexItem on its own
rel - List of relationships between the original word and the resulting word
- List of relationships in the default dictionary:
  - pl - pluralization
  - vf - verb form
  - ob - obsolete
  - syn - synonym
  - alt - alternative
  - wwm - word with meaning (more than one)
  - wnm - word no meaning (no additional meaning)

Any other field will be included in the entities of the LexItem
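
The field rules above can be sketched as a small conversion step: check the required fields, drop from, and keep everything else as the entity. This is an illustration under assumptions, not the stage's implementation; to_lemma and the extra field source are hypothetical.

```python
from typing import Dict, Tuple

REQUIRED_FIELDS = ("from", "to", "rel")


def to_lemma(entry: Dict) -> Tuple[str, Dict]:
    """Return the text of the new lex item and the entity attached to it."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"entry is missing required fields: {missing}")
    # "from" is dropped; every other field, including extras, stays in the entity.
    entity = {key: value for key, value in entry.items() if key != "from"}
    return entry["to"], entity


text, entity = to_lemma(
    {
        "confidence": 0.012,
        "rel": ["vf", "wnm", "pl"],
        "from": "projects",
        "to": "project",
        "source": "custom",  # hypothetical extra field
    }
)
print(text, entity)
```

The resulting entity has the same shape as the entries in the Example Output section, which also carry no from field.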
