Matches tokens to words in a dictionary, then creates new LexItems for the token's lemma and/or synonyms, if configured.


Operates On:  Lexical Items with TOKEN and possibly other flags as specified below.

Note

This lemmatization does not use rules.


Generic Configuration Parameters

Configuration Parameters

  • dictionary - The resource containing the list of words and relationships.
    • If no dictionary is provided, a built-in dictionary is used, based on the languageISO3.
  • include (string array) - A list of the relationships to include.
  • exclude (string array) - A list of the relationships to exclude.
  • languageISO3 - The language the lemmatize stage should use. The value needs to be one of the ISO 3-letter language codes.
    • By default, English is used unless configured otherwise. At the moment, only English (ENG) and Spanish (SPA) are available.

Note

A default dictionary is available in English. Spanish is supported when the languageISO3 parameter is set to SPA.
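
The effect of the include and exclude parameters can be pictured with a small sketch. The exact matching rules are not stated on this page, so the code assumes that an entry is kept when at least one of its relationships is in include (or include is empty) and none of its relationships is in exclude. The function name `filter_entries` and the sample entries are hypothetical.

```python
def filter_entries(entries, include=None, exclude=None):
    """Keep entries whose relationships pass the include/exclude lists.

    Assumed semantics (not confirmed by this page): an entry is kept when
    at least one of its "rel" values is in `include` (or `include` is empty)
    and none of its "rel" values is in `exclude`.
    """
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for entry in entries:
        rels = set(entry["rel"])
        if rels & exclude:
            continue  # at least one relationship is excluded
        if include and not (rels & include):
            continue  # none of the relationships is included
        kept.append(entry)
    return kept


# Hypothetical sample entries, for illustration only.
entries = [
    {"from": "projects", "to": "project", "rel": ["pl"]},
    {"from": "liking", "to": "like", "rel": ["vf"]},
    {"from": "thou", "to": "you", "rel": ["ob"]},
    {"from": "big", "to": "large", "rel": ["syn"]},
]

selected = filter_entries(entries, include=["pl", "vf"], exclude=["ob"])
print([e["to"] for e in selected])
```

With include set to pl and vf, the synonym entry is dropped as well as the obsolete one, because neither carries an included relationship.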

Default Config

  • boundaryFlags: text block split
  • requiredFlags: token, semantic tag
  • skipFlags: skip

"include" : ["pl", "vf"],
"exclude" : ["ob"],
"dictionary" : "lemmatize-provider:lemmatize_words",
"languageISO3":"SPA"

Example Output

V--------------------[I am liking this projects very much]--------------------V  
^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
        ^--[be]--^---[like]---^          ^--[project]---^  
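
The graph above can be approximated with a plain dictionary lookup. This is only a sketch of the idea, not the stage's implementation: the `lemmatize` function is hypothetical, matching is assumed to be an exact string comparison, and the three entries are made up to mirror the example sentence.

```python
def lemmatize(tokens, entries):
    """Pair each token with the extra words the dictionary yields for it."""
    # Index the entries by their "from" word.
    index = {}
    for entry in entries:
        index.setdefault(entry["from"], []).append(entry["to"])
    # Tokens without a dictionary match get an empty list.
    return [(token, index.get(token, [])) for token in tokens]


# Hypothetical entries mirroring the example output.
entries = [
    {"from": "am", "to": "be", "rel": ["vf"]},
    {"from": "liking", "to": "like", "rel": ["vf"]},
    {"from": "projects", "to": "project", "rel": ["pl"]},
]

tokens = "I am liking this projects very much".split()
result = lemmatize(tokens, entries)
for token, lemmas in result:
    print(token, lemmas)
```

Each non-empty list corresponds to a new LexItem placed alongside the original token in the graph.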

Output Flags

Lex-Item Flags:

  • LEMMATIZE - All words retrieved will be marked as LEMMATIZE.
  • ALL_LOWER_CASE - All of the characters in the token are lower-case characters.
  • TOKEN - This stage creates a new token.

Vertex Flags:

Info
No vertices are created in this stage.

Resource Data

When the 'dictionary' parameter is used, the resource data is a JSON file containing an array of entries in a field named words.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
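
A resource in this shape could be read into a lookup table as follows. This is a minimal sketch that assumes the file is a JSON object whose words field holds the array. The name `load_dictionary` is hypothetical and not part of Saga.

```python
import json


def load_dictionary(text):
    """Parse the resource text and index its entries by their "from" word."""
    data = json.loads(text)
    index = {}
    for entry in data["words"]:
        # A word may appear in several entries, so keep a list per word.
        index.setdefault(entry["from"], []).append(entry)
    return index


resource = """
{
  "words": [
    {"confidence": 0.0049, "rel": ["wnm", "sp"], "from": "encyclopaedia", "to": "encyclopedia"},
    {"confidence": 0.0752, "rel": ["wnm", "sp"], "from": "word", "to": "worth"}
  ]
}
"""

index = load_dictionary(resource)
print(index["encyclopaedia"][0]["to"])  # prints "encyclopedia"
```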


When the 'dictionary' parameter is not used, an embedded Wiktionary file is used. This file contains one JSON entry per line:

Wiktionary file format
{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}

Relationships

The required fields for each entry are:

  • from - Original word to search for.
    • This field will be eliminated after it is added to the entities of the LexItem.
  • to - Resulting word. 
    • It will be a new LexItem on its own.
  • rel - List of relationships between the original word and the resulting word.
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)
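
Before loading a custom dictionary, it may help to check that every entry carries the required fields listed above. The following is a sketch of such a check and not something Saga provides. The function names are hypothetical, and the requirement that rel be a list is inferred from the examples on this page.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def missing_fields(entry):
    """Return the required field names that are absent from the entry."""
    return [name for name in REQUIRED_FIELDS if name not in entry]


def is_valid_entry(entry):
    """An entry is valid when all required fields exist and rel is a list."""
    return not missing_fields(entry) and isinstance(entry["rel"], list)


good = {"confidence": 0, "rel": ["syn"], "from": "japonés", "to": "nipón"}
bad = {"confidence": 0, "from": "japonés"}

print(is_valid_entry(good), missing_fields(bad))
```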