Lemmatizes tokens by matching them to words in a dictionary.


Operates On:  Lexical Items with TOKEN

Note

This lemmatization does not use rules.


Generic Configuration Parameters

Configuration Parameters

  • dictionary (string, optional) - The resource containing the list of words and relationships.
    • If no dictionary is provided, a default dictionary will be used.

  • exclude (list, optional) - A list of the relationships to exclude.

  • languageISO3 (string, optional) - The language the lemmatize stage should use. The value must be a three-letter ISO language code.
    • English is used by default. Currently, only English (ENG) and Spanish (SPA) are available.


Note

A default dictionary is available in English. Spanish is supported when the languageISO3 parameter is set to SPA.

Example Configuration
{
  "type": "Lemmatize",
  "include" : ["pl", "vf"],
  "exclude" : ["ob"],
  "dictionary" : "lemmatize-provider:lemmatize_words",
  "languageISO3":"SPA"
}
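
The language rules described under Configuration Parameters can be sketched as a small check. This is only an illustration in Python, not the stage's actual implementation: the helper name resolve_language is hypothetical, and treating the code as case-sensitive upper case is an assumption.

```python
import json

# Languages the parameter description lists as available.
SUPPORTED_LANGUAGES = {"ENG", "SPA"}
# English is used when languageISO3 is not configured.
DEFAULT_LANGUAGE = "ENG"


def resolve_language(stage_config: dict) -> str:
    """Return the language code a Lemmatize stage configuration asks for.

    Hypothetical helper: falls back to English when languageISO3 is absent
    and rejects codes other than ENG and SPA (upper case is assumed).
    """
    code = stage_config.get("languageISO3", DEFAULT_LANGUAGE)
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported languageISO3: {code!r}")
    return code


config = json.loads("""
{
  "type": "Lemmatize",
  "exclude": ["ob"],
  "dictionary": "lemmatize-provider:lemmatize_words",
  "languageISO3": "SPA"
}
""")

selected = resolve_language(config)
fallback = resolve_language({"type": "Lemmatize"})
```

Here selected holds the configured Spanish code, while fallback shows the English default for a configuration that omits languageISO3.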

Example Output

  V--------------------[I am liking this projects very much]--------------------V  
  ^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^  
          ^--[be]--^---[like]---^          ^--[project]---^  

am - {"confidence":0.0084,"rel":["vf","wnm"],"to":"be"}
liking - {"confidence":0.0084,"rel":["vf","wnm"],"to":"like"}
projects - {"confidence":0.012,"rel":["vf","wnm","pl"],"to":"project"}

Output Flags

Lex-Item Flags

  • LEMMATIZE - All words retrieved will be marked as LEMMATIZE.
  • ALL_LOWER_CASE - All of the characters in the token are lower-case characters.
  • TOKEN - This stage creates a new token.

Resource Data

When the 'dictionary' parameter is used, the resource data is a JSON file containing an array of entries in a field named words.

{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
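
To make the shape of this resource concrete, the following Python sketch loads the same structure and indexes the entries by their from field. The function name build_lookup and the choice of an in-memory index are assumptions made for illustration; the stage itself may store the data differently.

```python
import json
from collections import defaultdict

# Resource data in the format shown above (one entry kept for brevity).
resource_text = """
{
  "words": [
    {
      "confidence": 0.0049,
      "rel": ["wnm", "sp"],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    }
  ]
}
"""


def build_lookup(resource: dict) -> dict:
    """Group dictionary entries by the original word in their 'from' field.

    Hypothetical helper: a word may map to several entries, so each key
    holds a list.
    """
    lookup = defaultdict(list)
    for entry in resource["words"]:
        lookup[entry["from"]].append(entry)
    return dict(lookup)


lookup = build_lookup(json.loads(resource_text))
matches = lookup.get("encyclopaedia", [])
lemmas = [entry["to"] for entry in matches]
```

A token that is not present in the index simply yields an empty list, which mirrors a token with no dictionary match.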


When the 'dictionary' parameter is not used, an embedded Wiktionary file will be used. This file is formatted with one JSON entry per line:

Wiktionary file format
{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}
{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
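
Because each line is a complete JSON object, a file in this format can be read line by line. The Python sketch below shows one way to do that; the function name parse_wiktionary_lines is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
import json


def parse_wiktionary_lines(lines):
    """Parse an iterable of text lines, one JSON entry per line.

    Hypothetical helper: blank lines are skipped (an assumption).
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entries.append(json.loads(line))
    return entries


# A few lines taken from the sample above.
sample = [
    '{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}',
    '{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}',
    '{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}',
]

entries = parse_wiktionary_lines(sample)
# Collect every target word recorded for one original word.
targets = [entry["to"] for entry in entries if entry["from"] == "alemán"]
```

The same function accepts an open file object, since iterating over a text file also yields one line at a time.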

Relationships

The required fields for each entry are:

  • from - Original word to search for.
    • This field will be eliminated after it is added to the entities of the LexItem.
  • to - Resulting word.
    • It will be a new LexItem on its own.
  • rel - List of relationships between the original word and the resulting word.
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)


Tip

Any other field will be included in the entities of the LexItem.
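
The field rules above, together with the exclude parameter, can be illustrated with a short Python sketch. The names split_entry and is_excluded are hypothetical, and the exclusion rule used here (an entry is dropped when any of its relationships appears in the exclude list) is an assumption; the source does not spell out the exact matching behavior.

```python
REQUIRED_FIELDS = ("from", "to", "rel")


def split_entry(entry: dict):
    """Separate the required fields of a dictionary entry from any extras.

    Hypothetical helper: raises if a required field is missing.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"Entry is missing required fields: {missing}")
    required = {field: entry[field] for field in REQUIRED_FIELDS}
    extras = {key: value for key, value in entry.items()
              if key not in REQUIRED_FIELDS}
    return required, extras


def is_excluded(entry: dict, exclude: list) -> bool:
    """Assumed rule: exclude an entry if any of its relationships is listed."""
    return any(rel in exclude for rel in entry["rel"])


entry = {
    "confidence": 0.012,
    "rel": ["vf", "wnm", "pl"],
    "from": "projects",
    "to": "project",
}

required, extras = split_entry(entry)
kept_with_ob = not is_excluded(entry, ["ob"])
kept_with_pl = not is_excluded(entry, ["pl"])
```

In this sketch confidence ends up in extras, which corresponds to the Tip above: fields other than from, to, and rel are carried along with the entry.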