Matches tokens to words in a dictionary, then creates new LexItems for the token lemma and/or synonyms, if configured.


Operates On:  Lexical Items with TOKEN and possibly other flags as specified below.

Note

This lemmatization does not use rules.


Generic Configuration Parameters

Configuration Parameters

  • dictionary - The resource containing the list of words and relationships.
    • If no dictionary is provided, a built-in dictionary will be used, based on the languageISO3.
  • include (string array) - A list of the relationships to include.
  • exclude (string array) - A list of the relationships to exclude.
  • languageISO3 - The language the lemmatize stage should use. The value must be a three-letter ISO language code.
    • English is used by default unless configured otherwise. At the moment, only English (ENG) and Spanish (SPA) are available.

Note

A default dictionary is available in English. Spanish is supported when the languageISO3 parameter is set to SPA.
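As a rough illustration of how the include and exclude relationship lists might interact, the sketch below filters a few made-up dictionary entries. The precedence rule (exclude wins over include, and an empty include list means "keep everything") is an assumption for illustration only, and keep_entry is a hypothetical helper, not part of the stage.

```python
from typing import Iterable, List


def keep_entry(rel: Iterable[str], include: List[str], exclude: List[str]) -> bool:
    """Hypothetical filter; the real stage's precedence rules are not documented here."""
    rels = set(rel)
    # Assumption: any excluded relationship rejects the entry outright.
    if rels & set(exclude):
        return False
    # Assumption: an empty include list means "include everything".
    return not include or bool(rels & set(include))


# Relationship lists as they could appear in a stage configuration.
config = {"include": ["pl", "vf"], "exclude": ["ob"]}

# Made-up sample entries using the from/to/rel fields of the dictionary format.
entries = [
    {"from": "projects", "to": "project", "rel": ["pl"]},
    {"from": "liking", "to": "like", "rel": ["vf"]},
    {"from": "thou", "to": "you", "rel": ["ob", "alt"]},
    {"from": "big", "to": "large", "rel": ["syn"]},
]

kept = [
    e["to"]
    for e in entries
    if keep_entry(e["rel"], config["include"], config["exclude"])
]
print(kept)  # ['project', 'like']
```

Under these assumptions, only plural and verb-form matches survive, and the obsolete entry is dropped even though it also carries another relationship.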

Code Block
languagejs
titleDefault Config
boundaryFlagstext block split
requiredFlagstoken, semantic tag
skipFlagsskip
"include" : ["pl", "vf"],
"exclude" : ["ob"],
"dictionary" : "lemmatize-provider:lemmatize_words",
"languageISO3" : "SPA"

Example Output

Code Block
languagetext
V--------------------[I am liking this projects very much]--------------------V
^--[I]--V--[am]--V--[liking]--V--[this]--V--[projects]--V--[very]--V--[much]--^
        ^--[be]--^---[like]---^          ^--[project]---^
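The lookup behind the example output can be approximated with a plain dictionary match. This is a minimal sketch, not the stage's implementation: LEMMAS is a hypothetical in-memory stand-in for the dictionary resource, and lemmatize is an illustrative helper name.

```python
# Hypothetical in-memory dictionary; the real stage reads a resource or a built-in file.
LEMMAS = {"am": ["be"], "liking": ["like"], "projects": ["project"]}


def lemmatize(tokens):
    """Return (token, lemmas) pairs; tokens with no dictionary match get an empty list."""
    return [(token, LEMMAS.get(token.lower(), [])) for token in tokens]


result = lemmatize("I am liking this projects very much".split())
for token, lemmas in result:
    print(token, "->", lemmas)
```

Tokens with a match ("am", "liking", "projects") gain an alternative lemma alongside the original token, which mirrors the second row of the diagram.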

Output Flags

Lex-Item Flags:

  • LEMMATIZE - All words retrieved will be marked as LEMMATIZE.
  • ALL_LOWER_CASE - All of the characters in the token are lower-case characters.
  • TOKEN - This stage creates a new token.

Vertex Flags:

Info
No vertices are created in this stage

Resource Data

When the 'dictionary' parameter is used, the resource data is a JSON file with an array of words in a field named words.

Code Block
languagejs
{
  "words": [
    {
      "confidence": 0.0049,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "encyclopaedia",
      "to": "encyclopedia"
    },
    {
      "confidence": 0.0752,
      "rel": [
        "wnm",
        "sp"
      ],
      "from": "word",
      "to": "worth"
    }
  ]
}
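To show how such a resource could be consumed, here is a minimal sketch that parses the words array and indexes the entries by their from value. It assumes the file is a single JSON object with a words field, as described above; build_index is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Sample resource content, copied from the entries shown above.
raw = """
{"words": [
  {"confidence": 0.0049, "rel": ["wnm", "sp"], "from": "encyclopaedia", "to": "encyclopedia"},
  {"confidence": 0.0752, "rel": ["wnm", "sp"], "from": "word", "to": "worth"}
]}
"""


def build_index(text):
    """Group dictionary entries by the word they are looked up from (hypothetical helper)."""
    index = defaultdict(list)
    for entry in json.loads(text)["words"]:
        index[entry["from"]].append(entry)
    return index


index = build_index(raw)
print([e["to"] for e in index["encyclopaedia"]])  # ['encyclopedia']
```

Indexing by from keeps every target word for a given token, since one word may map to several results.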


When the 'dictionary' parameter is not used, an embedded Wiktionary file is used. This file contains one JSON entry per line:

Code Block
languagejs
themeRDark
titleWiktionary file format
{"confidence":0,"rel":["syn"],"from":"japonés","to":"nipón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"teutón"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"gringo"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"guainica"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"maisito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"turpial de sureste"}

...

{"confidence":0,"rel":["syn"],"from":"domingo","to":"paga"}
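Because each line is an independent JSON object, the file can be read line by line. The sketch below parses a few of the lines shown above and groups synonyms by source word; parse_json_lines is a hypothetical helper, and how the stage itself loads the embedded file is not documented here.

```python
import json

# A few lines in the one-entry-per-line format shown above.
sample = """\
{"confidence":0,"rel":["syn"],"from":"alemán","to":"germano"}
{"confidence":0,"rel":["syn"],"from":"alemán","to":"tedesco"}
{"confidence":0,"rel":["syn"],"from":"mayo","to":"mayito"}
"""


def parse_json_lines(text):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


entries = parse_json_lines(sample)

# Group target words by the word they are looked up from.
synonyms = {}
for entry in entries:
    synonyms.setdefault(entry["from"], []).append(entry["to"])

print(synonyms)  # {'alemán': ['germano', 'tedesco'], 'mayo': ['mayito']}
```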

Relationships

The required fields for each entry are:

  • from - Original word to search for.
    • This field will be eliminated after it is added to the entities of the LexItem.
  • to - Resulting word.
    • It will be a new LexItem on its own.
  • rel - List of relationships between the original word and the resulting word.
    • List of relationships in the default dictionary:
      • pl - pluralization
      • vf - verb form
      • ob - obsolete
      • syn - synonym
      • alt - alternative
      • wwm - word with meaning (more than one)
      • wnm - word no meaning (no additional meaning)
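A custom dictionary could be sanity-checked against the fields and relationship codes listed above before it is used. The following is a sketch only: check_entry is a hypothetical helper, and relationship codes outside the default list are reported as unlisted rather than invalid, since custom dictionaries may define their own.

```python
# Field names and relationship codes taken from the lists above.
REQUIRED_FIELDS = ("from", "to", "rel")
KNOWN_RELS = {"pl", "vf", "ob", "syn", "alt", "wwm", "wnm"}


def check_entry(entry):
    """Return a list of human-readable problems for one dictionary entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    # Codes not in the default list are flagged, not rejected (assumption).
    unlisted = sorted(set(entry.get("rel", [])) - KNOWN_RELS)
    if unlisted:
        problems.append("unlisted relationship(s): " + ", ".join(unlisted))
    return problems


print(check_entry({"from": "word", "to": "worth", "rel": ["wnm", "sp"]}))
# ['unlisted relationship(s): sp']
print(check_entry({"from": "mayo", "rel": ["syn"]}))
# ['missing field: to']
```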

      ...