The Saga Lemmatizer stage uses a dictionary file to identify words and reduce them to their basic form. For example, it will identify that "running" is a verb form (gerund) and reduce it to its base infinitive form, "run".

The Lemmatizer will also expand a word to its synonyms when the dictionary file contains synonyms for that word.

An important note here is that the Saga Lemmatizer stage code is agnostic to which reductions or expansions it uses. The set of reductions and expansions, called Relationship Types, is defined in the dictionary file and in the Lemmatizer configuration.

This means that when creating the dictionary file for each language, we need to decide which relationship types we want the lemmatizer to use and make sure they are included in the dictionary file.
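To illustrate how a lemmatizer can remain agnostic to relationship types, here is a minimal Python sketch. It is not the Saga implementation: the dictionary layout, the relation names (`lemma`, `synonym`), and the function `expand` are all hypothetical, and the real dictionary file format may differ.

```python
# Hypothetical in-memory dictionary: word -> relation type -> related words.
# This layout is an assumption for illustration, not the Saga file format.
DICTIONARY = {
    "running": {"lemma": ["run"]},
    "run": {"synonym": ["sprint", "jog"]},
}


def expand(word, enabled_relations, dictionary=DICTIONARY):
    """Return the related words for `word`, limited to the enabled relation types.

    The function never hard-codes a relation name: it only follows whatever
    relation types appear both in the configuration and in the dictionary.
    """
    entry = dictionary.get(word.lower(), {})
    results = {}
    for relation in enabled_relations:
        if relation in entry:
            results[relation] = list(entry[relation])
    return results


print(expand("running", ["lemma", "synonym"]))
print(expand("run", ["synonym"]))
```

In this sketch, a relation type that is enabled in the configuration but missing from the dictionary simply produces no output, which is why the dictionary file must contain every relationship type the lemmatizer is expected to use.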

To produce the dictionary file, an internal tool called "Wiktionary Dump Parser" is used. This tool in turn uses an open source library called "Java Wiktionary Library" ("JWKTL"), which helps with parsing Wiktionary.

The following diagram shows the components and stages the tool uses to produce a dictionary file:


[Diagram: components and stages of the Wiktionary Dump Parser tool]


Stage 1: The Wiktionary Dump Parser tool, using JWKTL, parses the language-specific Wiktionary dump file and produces an index in Oracle Berkeley DB format.

Stage 2: The Wiktionary Dump Parser tool reads the contents of the index, transforms them, and stores the entries in a MongoDB collection.

Stage 3: The Wiktionary Dump Parser tool reads and parses the entries stored in MongoDB and produces a JSON file in a format that the Saga Lemmatizer can use.
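The data flow through the three stages can be sketched as follows. This is only an illustration under stated assumptions: plain Python dictionaries and lists stand in for the Berkeley DB index and the MongoDB collection, and the function names, the record layout, and the output JSON shape are hypothetical rather than the tool's actual format.

```python
import json


def parse_dump(raw_entries):
    """Stage 1 stand-in: build an index keyed by word.

    The real tool builds a Berkeley DB index through JWKTL; here a dict is used.
    """
    index = {}
    for word, relation, target in raw_entries:
        index.setdefault(word, []).append((relation, target))
    return index


def transform(index):
    """Stage 2 stand-in: turn index entries into document-like records.

    The real tool stores these in a MongoDB collection; here a list is used.
    """
    return [
        {"word": word, "relations": [{"type": r, "target": t} for r, t in pairs]}
        for word, pairs in sorted(index.items())
    ]


def export_json(documents):
    """Stage 3 stand-in: group targets by relation type and serialize to JSON."""
    output = {}
    for doc in documents:
        grouped = {}
        for rel in doc["relations"]:
            grouped.setdefault(rel["type"], []).append(rel["target"])
        output[doc["word"]] = grouped
    return json.dumps(output, sort_keys=True)


# Hypothetical sample data: (word, relation type, related word).
raw = [
    ("running", "lemma", "run"),
    ("run", "synonym", "sprint"),
    ("run", "synonym", "jog"),
]
dictionary_json = export_json(transform(parse_dump(raw)))
print(dictionary_json)
```

Keeping each stage as a separate function mirrors the staged design described above: each stage consumes only the output of the previous one, so an intermediate store can be inspected or regenerated independently.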


The following steps explain where to get Wiktionary dump files, what code you need to add or modify to support an additional language, and how to test your newly generated dictionary file.


Step-by-step guide

Step 1: Download Wiktionary dump file

 




 


...