The Saga Lemmatize stage uses a dictionary file to identify words and reduce them to their base form. For example, it identifies that "running" is a verb form (gerund) and reduces it to its base infinitive form, "run".

The Lemmatizer also expands a word to its synonyms when the dictionary provides them. It can likewise expand antonyms, alternative forms, etc.

An important note here is that the Saga Lemmatize Stage code is agnostic to which reductions or expansions it uses. The set of reductions and expansions (Relationship Types) to use is defined in the dictionary file and in the Lemmatize Stage configuration.

This means that when creating the dictionary file for each language, we need to be careful to understand which relationship types we want the lemmatizer to use and make sure to include them in the dictionary file.
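To make the role of relationship types concrete, here is a minimal Python sketch. It is not the Saga implementation: the dictionary layout, the relationship type names ("lemma", "synonym", "antonym") and the function name are all hypothetical. It only shows that the lookup code never hard-codes a type; whatever is in the dictionary and enabled in the configuration is what gets used.

```python
# Hypothetical in-memory dictionary: word -> {relationship type -> related words}.
# The real Saga dictionary format and type names may differ.
DICTIONARY = {
    "running": {"lemma": ["run"], "synonym": ["jogging"]},
    "big": {"synonym": ["large"], "antonym": ["small"]},
}


def lemmatize(word, enabled_types):
    """Return the related words for `word`, limited to the enabled relationship types."""
    relations = DICTIONARY.get(word, {})
    return {
        rel_type: words
        for rel_type, words in relations.items()
        if rel_type in enabled_types
    }


print(lemmatize("running", {"lemma"}))
print(lemmatize("big", {"synonym", "antonym"}))
```

A relationship type that is missing from the dictionary can never be returned, no matter what the configuration enables, which is why the dictionary file must include every type you plan to use.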

In order to produce the dictionary file an internal tool called "Wiktionary Dump Parser" is used. This tool uses an open source library called "Java Wiktionary Library" (JWKTL) that makes the parsing of the Wiktionary dump file much easier.

In the following diagram you can see different components and the different stages the tool uses to produce a dictionary file:

The Wiktionary Dump Parser tool works in three stages:

Stage 1: using JWKTL, parse the language-specific Wiktionary dump file and produce an index in Oracle Berkeley DB format.

Stage 2: read the contents of the index, transform them and store the entries in a MongoDB collection.

Stage 3: read and parse the entries stored in MongoDB, then produce a file with one JSON entry per line that the Saga lemmatizer can use.
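As an illustration of the "one JSON entry per line" layout produced by Stage 3, here is a minimal Python sketch. The field names ("word", "lemma", "synonym") are hypothetical and a plain list stands in for the MongoDB collection; the actual schema is defined by the Wiktionary Dump Parser and is not shown in this guide.

```python
import io
import json


def write_json_lines(entries, out):
    """Write one JSON object per line to the file-like object `out`."""
    for entry in entries:
        # ensure_ascii=False keeps accented characters readable in the output.
        out.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Stand-in for documents read from the MongoDB collection (hypothetical fields).
entries = [
    {"word": "corriendo", "lemma": ["correr"]},
    {"word": "casa", "synonym": ["hogar", "vivienda"]},
]

buffer = io.StringIO()
write_json_lines(entries, buffer)
print(buffer.getvalue(), end="")
```

Each line is a complete JSON document on its own, so a consumer can read the file line by line without loading all of it into memory.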

The following steps explain where to get Wiktionary dump files, what code you need to add or modify to support an additional language, and how to test your newly generated dictionary file.

Step-by-step guide

Step 1: Download Wiktionary dump file

Dump files are periodically generated as backups on the Wikimedia site. You can see a list of all Wiki backups, including Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html

To find your language, use the two-letter ISO 639-1 language code followed by the word 'wiktionary'. For example:

For English: enwiktionary

For Spanish: eswiktionary

For German: dewiktionary

And so on...

Here is a useful list of language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php

For Spanish you should see something like this:

Once you know it exists, click on the link to load a landing page for the language or simply go directly to the file list page using this URL: https://dumps.wikimedia.org/eswiktionary/latest/

Note that the word 'eswiktionary' is used in the URL; you can change it to match your desired language, for example 'dewiktionary' for German.

When the page loads, look for a file that ends in "-latest-pages-articles.xml.bz2" and download it. For example, the Spanish file name is "eswiktionary-latest-pages-articles.xml.bz2"
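The naming rule above can be captured in a small helper. This is only a sketch built from the URL and file name patterns given in this guide; the function name is made up for illustration.

```python
def dump_url(language_code):
    """Build the 'latest' pages-articles dump URL for a Wiktionary language edition."""
    project = f"{language_code}wiktionary"
    return (
        f"https://dumps.wikimedia.org/{project}/latest/"
        f"{project}-latest-pages-articles.xml.bz2"
    )


print(dump_url("es"))
print(dump_url("de"))
```

For "es" this yields the Spanish file named above, under https://dumps.wikimedia.org/eswiktionary/latest/.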

Once downloaded, decompress it (the file is bzip2-compressed) and you are ready to go.
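If you prefer to script this step, the following Python sketch decompresses the dump with the standard library only. The function name and the file paths are placeholders; any bzip2 tool works just as well.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path):
    """Stream-decompress a .bz2 file so the whole dump never sits in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (paths are placeholders for wherever you saved the download):
# decompress_bz2("eswiktionary-latest-pages-articles.xml.bz2",
#                "eswiktionary-latest-pages-articles.xml")
```

Streaming matters here because dump files can be large once decompressed.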

Step 2: Get familiar with Wiktionary format and specific templates for your language

As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.

Each word you find in Wiktionary is called a page/article/entry and it is manually added by volunteers around the globe.

Languages differ, and so do the communities that contribute to them, so the format used to write these entries varies a lot between languages. It can even vary between entries in the same language, depending on who wrote the entry and how old it is.

This means that before implementing a new language for SAGA Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.

2.1 Inspecting an Entry

The first step is to load Wiktionary in your browser: https://www.wiktionary.org/

Then choose your desired language. This guide uses the Spanish edition as an example: https://es.wiktionary.org/wiki/Wikcionario:Portada

Then look for a word, for example 'casa' (Spanish for 'house'). Once the word page is displayed, click on the Edit tab:

You should see the entry definition, like this:

In general, for all languages:

Portions enclosed in curly brackets are called templates. A template has a name and zero or more parameters, which can be named parameters or numbered parameters (these have no name but must appear in order).
A sequence of 2 to 4 equals signs (=) usually marks the start of a new section. In the image above, the Word Sense section starts with === sustantivo femenino|es ===. This section defines the word form (a noun in this case) and all the different meanings the word has.
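The two conventions above can be explored with a few lines of Python. This is a rough sketch for inspecting entries by hand, not how JWKTL parses them: it ignores nested templates and wiki links, and the sample text is a made-up fragment in the style of a Spanish entry, not a copy of the real 'casa' page.

```python
import re

# A line made of 2 to 4 '=' signs, a title, and the same run of '=' signs.
HEADING = re.compile(r"^(={2,4})\s*(.*?)\s*\1\s*$")
# A non-nested template: {{name|param|key=value}}
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")


def parse_template(body):
    """Split a template body into its name, numbered parameters and named parameters."""
    name, *params = [part.strip() for part in body.split("|")]
    numbered, named = [], {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            numbered.append(param)
    return name, numbered, named


def scan(wikitext):
    """Yield ('section', level, title) and ('template', name, numbered, named) tuples."""
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            yield ("section", len(heading.group(1)), heading.group(2))
        for match in TEMPLATE.finditer(line):
            yield ("template",) + parse_template(match.group(1))


# Illustrative fragment only; template names and parameters are examples.
sample = """== {{lengua|es}} ==
=== {{sustantivo femenino|es}} ===
;1: Edificación destinada a vivienda.
{{sinónimo|hogar|vivienda|leng=es}}"""

for item in scan(sample):
    print(item)
```

Running a scan like this over a handful of entries in your language is a quick way to see which section titles and template names the parser will have to recognize.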

2.2 Learning more about Wiktionary

For more information on templates use the Help in the Wiktionary site. For example:

English help: https://en.wiktionary.org/wiki/Help:Contents
English templates: https://en.wiktionary.org/wiki/Wiktionary:Templates
Spanish help: https://es.wiktionary.org/wiki/Wikcionario:Ayuda
Spanish template listing: https://es.wiktionary.org/wiki/Especial:Todas?from=es.v.conj&to=&namespace=10

Your desired language should have a similar page to the Spanish template listing.

This listing is very useful when implementing a new language because you can get a list of all templates used and do searches.

Step 3: Get familiar with JWKTL and add support for your language

Wiktionary was created to be understood by humans, which means its data is not available in a structured form that computers can easily consume.

Consuming data from Wiktionary could be of value in many projects including Natural Language Processing (NLP). Even though there are companies that sell dictionaries, prices are usually high. This is why it makes sense to parse Wiktionary data and create a dictionary that we can use in our projects.

Java Wiktionary Library (JWKTL) is an open source project that aims to ease the parsing of Wiktionary data. Check out their site here: https://dkpro.github.io/dkpro-jwktl/

Currently JWKTL only supports English, German and Russian languages by default.

At Accenture, we added Spanish support and you can get the source code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-jwktl/browse

For Spanish, we focused on the bare minimum SAGA needs to work. If you want to do the same, it may be a good idea to base your new language on the Spanish parser (copy, paste and rename files). If you want to implement a more complete version of the parser then the English parser is a better option.

The following image shows the structure of the JWKTL project. Notice there is a folder for each language. You'll need to add a new folder for your desired language.

3.1 How JWKTL works

JWKTL detects the language of the dump file and uses the parser for the detected language.
For each entry in the dump file, and for each line in that entry:

It iterates over the list of handlers looking for one that can handle the current line. For example, if the current line has the pattern of a new section and the section is Etymology, then the EtymologyHandler processes the line and extracts the information from it.
A section usually consists of several lines. There is code in place to determine whether the next line belongs to the same section and must be handled by the same handler, or whether a new section has started and a new handler must be found.

Handlers are registered in the WiktionaryEntryParser for each language. The registration order is important, for example SenseHandler needs to be the last one. The recommendation is to follow the same order defined by the English parser for the handlers you are implementing.
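The dispatch loop described in this section can be sketched in Python as follows. This is an analogy, not JWKTL's Java code: the class and method names are made up, and modelling SenseHandler as a catch-all is an assumption made only to show one way registration order can change the result.

```python
class SectionHandler:
    """A handler claims a section header, then receives that section's body lines."""

    def __init__(self, titles):
        self.titles = titles
        self.lines = []

    def can_handle(self, header):
        return header in self.titles

    def process_body(self, line):
        self.lines.append(line)


class EtymologyHandler(SectionHandler):
    def __init__(self):
        super().__init__({"Etymology"})


class SenseHandler(SectionHandler):
    """Assumed catch-all for this sketch: it accepts any header."""

    def __init__(self):
        super().__init__(set())

    def can_handle(self, header):
        return True


def parse_entry(lines, handlers):
    """Route each line to the handler that claimed the most recent section header."""
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("==") and stripped.endswith("=="):
            header = stripped.strip("=").strip()
            # The first registered handler that accepts the header wins.
            current = next((h for h in handlers if h.can_handle(header)), None)
        elif current is not None:
            current.process_body(line)


entry = ["=== Etymology ===", "From Latin casa.", "=== Noun ===", "A house."]
etymology, sense = EtymologyHandler(), SenseHandler()
parse_entry(entry, [etymology, sense])  # catch-all registered last
print(etymology.lines, sense.lines)
```

If the two handlers were registered in the opposite order, the catch-all in this sketch would claim the Etymology section as well, and EtymologyHandler would never receive a line.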

3.2 Additional needed changes

In addition to adding a new folder and handlers for your new language, you need to make the following changes:

1. Add your new language as a new static field in this class: src/main/java/de/tudarmstadt/ukp/jwktl/api/util/Language.java

2. Add your new language parser instantiation in the onSiteInfoComplete method in the class: src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java
