
The Saga Lemmatizer stage uses a dictionary file to identify words and reduce them to their base form. For example, it identifies "running" as a verb form (gerund) and reduces it to its base infinitive form "run".

The Lemmatizer also expands a word to its synonyms when synonyms are available in the dictionary.

An important note here is that the Saga Lemmatizer stage code is agnostic to which reductions or expansions it uses. The set of reductions and expansions, called Relationship Types, is defined in the dictionary file and in the Lemmatizer configuration.

This means that when creating the dictionary file for a language, we must decide which relationship types the lemmatizer should use and make sure they are included in the dictionary file.
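To illustrate what "agnostic to relationship types" can mean in practice, here is a minimal sketch. It is not the Saga implementation: the dictionary layout, the relationship type names (`gerund_of`, `past_of`, `synonym`) and the `lookup` function are all hypothetical, chosen only to show that the lookup code never hard-codes a relationship type and simply follows whatever the dictionary and configuration provide.

```python
# Hypothetical dictionary layout: word -> relationship type -> related words.
# The real Saga dictionary format is not shown on this page.
DICTIONARY = {
    "running": {"gerund_of": ["run"], "synonym": ["jogging"]},
    "ran": {"past_of": ["run"]},
}


def lookup(word, dictionary, enabled_types):
    """Return the related words for `word`, limited to the enabled types.

    The function knows nothing about what a relationship type means;
    it only filters the dictionary entry by the configured type names.
    """
    relations = dictionary.get(word.lower(), {})
    return {
        rel_type: list(relations[rel_type])
        for rel_type in enabled_types
        if rel_type in relations
    }


if __name__ == "__main__":
    # Only reductions enabled: synonyms in the dictionary are ignored.
    print(lookup("Running", DICTIONARY, ["gerund_of", "past_of"]))
    # Reductions and expansions enabled.
    print(lookup("running", DICTIONARY, ["gerund_of", "synonym"]))
```

Under these assumptions, supporting a new relationship type means adding it to the dictionary file and to the configured list, with no change to the lookup code.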

To produce the dictionary file, an internal tool called "Wiktionary Dump Parser" is used. This tool in turn uses an open source library called "Java Wiktionary Library" (JWKTL), which helps with parsing the Wiktionary dump.

The following diagram shows the components and stages the tool uses to produce a dictionary file:



Stage 1: The Wiktionary dump parser tool, using JWKTL, parses the language-specific Wiktionary dump file and produces an index in Oracle Berkeley DB format.

Stage 2: The Wiktionary dump parser tool reads the contents of the index, transforms them, and stores the entries in a MongoDB collection.

Stage 3: The Wiktionary dump parser tool reads and parses the entries stored in MongoDB and produces a JSON file in a format that the Saga Lemmatizer can use.


The following steps explain where to get Wiktionary dump files, what code you need to add or modify to support an additional language, and how to test your newly generated dictionary file.

Step-by-step guide

Step 1: Download Wiktionary dump file

Dump files are generated periodically as backups on the Wikimedia site. You can see a list of all wiki backups, including Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html

To find your language, use the two-letter ISO language code followed by the word 'wiktionary'. For example:

For English: enwiktionary

For Spanish: eswiktionary

For German: dewiktionary

And so on...

Here is a useful list of language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php


For Spanish you should see something like this:


Once you know it exists, you can click on the link to load a landing page for the language or, even simpler, go directly to the file list page using this URL: https://dumps.wikimedia.org/eswiktionary/latest/

Note that the word 'eswiktionary' is used in the URL; you can change it to match your desired language, for example 'dewiktionary' if you want the German one.

When the page loads, look for the file whose name ends in "-latest-pages-articles.xml.bz2" and download it. For example, for Spanish the file name is "eswiktionary-latest-pages-articles.xml.bz2".
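The naming rules above can be captured in a small helper. This is only a sketch: the function names are made up for illustration, and the full download URL is assumed to be the file list URL followed by the file name.

```python
DUMPS_BASE = "https://dumps.wikimedia.org"


def wiki_name(lang_code):
    """Build the wiki name, e.g. 'es' -> 'eswiktionary'."""
    code = lang_code.strip().lower()
    # Assumption: a plain alphabetic language code such as 'en', 'es' or 'de'.
    if not code.isalpha():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"{code}wiktionary"


def dump_file_name(lang_code):
    """Name of the articles dump file for the given language."""
    return f"{wiki_name(lang_code)}-latest-pages-articles.xml.bz2"


def dump_url(lang_code):
    """Assumed download URL: the 'latest' file list page plus the file name."""
    return f"{DUMPS_BASE}/{wiki_name(lang_code)}/latest/{dump_file_name(lang_code)}"


if __name__ == "__main__":
    for code in ("en", "es", "de"):
        print(dump_url(code))
```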

Once downloaded, decompress it (the file is a bzip2 archive) and you are ready to go.
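If you prefer to decompress the file from a script rather than with a desktop tool, the following sketch uses Python's standard `bz2` module. The function name `decompress_dump` is hypothetical; the file is streamed in chunks because the decompressed XML can be large.

```python
import bz2
import shutil
from pathlib import Path


def decompress_dump(archive_path, chunk_size=1024 * 1024):
    """Decompress '<name>.xml.bz2' next to the archive and return the new path."""
    archive = Path(archive_path)
    if archive.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got: {archive.name}")
    target = archive.with_suffix("")  # drops only the trailing '.bz2'
    # Stream chunk by chunk so the whole dump is never held in memory.
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return target


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(decompress_dump(sys.argv[1]))
```

For example, passing "eswiktionary-latest-pages-articles.xml.bz2" writes "eswiktionary-latest-pages-articles.xml" in the same folder.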


Step 2: Get familiar with the Wiktionary format and the specific templates for your language

As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.

Each word you find in Wiktionary is called a page, article, or entry, and it is added manually by volunteers around the globe.

Languages differ, and each has its own community of contributors, so the format used to write these entries varies a lot between languages.

This means that before implementing a new language for the Saga Lemmatizer, it is a good idea to get familiar with the specifics of that language's Wiktionary.

2.1 Inspecting an Entry

The first step is to load Wiktionary in your browser: https://www.wiktionary.org/

Then choose your desired language. This guide uses Spanish: https://es.wiktionary.org/wiki/Wikcionario:Portada

Then look up a word, for example 'casa' (Spanish for 'home'). Once the word's page is displayed, click on the Edit tab:



You should see the entry definition, like this:


In general and for all languages:

  • Portions enclosed in double curly brackets are called templates. A template has a name and zero or more parameters, which can be named or numbered (numbered parameters have no name and must appear in order).
  • A run of two, three, or more equals signs (=) usually marks the start of a new section. In the image above they start the Word Sense section, which defines the word form (a noun in this case) and all the different meanings the word has.
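The two conventions above can be explored with a short script. This is a simplified sketch, not how JWKTL or the Wiktionary Dump Parser handles wikitext: it ignores nested templates and pipes inside links, and the sample text is an invented illustration rather than a copy of the real 'casa' entry. All function names are hypothetical.

```python
import re

# Non-nested templates: {{name|param|key=value}}
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")
# Section headings: the same run of '=' (two or more) on both sides of the title.
HEADING_RE = re.compile(r"^(={2,})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def parse_template(body):
    """Split a template body into its name, numbered and named parameters."""
    parts = [p.strip() for p in body.split("|")]
    positional, named = [], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(part)  # numbered parameter: order matters
    return {"name": parts[0], "positional": positional, "named": named}


def find_templates(wikitext):
    return [parse_template(m.group(1)) for m in TEMPLATE_RE.finditer(wikitext)]


def find_headings(wikitext):
    """Return (level, title) pairs, where level is the number of '=' signs."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(wikitext)]


# Invented sample for illustration only.
SAMPLE = """== {{lengua|es}} ==
=== {{sustantivo femenino|es}} ===
;1: Edificio para habitar.
{{ejemplo|La casa es grande|fuente=inventado}}
"""

if __name__ == "__main__":
    for level, title in find_headings(SAMPLE):
        print(level, title)
    for template in find_templates(SAMPLE):
        print(template)
```

Running a script like this over a few real entries of your target language is a quick way to list which template names its community actually uses.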


For more information  







 
