The Saga Lemmatize Stage uses a dictionary file to identify words and reduce them to their base form. For example, it identifies that "running" is a verb form (gerund) and reduces it to its base infinitive form "run".

The Lemmatizer also expands the word to its synonyms when synonyms are available in the dictionary. It can also expand antonyms, alternative forms, etc.

An important note here is that the Saga Lemmatize Stage code is agnostic to which reductions or expansions it uses. The set of reductions and expansions (Relationship Types) to use is defined in the dictionary file and in the Lemmatize Stage configuration.

This means that when creating the dictionary file for each language, we need to be careful to understand which relationship types we want the lemmatizer to use and make sure to include them in the dictionary file.

To produce the dictionary file, an internal tool called "Wiktionary Dump Parser" is used. This tool uses an open source library called "Java Wiktionary Library" (JWKTL) that makes parsing the Wiktionary dump file much easier.

In the following diagram you can see different components and different stages the tool uses to produce a dictionary file:



The Wiktionary dump parser tool will:

Stage 1: Using JWKTL, parse the language-specific Wiktionary dump file and produce an index in Oracle Berkeley DB format.

Stage 2: Read the contents of the index, transform them and store the entries in a MongoDB collection.

Stage 3: Read and parse the entries stored in MongoDB, then produce a file in a JSON-entry-per-line format that the Saga Lemmatize Stage can use.
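To make the "one JSON entry per line" output of Stage 3 more concrete, here is a minimal Python sketch of writing and reading such a file. The field names ("word", "relationships") and the sample entries are hypothetical and only illustrate the layout; the actual schema is whatever the Wiktionary Dump Parser emits for the Lemmatize Stage.

```python
import io
import json


def write_entries(entries, stream):
    """Write each entry as one JSON object per line."""
    for entry in entries:
        stream.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_entries(stream):
    """Read the entries back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical entries: the field names are assumptions, not the real schema.
entries = [
    {"word": "running", "relationships": {"infinitive": ["run"]}},
    {"word": "home", "relationships": {"synonym": ["house"]}},
]

buffer = io.StringIO()
write_entries(entries, buffer)
buffer.seek(0)
loaded = read_entries(buffer)
```

Because every line is a complete JSON object, a file in this layout can be read one entry at a time without loading the whole dictionary into memory.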


The following steps explain where to get Wiktionary dump files, what code you need to add or modify to support an additional language, and how to test your newly generated dictionary file.

...

For Spanish you should see something like this:

[Image]


Once you know it exists, you can click on the link to load a landing page for the language, or simply go directly to the file list page using this URL: https://dumps.wikimedia.org/eswiktionary/latest/

Note that the URL contains the word 'eswiktionary'; you can change it to match your desired language. For example, use 'dewiktionary' if you want the German one.

When the page loads, look for a file that ends in "-latest-pages-articles.xml.bz2" and download it. For example, for Spanish the file name is "eswiktionary-latest-pages-articles.xml.bz2".
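The naming pattern described above can be captured in a small Python sketch that builds the listing URL and the dump file name from a language code. The function names are hypothetical, and the direct file URL assumes the dump file sits directly under the "latest" listing page.

```python
DUMPS_BASE = "https://dumps.wikimedia.org"


def dump_listing_url(lang_code):
    """URL of the file list page, e.g. 'es' -> .../eswiktionary/latest/"""
    return f"{DUMPS_BASE}/{lang_code}wiktionary/latest/"


def dump_file_name(lang_code):
    """Name of the pages-articles dump file for the given language."""
    return f"{lang_code}wiktionary-latest-pages-articles.xml.bz2"


def dump_file_url(lang_code):
    """Assumed direct download URL: listing page plus file name."""
    return dump_listing_url(lang_code) + dump_file_name(lang_code)


if __name__ == "__main__":
    for code in ("es", "de"):
        print(dump_file_url(code))
```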

...

Each word you find in Wiktionary is called a page/article/entry and it is manually added by volunteers around the globe.

Languages are different and the communities contributing to them are different, so the format used to write these entries varies a lot between languages. It can even change between entries in the same language, depending on who wrote the entry and how old it is.

This means that before implementing a new language for the Saga Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.

...

Then choose your desired language. This guide uses Spanish as an example: https://es.wiktionary.org/wiki/Wikcionario:Portada

Then look for a word, for example 'casa' (Spanish for home). Once the word page is displayed, click on the Edit tab:


[Image]


You should see the entry definition, like this:

[Image]


In general, for all languages:

  • Portions that appear enclosed in curly brackets are called templates. Templates have a name and usually zero to N parameters, which can be named parameters or numbered parameters (these have no name but need to appear in order).
  • A sequence of 2 to 4 equals signs (=) usually means the start of a new section. In the image above, the Word Sense section starts with === sustantivo femenino|es ===. This section defines the word form (a noun in this case) and all the different meanings the word has.
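The two conventions above can be illustrated with a short Python sketch that pulls section headings and templates out of a wikitext fragment. This is not how JWKTL parses entries; it is a regex-based illustration only. The sample text is made up, the "sinonimo" template with its "nota" parameter is hypothetical, and nested templates are not handled.

```python
import re

# Hypothetical wikitext fragment, loosely modeled on a Spanish entry.
SAMPLE = """== {{lengua|es}} ==
=== {{sustantivo femenino|es}} ===
;1: Edificio para habitar.
{{sinonimo|hogar|vivienda|nota=ejemplo}}
"""

# A heading is a line wrapped in the same run of 2 to 4 equals signs.
HEADING_RE = re.compile(r"^(={2,4})\s*(.*?)\s*\1\s*$", re.MULTILINE)
# A template is text in double curly brackets (no nesting supported here).
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def parse_template(body):
    """Split a template body into name, numbered params and named params."""
    parts = [p.strip() for p in body.split("|")]
    name, numbered, named = parts[0], [], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            numbered.append(part)
    return name, numbered, named


def headings(text):
    """Return (level, title) pairs, where level is the number of '=' signs."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(text)]


def templates(text):
    """Return every template found in the text, in order of appearance."""
    return [parse_template(m.group(1)) for m in TEMPLATE_RE.finditer(text)]
```

Calling headings(SAMPLE) returns the section levels with their titles, and templates(SAMPLE) returns each template with its numbered parameters kept in order and its named parameters as a dictionary.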


...


2.2 Learning more about Wiktionary

For more information on templates, use the Help pages on the Wiktionary site. For example:

  1. English help: https://en.wiktionary.org/wiki/Help:Contents
  2. English templates: https://en.wiktionary.org/wiki/Wiktionary:Templates
  3. Spanish help: https://es.wiktionary.org/wiki/Wikcionario:Ayuda
  4. Spanish template listing: https://es.wiktionary.org/wiki/Especial:Todas?from=es.v.conj&to=&namespace=10


Info

Your desired language should have a similar page to the Spanish template listing.

This listing is very useful when implementing a new language because you can get a list of all templates used and do searches.


Step 3: Get familiar with JWKTL and add support for your language




 


...