The Saga Lemmatize Stage uses a dictionary file to identify words and reduce them to their basic form.  For example, it identifies that "running" is a verb form (gerund) and reduces it to its base infinitive form "run".

  • The Lemmatizer will also expand the word to its synonyms if synonyms are available in the dictionary. 
  • It could also expand to antonyms, alternative forms and others.

Something important to know is that the Saga Lemmatizer Stage code is agnostic to which reductions or expansions it uses. 

  • The set of reductions and expansions (called Relationship Types) to be used is defined in the dictionary file and the Lemmatize Stage configuration. 
  • This means that when creating the dictionary file for each language, we need to be careful to understand which relationship types we want the lemmatizer to use, and make certain to include them in the dictionary file.

To produce the dictionary file, an Accenture internal tool called "Wiktionary Dump Parser" is used. 

  • This tool uses an open source library called "Java Wiktionary Library" (JWKTL) that makes parsing a Wiktionary dump file much easier.

In the following diagram, you'll see various components and the three stages the tool uses to produce a dictionary file:


The Wiktionary dump parser tool will:

Stage 1:  Using JWKTL, parses the language-specific Wiktionary dump file and produces an index in Oracle Berkeley DB format.

Stage 2:  Reads the contents of the index created in Stage 1, transforms the content and stores entries in a MongoDB collection.

Stage 3:  Parses the entries stored in the MongoDB collection created in Stage 2, and then produces a file with one JSON entry per line, the format the SAGA lemmatizer understands.

Step by Step Guide


The following steps describe where to get Wiktionary dump files, what code you need to add/modify to support a new language and how to test your newly generated dictionary file.

Info
Before you begin

Please read this guide in its entirety before you start to write code, so you'll have a more thorough understanding of the process.

1 - Download the Wiktionary dump file

Dump files are periodically generated as a backup on the Wikimedia site.  You can see a list of all Wiki backups, including Wiktionary for each language, at this URL:  https://dumps.wikimedia.org/backup-index.html

To find your language, use the two-letter ISO language code followed by the word 'wiktionary'. 

For example:

  • English:  enwiktionary
  • Spanish:  eswiktionary
  • German:  dewiktionary

Here is a useful list of language codes:  https://www.loc.gov/standards/iso639-2/php/code_list.php

For Spanish, you should see something like this:


After you know it exists, click on the link to load a landing page for the language or simply go directly to the file list page using this URL:  https://dumps.wikimedia.org/eswiktionary/latest/

Note that the word 'eswiktionary' is used in the URL. You can change it to match your desired language, for example 'dewiktionary' for German.

When the page loads, look for a file that ends in "-latest-pages-articles.xml.bz2" and download it. For example, the Spanish file name is "eswiktionary-latest-pages-articles.xml.bz2"

Once downloaded, decompress the bzip2 archive to obtain the XML dump file and you are ready to go.
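
The naming pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the full download URL is an assumption obtained by joining the "latest" directory URL and the file name pattern shown in this step.

```python
import bz2
import shutil

BASE_URL = "https://dumps.wikimedia.org"


def dump_file_name(lang_code):
    """File name of the latest articles dump, e.g. 'es' -> eswiktionary-..."""
    return f"{lang_code.lower()}wiktionary-latest-pages-articles.xml.bz2"


def dump_url(lang_code):
    """Assumed full URL: the 'latest' directory plus the dump file name."""
    project = f"{lang_code.lower()}wiktionary"
    return f"{BASE_URL}/{project}/latest/{dump_file_name(lang_code)}"


def decompress(bz2_path, xml_path):
    """Stream-decompress the .bz2 archive so the whole dump never sits in memory."""
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Streaming the decompression matters because dump files can be large; any standard bzip2 tool does the same job.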

2 - Get familiar with Wiktionary format and language-specific templates

As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.

  • Each word you find in Wiktionary is called a page, article or entry and it is manually added by volunteers around the globe.
  • Obviously, languages and communities contributing to Wiktionary are different, therefore the format used to write these entries varies a lot between languages.
    The format can even vary between entries for the same language.  Variations usually depend on who wrote the entry and how old it is.

Before implementing a new language for SAGA Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.

2.1 Inspecting an entry

  1. A good first step is to load Wiktionary in your browser:  https://www.wiktionary.org/
  2. Then choose your desired language. 
    For this guide, we'll use Spanish as the example:  https://es.wiktionary.org/wiki/Wikcionario:Portada
  3. Then look for a word. 
    For example: 'casa' (Spanish for house). 
  4. Once the word page is displayed, click on the Editar (Edit) tab.


You should see the entry definition, like this:

In general, for all languages:

  • Portions that appear enclosed in curly brackets are called templates.
    Templates have a name and zero or more parameters, which can be named or numbered (numbered parameters have no name and must appear in order).
  • A sequence of 2 to 4 equals signs (=) usually marks the start of a new section. 
    In the image above, the section WordSense is started with === {{sustantivo femenino|es}} ===. 
    This section defines the word form (a noun in this case) and all different senses the word has.
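
To make these two conventions concrete, here is a minimal Python sketch that recognizes a section heading and splits a template into its name and parameters. It is not how JWKTL does it; the function and class names are hypothetical, the second template in the usage lines is made up, and nested templates are deliberately not handled.

```python
import re
from dataclasses import dataclass, field

# 2 to 4 equals signs, a title, then the same number of equals signs.
HEADING = re.compile(r"^(={2,4})\s*(.+?)\s*\1\s*$")
# A single, non-nested template such as {{name|param|key=value}}.
TEMPLATE = re.compile(r"\{\{([^{}]+)\}\}")


@dataclass
class Template:
    name: str
    positional: list = field(default_factory=list)
    named: dict = field(default_factory=dict)


def parse_heading(line):
    """Return (level, title) for a section heading line, else None."""
    match = HEADING.match(line.strip())
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


def parse_template(text):
    """Return the first template found in text, else None."""
    match = TEMPLATE.search(text)
    if match is None:
        return None
    parts = [part.strip() for part in match.group(1).split("|")]
    template = Template(name=parts[0])
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            template.named[key.strip()] = value.strip()
        else:
            template.positional.append(part)
    return template


level, title = parse_heading("=== {{sustantivo femenino|es}} ===")
noun = parse_template(title)  # name 'sustantivo femenino', positional ['es']
made_up = parse_template("{{ejemplo|casa|leng=es}}")  # hypothetical template
```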

2.2 Learning more about Wiktionary

For more information, use the Help pages on the Wiktionary site.  Also look for your language-specific help page. 

Some examples:


Info

Your desired language should have a similar page to the Spanish template listing (#4 above). This listing is very useful when implementing a new language because you can get a list of all available templates. Also, you can perform searches and access the documentation for each one to understand how they work.

3 - Get familiar with JWKTL and add support for your language

Wiktionary was created to be understood by humans, which means data is not available in a structured way that is easily consumable by computers. 

  • Consuming data from Wiktionary could be of great value for many projects, including Natural Language Processing (NLP) ones like SAGA.
  • There are companies that sell complete dictionaries in a computer-readable form, but prices are usually high.  
  • This is why it makes sense to parse Wiktionary data and create a dictionary that we can use in our projects.

Java Wiktionary Library (JWKTL) is an open source project created by a university in Germany that aims to ease the parsing of Wiktionary data. 


At Accenture, we added Spanish support and you can get the source code from Git:  https://source.digital.accenture.com/projects/ST/repos/saga-jwktl/browse

  • For Spanish, we focused on the bare minimum SAGA needs to make it work. 
  • If you want to do the same, it may be a good idea to base your new language on the Spanish parser (copy, paste and rename files).
  • If you want to implement a more complete version of the parser, then the English parser is a better option to base your work on.


The following image shows the structure of the JWKTL project. 

  • Notice there is a folder for each language. 
  • You'll need to add a new folder for your desired language. 
  • Each language folder has a set of handlers which are the ones actually parsing the Wiktionary.

3.1 How JWKTL works

1. It will detect the language of the dump file and use the correct parser for the language detected.

2. For each entry in the dump file:

  • It will iterate the entry line by line:
    • Send the line to each of the Handlers until one is found that can handle the current line. For example, if the current line has the pattern of a new section and the section is Etymology, then the EtymologyHandler will process the line and extract its information.
    • A section usually consists of several lines.  There is code in place to determine whether the next line belongs to the same section and must go to the same handler, or whether a new section has started and a different handler must take over.
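
The dispatch loop described above can be sketched as follows. This is a simplified Python illustration, not JWKTL's actual Java code: the SectionHandler class, its methods, and the sample entry text are all hypothetical.

```python
import re

HEADING = re.compile(r"^(={2,4})\s*(.+?)\s*\1\s*$")


class SectionHandler:
    """Hypothetical handler that collects the body lines of one section type."""

    def __init__(self, section_name):
        self.section_name = section_name
        self.lines = []

    def can_handle(self, heading_title):
        return heading_title.lower() == self.section_name.lower()

    def process_body(self, line):
        self.lines.append(line)


def parse_entry(text, handlers):
    """Walk the entry line by line, switching handlers at each heading."""
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            # New section: the first registered handler that accepts it wins,
            # which is why registration order matters.
            title = heading.group(2)
            current = next((h for h in handlers if h.can_handle(title)), None)
        elif current is not None:
            # Same section as before: keep feeding the current handler.
            current.process_body(line)


etymology = SectionHandler("Etymology")
synonyms = SectionHandler("Synonyms")
sample_entry = """== Etymology ==
From Latin casa.
== Unknown section ==
ignored line
== Synonyms ==
* hogar
* vivienda
"""
parse_entry(sample_entry, [etymology, synonyms])
```

Lines under a heading that no handler accepts are simply skipped in this sketch; a real parser may treat them differently.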


Info

Handlers are registered in the WiktionaryEntryParser for each language.  The registration order is important; for example, SenseHandler needs to be the last one. The recommendation is to follow the same order defined by the English parser for the handlers you are implementing.

3.2 Additional changes

1. Add your new language as a new static field in this class: src/main/java/de/tudarmstadt/ukp/jwktl/api/util/Language.java  

2. Add your new language parser instantiation in the 'onSiteInfoComplete' method in the class: 'src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java'

3.3 Testing

The project has a lot of unit tests. Take some from an existing language and adjust them for your new language.

4 - Update Wiktionary Dump Parser tool for your new language

Even though JWKTL does a lot of parsing and gives us information in a structured way, there is still some parsing we need to do in the dump parser tool.

For example, when iterating over the senses of the word 'casas' (Spanish for houses) we will get this from the JWKTL entry: "{{f.s.p|casa}}"

SAGA cannot use that string as is, because it does not know what the template means.  The dump parser therefore parses the string: it detects that 'f.s.p' is a template used to denote a plural noun form and that its only parameter is the root word.

Therefore, the dump parser tool will create a plural type relationship between the word 'casas' and its root 'casa'. This relationship entry is something SAGA understands and will use to reduce the word 'casas' to 'casa' in the Lemmatize stage.
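
As a rough illustration of that step, the Python sketch below turns the sense string into a relationship record. The function name, the template-to-relationship table, and the field names of the record are assumptions made for this example; they are not the dump parser's real classes or the actual SAGA dictionary format.

```python
import re

# Matches a template with exactly one parameter, e.g. {{f.s.p|casa}}.
ONE_PARAM_TEMPLATE = re.compile(r"\{\{([^{}|]+)\|([^{}|]+)\}\}")

# Hypothetical table: template name -> relationship type it denotes.
FORM_TEMPLATES = {
    "f.s.p": "plural",
}


def relation_from_sense(word, sense_text):
    """Return a relationship record for a known form template, else None."""
    match = ONE_PARAM_TEMPLATE.search(sense_text)
    if match is None:
        return None
    template_name = match.group(1).strip()
    root_word = match.group(2).strip()
    relation_type = FORM_TEMPLATES.get(template_name)
    if relation_type is None:
        return None  # unknown template: nothing the lemmatizer can use
    # Field names below are illustrative only.
    return {"word": word, "relation": relation_type, "target": root_word}


record = relation_from_sense("casas", "{{f.s.p|casa}}")
```

Supporting a new language largely means filling a table like FORM_TEMPLATES with that language's own templates.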

4.1 Steps to add your new language

1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse

2. Add a new folder and its corresponding SenseParser and RelationNormalizer files for your language.

As an example, for Spanish we have:


Info

Review the existing files for both English and Spanish to get an idea of how to implement these 2 files for your new language. The implementation will depend a lot on the specific templates used in your language's Wiktionary.


3. Add page parser instantiation for your language in the method 'GetPageParser' in the class: '\src\main\java\com\searchtechnologies\wiktionary\WiktionaryParser.java':


4. Add normalizer instantiation for your language in the method 'GetNormalizer' in the class: '\src\main\java\com\searchtechnologies\wiktionary\RelationNormalization.java':

4.2 Using the Dump Parser Tool

The tool is a command line tool. If you run it without any parameter you'll get help information.

Basically you need to run the tool three times:

1. Do the first run with the -parse option in order to parse the Wiktionary dump file and create an index:

Parameters:

  • -parse: Operation flag for parsing
  • file: Downloaded dump file
  • output: Output directory where JWKTL will store the DB (index)

Example:  -parse file=c:/temp/wiktionary.xml output=c:/temp/index

2.  Do a second run with the -mongo option to read the index and create entries in MongoDB. Make sure a MongoDB server instance is running first.

Parameters:

  • -mongo: Operation flag for adding info to Mongo
  • lang: 3 letter ISO code for the language of the Wiktionary File
  • indexDir: Directory where JWKTL index is stored
  • host: MongoDB host name
  • port: MongoDB port number
  • db: MongoDB database name
  • collection: MongoDB collection name

Example:  -mongo lang=spa indexDir=c:/temp/index host=localhost port=27017 db=dictionary collection=wiktionary

3. Do a third and last run with the -dict option to read the MongoDB collection and produce the JSON file that SAGA will eventually use:

Parameters:

  • -dict: Operation flag for generating SAGA file
  • lang: 3 letter ISO code for the language of the Wiktionary File
  • indexDir: Directory where JWKTL index is stored
  • outputDir: Output directory where SAGA file will be stored
  • host: MongoDB host name
  • port: MongoDB port number
  • db: MongoDB database name
  • collection: MongoDB collection name

Example:  -dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary
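
Since the three runs share several values, it can help to generate their argument lists from one place. The Python sketch below only assembles the arguments shown in the examples above; the function name is hypothetical, and how the tool itself is launched (for example, the jar or script name) is not covered here and would have to be prepended.

```python
def build_runs(dump_file, index_dir, output_dir, lang,
               host="localhost", port=27017,
               db="dictionary", collection="wiktionary"):
    """Return the argument lists for the -parse, -mongo and -dict runs, in order."""
    mongo_args = [
        f"host={host}",
        f"port={port}",
        f"db={db}",
        f"collection={collection}",
    ]
    return [
        # Run 1: parse the dump file into the JWKTL index.
        ["-parse", f"file={dump_file}", f"output={index_dir}"],
        # Run 2: read the index and fill the MongoDB collection.
        ["-mongo", f"lang={lang}", f"indexDir={index_dir}", *mongo_args],
        # Run 3: read the collection and write the SAGA dictionary file.
        ["-dict", f"lang={lang}", f"indexDir={index_dir}",
         f"outputDir={output_dir}", *mongo_args],
    ]


runs = build_runs("c:/temp/wiktionary.xml", "c:/temp/index", "c:/temp/saga", "spa")
command_lines = [" ".join(args) for args in runs]
```

Each run depends on the output of the previous one, so the order of the list is the order in which the tool must be executed.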

5 - Add Wiktionary file to Saga Library

Once you have created a new dictionary file for your language, you'll need to add it to Saga-Library so it can be used by the Lemmatize Stage.

First, rename the generated file to "wiktionary-[3 letter language code here]", using the ISO 3 letter language code. For example, if your new language is German, the file should be named "wiktionary-DEU".

Then add the file to the saga-library Git repository: https://source.digital.accenture.com/projects/ST/repos/saga-library/browse at the following path:  \src\main\resources

5.1 Testing

In order to test SAGA Lemmatize stage using your new language dictionary, add new unit tests to the file: "TestLemmatizeStage":


Info

Remember to use the 'languageISO3' parameter with your language code when testing. Otherwise, the English dictionary is used by default.
