...

The Lemmatizer will also expand the word to its synonyms when synonyms are available in the dictionary. It could also expand to antonyms, alternative forms, and others.

An important note here is that the Saga Lemmatizer Stage code is agnostic to what reductions or expansions it uses. The set of reductions and expansions to be used (called Relationship Types) is defined in the dictionary file and in the Lemmatize Stage configuration.

...

In order to produce the dictionary file, an Accenture internal tool called "Wiktionary Dump Parser" is used. This tool uses an open source library called "Java Wiktionary Library" (JWKTL) that makes parsing a Wiktionary dump file much easier.

In the following diagram you'll see the different components and the 3 stages the tool uses to produce a dictionary file:

...

Stage 1: uses JWKTL to parse the language-specific Wiktionary dump file and produces an index in Oracle Berkeley DB format.

Stage 2: reads the contents of the index created in Stage 1, transforms them and stores the entries in a MongoDB collection.

Stage 3: parses the entries stored in the MongoDB collection created in Stage 2, then produces a file in a one-JSON-entry-per-line format that the SAGA lemmatizer understands.


The following steps describe where to get Wiktionary dump files, what code you need to add or modify to support a new language, and how to test your newly generated dictionary file.

It's recommended that you read this guide fully before you start writing code, so you get a better understanding of the big picture.

Step-by-step guide

Step 1: Download Wiktionary dump file

Dump files are periodically generated as backups on the Wikimedia site. You can see a list of all Wiki backups, including Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html

In order to find your language, use the ISO 2-letter language code plus the word 'wiktionary'. For example:

  • English: enwiktionary

...

  • Spanish: eswiktionary

...

  • German: dewiktionary

And so on...

Here is a useful list of language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php
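As a small illustration of the naming rule above, the following sketch builds the project name from a 2-letter language code. The helper names are hypothetical, and the download URL layout (`<project>/latest/<project>-latest-pages-articles.xml.bz2`) is an assumption; confirm the actual file name on the backup index page before downloading.

```python
def wiktionary_project(language_code: str) -> str:
    """Return the dump project name, e.g. 'es' -> 'eswiktionary'."""
    code = language_code.strip().lower()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"expected a 2-letter language code, got {language_code!r}")
    return f"{code}wiktionary"


def latest_dump_url(language_code: str) -> str:
    """Build the assumed URL of the most recent articles dump for a language."""
    project = wiktionary_project(language_code)
    return (
        "https://dumps.wikimedia.org/"
        f"{project}/latest/{project}-latest-pages-articles.xml.bz2"
    )


if __name__ == "__main__":
    for code in ("en", "es", "de"):
        print(code, "->", latest_dump_url(code))
```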

...

Once downloaded, just unzip it and you are ready to go.
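If you prefer to script the extraction, a minimal sketch could look like the one below. It assumes the downloaded dump is a bzip2 archive (a `.bz2` file); the function name and paths are illustrative only.

```python
import bz2
import shutil
from pathlib import Path


def decompress_dump(archive: Path, target_dir: Path) -> Path:
    """Decompress a .bz2 dump into target_dir and return the extracted file path."""
    if archive.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 archive, got {archive.name!r}")
    target_dir.mkdir(parents=True, exist_ok=True)
    # 'dump.xml.bz2' -> 'dump.xml'
    output = target_dir / archive.stem
    # Stream in 1 MiB chunks so large dumps never have to fit in memory.
    with bz2.open(archive, "rb") as src, open(output, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return output
```

For example, `decompress_dump(Path("eswiktionary-latest-pages-articles.xml.bz2"), Path("dumps"))` would write the XML file into the `dumps` folder, assuming that archive exists locally.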


Step 2: Get familiar with Wiktionary format and specific templates for your language

As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.

...

Obviously, the languages and communities contributing to Wiktionary are different, so the format used to write these entries varies a lot between languages. It can even vary between entries for the same language. Variations usually depend on who wrote the entry and how old the entry is.

This means that before implementing a new language for SAGA Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.

2.1 Inspecting an Entry

The first step is to load Wiktionary in your browser: https://www.wiktionary.org/

...

  • Portions that appear enclosed in curly brackets are called templates. Templates have a name and usually zero to N parameters, which can be named parameters or numbered parameters (these have no name but need to appear in order).
  • A sequence of 2 to 4 equals (=) characters usually marks the start of a new section. In the image above, the WordSense section is started with === {{sustantivo femenino|es}} ===. This section defines the word form (a noun in this case) and all the different senses the word has.
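The two conventions above can be sketched with a couple of regular expressions. This is only an illustration of the markup, not how JWKTL parses it: the function names are hypothetical, and the sketch does not handle nested templates or wiki links that contain pipe characters.

```python
import re
from typing import Dict, List, Optional, Tuple

# A heading is 2 to 4 '=' characters, a title, and the same run of '=' again.
HEADING_RE = re.compile(r"^(={2,4})\s*(.+?)\s*\1\s*$")
# A simple (non-nested) template: {{name|param|key=value}}
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def parse_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, title) if the line is a section heading, else None."""
    match = HEADING_RE.match(line.strip())
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


def parse_template(text: str) -> Optional[Tuple[str, List[str], Dict[str, str]]]:
    """Return (name, numbered parameters, named parameters) of the first template."""
    match = TEMPLATE_RE.search(text)
    if match is None:
        return None
    name, *params = [part.strip() for part in match.group(1).split("|")]
    positional: List[str] = []
    named: Dict[str, str] = {}
    for param in params:
        key, separator, value = param.partition("=")
        if separator:
            named[key.strip()] = value.strip()
        else:
            positional.append(param)
    return name, positional, named


if __name__ == "__main__":
    heading = parse_heading("=== {{sustantivo femenino|es}} ===")
    print(heading)
    if heading is not None:
        print(parse_template(heading[1]))
```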

2.2 Learning more about Wiktionary

For more information, use the Help pages on the Wiktionary site. Also look for the help page specific to your language. Some examples:

  1. English help: https://en.wiktionary.org/wiki/Help:Contents
  2. English templates: https://en.wiktionary.org/wiki/Wiktionary:Templates
  3. Spanish help: https://es.wiktionary.org/wiki/Wikcionario:Ayuda
  4. Spanish template listing: https://es.wiktionary.org/wiki/Especial:Todas?from=es.v.conj&to=&namespace=10

...

Info

Your desired language should have a similar page to the Spanish template listing (#4 above)

This listing is very useful when implementing a new language because you can get a list of all templates used. You can also perform searches and access the documentation for each template to learn how it works.


Step 3: Get familiar with JWKTL and add support for your language

Wiktionary was created to be understood by humans, which means its data is not available in a structured form that is easily consumable by computers.

Consuming data from Wiktionary could be of great value for many projects, including Natural Language Processing (NLP) ones like SAGA. There are companies that sell complete dictionaries in a machine-readable form, but prices are usually high. This is why it makes sense to parse Wiktionary data and create a dictionary that we can use in our projects.

...

For Spanish, we focused on the bare minimum SAGA needs to work. If you want to do the same, it may be a good idea to base your new language on the Spanish parser (copy, paste and rename files). If you want to implement a more complete version of the parser then the English parser is a better option to base your work on.

The following image shows the structure of the JWKTL project. Notice there is a folder for each language. You'll need to add a new folder for your desired language. Each language folder has a set of handlers, which are the ones that actually parse the Wiktionary entries:



3.1 How JWKTL works

  1. It will detect the language of the dump file and use the correct parser for the language detected.
  2. For each entry in the dump file:
    1. It will iterate over the entry line by line:
      1. It will send the line to each of the handlers until it finds one that can handle the current line. For example, if the current line has the pattern of a new section and the section is Etymology, then the EtymologyHandler will process the line and extract the information from it. A section usually consists of several lines. There is code in place to determine whether the next line belongs to the same section and needs to be handled by the same handler, or whether a new section was found and a new handler needs to be found to handle it.
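The dispatch loop described above can be sketched roughly as follows. This is a simplified illustration and not JWKTL's actual API: the class names, method names and keyword matching are assumptions made for the example. The catch-all handler is registered last, in line with the page's note that the SenseHandler needs to be the last one.

```python
import re
from typing import Dict, List, Optional

HEADING_RE = re.compile(r"^(={2,4})\s*(.+?)\s*\1\s*$")


class SectionHandler:
    """Hypothetical handler: claims a section by keyword and collects its lines."""

    def __init__(self, name: str, keywords: List[str]) -> None:
        self.name = name
        self.keywords = keywords
        self.lines: List[str] = []

    def can_handle(self, heading: str) -> bool:
        return any(keyword in heading.lower() for keyword in self.keywords)

    def process_line(self, line: str) -> None:
        self.lines.append(line)


class CatchAllHandler(SectionHandler):
    """Accepts any section, so it must be registered after the specific handlers."""

    def can_handle(self, heading: str) -> bool:
        return True


def parse_entry(text: str, handlers: List[SectionHandler]) -> Dict[str, List[str]]:
    current: Optional[SectionHandler] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        heading = HEADING_RE.match(line)
        if heading:
            # New section: pick the first registered handler that accepts it.
            current = next(
                (h for h in handlers if h.can_handle(heading.group(2))), None
            )
        elif current is not None:
            # Same section: keep feeding lines to the current handler.
            current.process_line(line)
    return {handler.name: handler.lines for handler in handlers}


if __name__ == "__main__":
    entry = "== Etymology ==\nFrom Latin casa.\n=== Noun ===\n# A house.\n# A home."
    handlers = [SectionHandler("etymology", ["etymology"]), CatchAllHandler("sense", [])]
    print(parse_entry(entry, handlers))
```

Because the first matching handler wins, placing the catch-all handler earlier in the list would swallow sections meant for the more specific handlers.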


Info

Handlers are registered in the WiktionaryEntryParser for each language. The registration order is important: for example, SenseHandler needs to be the last one. The recommendation is to follow the same order defined by the English parser for the handlers you are implementing.

3.2 Additional changes

...

  1. Add your new language as a new static field in this class: src/main/java/de/tudarmstadt/ukp/jwktl/api/util/Language.java  

...

2. Add your new language parser instantiation in the 'onSiteInfoComplete' method in the class: 'src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java'

3.3 Testing

The project has a lot of unit tests. Just grab some from an existing language and adjust them for your new language.


Step 4:  Update Wiktionary Dump Parser tool for your new language

Even though JWKTL does a lot of parsing and gives us information in a structured way, there is still some parsing we need to do in the dump parser tool.

...

Therefore, the dump parser tool will create a plural type relationship between the word 'casas' and its root 'casa'. This relationship entry is something SAGA understands and will use to reduce the word 'casas' to 'casa' in the Lemmatize stage.
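To make the idea concrete, here is a minimal sketch of a plural relationship written as one JSON entry per line and then used to reduce 'casas' to 'casa'. The field names (`word`, `type`, `target`) and the function names are hypothetical; the real dictionary schema produced by the tool may differ.

```python
import json
from typing import Dict, Iterable, List, Set


def relationship_entry(word: str, relation_type: str, target: str) -> Dict[str, str]:
    # Hypothetical field names, chosen only for this illustration.
    return {"word": word, "type": relation_type, "target": target}


def to_json_lines(entries: Iterable[Dict[str, str]]) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in entries)


def load_relations(json_lines: str) -> Dict[str, List[Dict[str, str]]]:
    """Index the relationship entries by the word they apply to."""
    index: Dict[str, List[Dict[str, str]]] = {}
    for line in json_lines.splitlines():
        if line.strip():
            entry = json.loads(line)
            index.setdefault(entry["word"], []).append(entry)
    return index


def reduce_word(word: str, index: Dict[str, List[Dict[str, str]]], allowed: Set[str]) -> str:
    """Return the target of the first relationship whose type is enabled."""
    for entry in index.get(word, []):
        if entry["type"] in allowed:
            return entry["target"]
    return word


if __name__ == "__main__":
    lines = to_json_lines([relationship_entry("casas", "plural", "casa")])
    index = load_relations(lines)
    print(reduce_word("casas", index, {"plural"}))
```

The `allowed` set stands in for the Relationship Types enabled in the stage configuration: a relationship that exists in the dictionary is only applied when its type is enabled.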


4.1 Steps to add your language support

1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse

...

3. Add normalizer instantiation for your language in the method 'GetNormalizer' in the class: '\src\main\java\com\searchtechnologies\wiktionary\RelationNormalization.java':


4.2 Using the Dump Parser Tool

The tool is a command-line tool. If you run it without any parameters, you'll get help information.

...

 -dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary


Step 5: Add Wiktionary file to Saga Library

Once you have created a new dictionary file for your language, you'll need to add it to Saga-Library so it can be used by the Lemmatize Stage.

...

Then add the file to the saga-library Git repository: https://source.digital.accenture.com/projects/ST/repos/saga-library/browse at the following path: \src\main\resources


5.1 Testing

In order to test SAGA Lemmatize stage using your new language dictionary, add new unit tests to the file: "TestLemmatizeStage":

...

Info

Remember to use the 'languageISO3' parameter with your language code when testing. Otherwise, the English dictionary is used by default.




...