...

To produce the dictionary file, an Accenture internal tool called "Wiktionary Dump Parser" is used. This tool relies on an open source library called "Java Wiktionary Library" (JWKTL), which makes parsing the Wiktionary dump file much easier.

The following diagram shows the components involved and the 3 stages the tool goes through to produce a dictionary file:

...

Stage 2:  read the contents of the index created in Stage 1, transform them and store the entries in a MongoDB collection.

Stage 3:  read and parse the entries stored in the MongoDB collection created in Stage 2, then produce a file in a one-JSON-entry-per-line format that the SAGA lemmatizer can use.
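
As a rough illustration of the "JSON entry per line" format mentioned in Stage 3, the sketch below writes one JSON object per line. It is written in Python only to keep it short (the tool itself is Java), and the field names word, relation and root are assumptions made up for this example, not the actual schema SAGA expects.

```python
import io
import json


def write_json_lines(entries, stream):
    """Write each entry as a single JSON object on its own line."""
    for entry in entries:
        # ensure_ascii=False keeps accented characters readable in the output.
        stream.write(json.dumps(entry, ensure_ascii=False, sort_keys=True))
        stream.write("\n")


# Hypothetical entries; the real field names are defined by the tool.
sample_entries = [
    {"word": "casas", "relation": "plural", "root": "casa"},
    {"word": "niñas", "relation": "plural", "root": "niña"},
]

buffer = io.StringIO()
write_json_lines(sample_entries, buffer)
json_lines = buffer.getvalue().splitlines()
```

Because every line is a complete JSON document, a consumer can process the file line by line without loading all of it into memory.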


The following steps describe where to get Wiktionary dump files, what code you need to add or modify to support a new language, and how to test your newly generated dictionary file.

We recommend reading this guide in full before you start writing code, so that you get a better understanding of the big picture.

Step-by-step guide

Step 1: Download Wiktionary dump file

...

Each word you find in Wiktionary is called a page, article or entry, and it is added manually by volunteers around the globe.

Languages are different, and so are the communities contributing to Wiktionary, so the format used to write these entries varies a lot between languages. It can even vary between entries in the same language. In the end it depends on who wrote the entry and how old the entry is.

...

  • Portions enclosed in curly brackets are called templates. A template has a name and usually zero to N parameters, which can be named parameters or numbered parameters (these have no name but must appear in order).
  • A sequence of 2 to 4 equals signs (=) usually marks the start of a new section. In the image above, the WordSense section starts with === sustantivo femenino|es ===. This section defines the word form (a noun in this case) and all the different meanings the word has.
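
To make the two markup conventions above concrete, here is a minimal sketch that recognizes a section heading and splits a template into its name, numbered parameters and named parameters. It is plain Python written for this guide, not JWKTL code; the function names are hypothetical, and nested templates or links containing a pipe character are deliberately not handled.

```python
import re

# A heading is a line wrapped in 2 to 4 equals signs on both sides.
HEADING_RE = re.compile(r"^(={2,4})\s*(.*?)\s*\1$")


def parse_heading(line):
    """Return (level, title) for a section heading line, or None."""
    match = HEADING_RE.match(line.strip())
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


def parse_template(text):
    """Split '{{name|a|key=value}}' into name, numbered and named parameters.

    Nested templates and wiki links containing '|' are not handled here.
    """
    text = text.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    name, *params = text[2:-2].split("|")
    numbered, named = [], {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            numbered.append(param.strip())
    return name.strip(), numbered, named
```

For instance, parse_heading("=== sustantivo femenino|es ===") yields level 3 with the title "sustantivo femenino|es", while a made-up template such as {{example|first|lang=es}} yields one numbered parameter and one named parameter.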

...

Info

Your desired language should have a page similar to the Spanish template listing (#4 above).

This listing is very useful when implementing a new language because it gives you a list of all the templates in use and also lets you perform searches.


Step 3: Get familiar with JWKTL and add support for your language

Wiktionary was created to be understood by humans, which means its data is not available in a structured form that computers can easily consume.

Consuming data from Wiktionary could be of great value for many projects, including Natural Language Processing (NLP) ones. There are companies that sell complete dictionaries in a computer-readable form, but prices are usually high. This is why it makes sense to parse Wiktionary data and create a dictionary that we can use in our projects.

Java Wiktionary Library (JWKTL) is an open source project created by a university in Germany to ease the parsing of Wiktionary data. Check out their site here: https://dkpro.github.io/dkpro-jwktl/

...

The following image shows the structure of the JWKTL project. Notice there is a folder for each language. You'll need to add a new folder for your desired language. Each language folder has a set of handlers, which are the classes that actually parse the Wiktionary entries:



3.1 How JWKTL works

  1. It will detect the language of the dump file and use the correct parser for the language detected.
  2. For each entry in the dump file:
    1. For each line in the entry:
      1. It will iterate over the list of handlers looking for one that can handle the current line. For example, if the current line has the pattern of a new section and the section is Etymology, then the EtymologyHandler will process the line and extract its information.
      2. A section usually consists of several lines. There is code in place to determine whether the next line belongs to the same section and must be handled by the same handler, or whether a new section has started and a new handler must be found and used.
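
The dispatch loop described above can be sketched as follows. This is a simplified Python illustration, not JWKTL's actual classes: SectionHandler, can_handle, process and parse_entry are hypothetical names, and the sample entry lines are made up for the example.

```python
class SectionHandler:
    """Collects the lines of one kind of section (hypothetical class)."""

    def __init__(self, name, keyword=None):
        self.name = name
        self.keyword = keyword  # None means "accept any section"
        self.lines = []

    def can_handle(self, heading):
        return self.keyword is None or self.keyword in heading.lower()

    def process(self, line):
        self.lines.append(line)


def is_heading(line):
    stripped = line.strip()
    return stripped.startswith("==") and stripped.endswith("==")


def parse_entry(lines, handlers):
    """Dispatch each line to the handler of the section it belongs to."""
    current = None
    for line in lines:
        if is_heading(line):
            # New section: pick the first registered handler that accepts it.
            current = next((h for h in handlers if h.can_handle(line)), None)
        elif current is not None and line.strip():
            # Same section as before: keep using the current handler.
            current.process(line)
    return {h.name: h.lines for h in handlers}


handlers = [
    SectionHandler("etymology", "etimología"),
    SectionHandler("sense"),  # catch-all, so it has to be registered last
]
entry = [
    "=== Etimología ===",
    "Del latín casa.",
    "=== sustantivo femenino ===",
    "1 Edificio para habitar.",
]
result = parse_entry(entry, handlers)
```

Because the first matching handler wins, a catch-all handler registered too early would swallow sections meant for more specific handlers, which is one way to see why registration order matters.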


Info

Handlers are registered in the WiktionaryEntryParser for each language. The registration order is important; for example, SenseHandler needs to be the last one. We recommend following the same order the English parser uses for the handlers you are implementing.

...

2. Add your new language parser instantiation in the onSiteInfoComplete method in the class: src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java

3.3 Testing

The project has a lot of unit tests. Just grab some for an existing language and adjust them for your new language.


Step 4:  Update Wiktionary Dump Parser tool for your new language

Even though JWKTL does a lot of parsing and gives us information in a structured way, there is still some parsing we need to do in the dump parser tool.

For example, when iterating over the senses of the word 'casas' (Spanish for houses) we will get this from a JWKTL entry: "{{f.s.p|casa}}"

SAGA cannot use that raw string as is. The dump parser therefore parses the string: it detects that 'f.s.p' is a template used to denote a plural noun form and that its only parameter is the root word.

Therefore, the dump parser tool will create a plural type relationship between the word 'casas' and its root 'casa'. This relationship entry is something SAGA understands and will use to reduce the word 'casas' to 'casa' in the Lemmatize stage.
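
A minimal sketch of this step, under stated assumptions: the lookup table, the function name and the shape of the relation record are all hypothetical, and Python is used only for brevity (the real SenseParser and RelationNormalizer are Java classes). Only the 'f.s.p' template and the casas/casa pair come from the example above.

```python
# Hypothetical lookup table: template name -> relation type.
# Only 'f.s.p' comes from the example above; a real table would be much larger.
TEMPLATE_RELATIONS = {"f.s.p": "plural"}


def extract_relation(word, sense_text):
    """Turn a sense such as '{{f.s.p|casa}}' into a relation record, or None."""
    text = sense_text.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    name, *params = [part.strip() for part in text[2:-2].split("|")]
    relation = TEMPLATE_RELATIONS.get(name)
    if relation is None or not params:
        return None
    # The first numbered parameter is taken to be the root word.
    return {"word": word, "relation": relation, "root": params[0]}
```

Calling extract_relation("casas", "{{f.s.p|casa}}") returns a record linking 'casas' to 'casa' with a plural relation; senses that are not a known template return None and would need other handling.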


4.1 Steps to add your language support

1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse

2. Add a new folder and its corresponding SenseParser and RelationNormalizer files for your language. For example, for Spanish we have:




Info

Review the existing files for English and Spanish to get an idea of how to implement these 2 files for your new language.

 

3. Add page parser instantiation for your language in the method 'GetPageParser' in the class: '\src\main\java\com\searchtechnologies\wiktionary\WiktionaryParser.java':



4. Add normalizer instantiation for your language in the method 'GetNormalizer' in the class: '\src\main\java\com\searchtechnologies\wiktionary\RelationNormalization.java':



4.2 Using the Dump Parser Tool

The tool is a command-line tool. If you run it without any parameters, it prints help information.

Basically you need to run the tool 3 times:

1. First run with the -parse option in order to parse the Wiktionary dump file and create an index:

Parameters:  

-parse: Operation flag for parsing  

file:   Downloaded dump file  

output: Output directory where JWKTL will store the DB (index)

Example:   -parse file=c:/temp/wiktionary.xml output=c:/temp/index


2. Second run with the -mongo option to read the index and create entries in a MongoDB collection (make sure you have a MongoDB server instance running):

 Parameters:  

-mongo:     Operation flag for adding info to Mongo  

lang:       3 letter ISO code for the language of the Wiktionary File  

indexDir:   Directory where JWKTL index is stored  

host:       MongoDB host name  

port:       MongoDB port number  

db:         MongoDB database name  

collection: MongoDB collection name

Example:  

-mongo lang=spa indexDir=c:/temp/index host=localhost port=27017 db=dictionary collection=wiktionary


3. Third run with the -dict option to read the MongoDB collection and produce the JSON file that SAGA will eventually use:

Parameters:  

-dict:      Operation flag for generating SAGA file  

lang:       3 letter ISO code for the language of the Wiktionary File  

indexDir:   Directory where JWKTL index is stored  

outputDir:  Output directory where SAGA file will be stored  

host:       MongoDB host name  

port:       MongoDB port number  

db:         MongoDB database name  

collection: MongoDB collection name

Example:  

 -dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary
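
Since the three runs share most of their settings, it can help to assemble the argument lists from one place. The sketch below does only that: it uses exactly the flags and parameter names listed above and builds the three argument lists in order. How the tool is launched (jar name, Java command) is not covered here, and the helper name build_runs is hypothetical.

```python
def build_runs(dump_file, index_dir, output_dir, lang, host, port, db, collection):
    """Return the argument lists for the three runs, in execution order."""
    mongo = [f"host={host}", f"port={port}", f"db={db}", f"collection={collection}"]
    return [
        ["-parse", f"file={dump_file}", f"output={index_dir}"],
        ["-mongo", f"lang={lang}", f"indexDir={index_dir}", *mongo],
        ["-dict", f"lang={lang}", f"indexDir={index_dir}",
         f"outputDir={output_dir}", *mongo],
    ]


runs = build_runs(
    dump_file="c:/temp/wiktionary.xml",
    index_dir="c:/temp/index",
    output_dir="c:/temp/saga",
    lang="spa",
    host="localhost",
    port=27017,
    db="dictionary",
    collection="wiktionary",
)
command_lines = [" ".join(args) for args in runs]
```

Keeping the index directory and MongoDB settings in a single call makes it harder for the second and third runs to point at different locations by mistake.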






...