The Wiktionary dump parser tool will:
Stage 1: Using JWKTL, parse the language-specific Wiktionary dump file and produce an index in Oracle Berkeley DB format.
Stage 2: Read the contents of the index created in Stage 1, transform the content, and store the entries in a MongoDB collection.
Stage 3: Parse the entries stored in the MongoDB collection created in Stage 2, and produce a file in the one-JSON-entry-per-line format that the SAGA lemmatizer understands.
The following steps describe where to get Wiktionary dump files, what code you need to add/modify to support a new language and how to test your newly generated dictionary file.
Before you begin
Please read this guide in its entirety before you start to write code, so you'll have a more thorough understanding of the process.
Dump files are generated periodically as backups on the Wikimedia site. You can see a list of all wiki backups, including the Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html
To find your language, search for the two-letter ISO language code followed by the word 'wiktionary'.
For example:
Here is a useful list of language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php
For Spanish, you should see something like this:
After you know it exists, click on the link to load a landing page for the language or simply go directly to the file list page using this URL: https://dumps.wikimedia.org/eswiktionary/latest/
Note that the word 'eswiktionary' is used in the URL; you can change it to match your desired language. For example, use 'dewiktionary' for German.
When the page loads, look for a file that ends in "-latest-pages-articles.xml.bz2" and download it. For example, the Spanish file name is "eswiktionary-latest-pages-articles.xml.bz2".
Once downloaded, decompress the file (it is bzip2-compressed) and you are ready to go.
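The naming pattern above can be sketched in Java. This is only an illustration: the class and method names are hypothetical, and the full download URL assumes the file sits directly under the "latest" file list page, as it does in the Spanish example.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical names): builds dump locations from a two-letter language code. */
final class WiktionaryDumpUrls {
    private static final String BASE = "https://dumps.wikimedia.org/";

    /** Project name used in the dump URLs, for example "eswiktionary". */
    static String projectName(String iso2) {
        if (iso2 == null || !iso2.matches("[A-Za-z]{2}")) {
            throw new IllegalArgumentException("Expected a two-letter language code: " + iso2);
        }
        return iso2.toLowerCase(Locale.ROOT) + "wiktionary";
    }

    /** File list page for the latest dump of the language. */
    static String latestFileListUrl(String iso2) {
        return BASE + projectName(iso2) + "/latest/";
    }

    /** Name of the articles dump file to download. */
    static String articlesFileName(String iso2) {
        return projectName(iso2) + "-latest-pages-articles.xml.bz2";
    }

    /** Assumption: the file is served directly under the file list page. */
    static String articlesDumpUrl(String iso2) {
        return latestFileListUrl(iso2) + articlesFileName(iso2);
    }

    public static void main(String[] args) {
        System.out.println(latestFileListUrl("es"));
        System.out.println(articlesFileName("es"));
    }
}
```

Passing "de" instead of "es" yields the German equivalents.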
As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.
Before implementing a new language for SAGA Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.
You should see the entry definition, like this:
In general, for all languages:
sustantivo femenino|es}}
===. For more information, use the Help on the Wiktionary site. Also look for your language-specific help page.
Some examples:
Your desired language should have a page similar to the Spanish template listing (#4 above). This listing is very useful when implementing a new language because it gives you a list of all available templates. You can also search it and open the documentation for each template to understand how it works.
Wiktionary was created to be understood by humans, which means data is not available in a structured way that is easily consumable by computers.
Java Wiktionary Library (JWKTL) is an open source project, created by a university in Germany, that aims to ease the parsing of Wiktionary data.
At Accenture, we added Spanish support and you can get the source code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-jwktl/browse
The following image shows the structure of the JWKTL project.
1. It will detect the language of the dump file and use the correct parser for the language detected.
2. For each entry in the dump file:
Handlers are registered in the WiktionaryEntryParser for each language. The registration order is important; for example, SenseHandler needs to be the last one. The recommendation is to follow the same order that the English parser defines for the handlers you are implementing.
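One plausible reason the order matters can be sketched as follows. The sketch assumes the parser picks the first registered handler that accepts a section header, and that the sense handler accepts almost any header. Both are assumptions made for illustration, and none of these types are the actual JWKTL classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: these types are hypothetical and are not the JWKTL classes. */
final class HandlerOrderSketch {

    /** A named handler that decides from a section header whether it applies. */
    static final class Handler {
        final String name;
        final Predicate<String> accepts;

        Handler(String name, Predicate<String> accepts) {
            this.name = name;
            this.accepts = accepts;
        }
    }

    private final List<Handler> handlers = new ArrayList<>();

    /** Handlers are consulted in the order they were registered. */
    HandlerOrderSketch register(Handler handler) {
        handlers.add(handler);
        return this;
    }

    /** Returns the name of the first handler accepting the header, or null if none does. */
    String select(String header) {
        for (Handler handler : handlers) {
            if (handler.accepts.test(header)) {
                return handler.name;
            }
        }
        return null;
    }

    static Handler etymology() {
        return new Handler("etymology", h -> h.equalsIgnoreCase("Etymology"));
    }

    /** Assumed catch-all: treats any non-empty header as a part-of-speech section. */
    static Handler sense() {
        return new Handler("sense", h -> !h.isEmpty());
    }

    public static void main(String[] args) {
        HandlerOrderSketch good = new HandlerOrderSketch().register(etymology()).register(sense());
        HandlerOrderSketch bad = new HandlerOrderSketch().register(sense()).register(etymology());
        System.out.println(good.select("Etymology")); // etymology
        System.out.println(bad.select("Etymology"));  // sense (the catch-all shadows etymology)
    }
}
```

Under these assumptions, registering the catch-all handler first would hide every more specific handler, which is consistent with the advice to register SenseHandler last.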
1. Add your new language as a new static field in this class: src/main/java/de/tudarmstadt/ukp/jwktl/api/util/Language.java
2. Add your new language parser instantiation in the 'onSiteInfoComplete' method in the class: 'src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java'
The project has a lot of unit tests. Just grab some from an existing language and adjust them for your new language.
Even though JWKTL does a lot of parsing and gives us information in a structured way, there is still some parsing we need to do in the dump parser tool.
For example, when iterating over the senses of the word 'casas' (Spanish for houses) we will get this from the JWKTL entry: "{{f.s.p|casa}}"
SAGA cannot use that string as it is, because the raw template markup carries no meaning for it. The dump parser therefore needs to parse the string: it detects that 'f.s.p' is a template used to denote a plural noun form and that the only parameter used is the root word.
Therefore, the dump parser tool will create a plural type relationship between the word 'casas' and its root 'casa'. This relationship entry is something SAGA understands and will use to reduce the word 'casas' to 'casa' in the Lemmatize stage.
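The 'casas' example can be sketched roughly as follows. This is not the tool's actual SenseParser: the class name, the Relation type, and the "plural" label are hypothetical, and the regular expression only handles simple, non-nested templates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): turns a form-of template into a word/root relation. */
final class FormTemplateSketch {

    // Matches a simple, non-nested template such as {{f.s.p|casa}}:
    // group 1 is the template name, group 2 is the raw "|param|param" tail.
    private static final Pattern TEMPLATE =
            Pattern.compile("\\{\\{([^{}|]+)((?:\\|[^{}|]*)*)\\}\\}");

    /** Hypothetical relation record: 'word' reduces to 'root' through 'type'. */
    static final class Relation {
        final String type;
        final String word;
        final String root;

        Relation(String type, String word, String root) {
            this.type = type;
            this.word = word;
            this.root = root;
        }
    }

    /** Returns a plural relation for an f.s.p template, or null when the text is not recognized. */
    static Relation parse(String word, String senseText) {
        Matcher m = TEMPLATE.matcher(senseText);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1).trim();
        String rawParams = m.group(2);
        List<String> params = new ArrayList<>();
        if (!rawParams.isEmpty()) {
            // Drop the leading '|' and keep empty parameters in place.
            params.addAll(Arrays.asList(rawParams.substring(1).split("\\|", -1)));
        }
        if ("f.s.p".equals(name) && !params.isEmpty()) {
            return new Relation("plural", word, params.get(0).trim());
        }
        return null; // Other templates would need their own handling.
    }

    public static void main(String[] args) {
        Relation r = parse("casas", "{{f.s.p|casa}}");
        System.out.println(r.word + " -> " + r.root + " (" + r.type + ")");
    }
}
```

A real implementation would need one such rule per template used in the target language's Wiktionary.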
1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse
2. Add a new folder and its corresponding SenseParser and RelationNormalizer files for your language.
As an example, for Spanish we have:
Review the existing files for both English and Spanish to get an idea of how to implement these two files for your new language. The implementation will depend heavily on the specific templates used in your language's Wiktionary.
3. Add page parser instantiation for your language in the method 'GetPageParser' in the class: '\src\main\java\com\searchtechnologies\wiktionary\WiktionaryParser.java':
4. Add normalizer instantiation for your language in the method 'GetNormalizer' in the class: '\src\main\java\com\searchtechnologies\wiktionary\RelationNormalization.java':
The tool is a command line tool. If you run it without any parameter you'll get help information.
Basically you need to run the tool three times:
1. Do the first run with the -parse option in order to parse the Wiktionary dump file and create an index:
Parameters:
-parse: Operation flag for parsing
file: Downloaded dump file
output: Output directory where JWKTL will store the DB (index)
Example: -parse file=c:/temp/wiktionary.xml output=c:/temp/index
2. Do a second run with the -mongo option to read the index and create entries in MongoDB. (Make sure you have a MongoDB server instance running.)
Parameters:
-mongo: Operation flag for adding info to Mongo
lang: 3 letter ISO code for the language of the Wiktionary File
indexDir: Directory where JWKTL index is stored
host: MongoDB host name
port: MongoDB port number
db: MongoDB database name
collection: MongoDB collection name
Example:
-mongo lang=spa indexDir=c:/temp/index host=localhost port=27017 db=dictionary collection=wiktionary
3. Do a third and last run with the -dict option to read the MongoDB collection and produce the JSON file that SAGA will eventually use:
Parameters:
-dict: Operation flag for generating SAGA file
lang: 3 letter ISO code for the language of the Wiktionary File
indexDir: Directory where JWKTL index is stored
outputDir: Output directory where SAGA file will be stored
host: MongoDB host name
port: MongoDB port number
db: MongoDB database name
collection: MongoDB collection name
Example:
-dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary
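To keep the three runs consistent with each other, the argument lists can be assembled in one place. The sketch below uses only the flags and parameter names listed above; the class and method names are hypothetical, and how the tool itself is launched (for example, a jar name) is not stated here, so it is left out.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (hypothetical names): builds the arguments for the three runs. */
final class DumpParserRuns {

    /** Run 1: parse the dump file and create the JWKTL index. */
    static List<String> parseArgs(String dumpFile, String indexDir) {
        return Arrays.asList("-parse", "file=" + dumpFile, "output=" + indexDir);
    }

    /** Run 2: read the index and create entries in MongoDB. */
    static List<String> mongoArgs(String lang, String indexDir,
                                  String host, int port, String db, String collection) {
        return Arrays.asList("-mongo", "lang=" + lang, "indexDir=" + indexDir,
                "host=" + host, "port=" + port, "db=" + db, "collection=" + collection);
    }

    /** Run 3: read the MongoDB collection and write the SAGA dictionary file. */
    static List<String> dictArgs(String lang, String indexDir, String outputDir,
                                 String host, int port, String db, String collection) {
        return Arrays.asList("-dict", "lang=" + lang, "indexDir=" + indexDir,
                "outputDir=" + outputDir, "host=" + host, "port=" + port,
                "db=" + db, "collection=" + collection);
    }

    public static void main(String[] args) {
        String index = "c:/temp/index";
        System.out.println(String.join(" ", parseArgs("c:/temp/wiktionary.xml", index)));
        System.out.println(String.join(" ",
                mongoArgs("spa", index, "localhost", 27017, "dictionary", "wiktionary")));
        System.out.println(String.join(" ",
                dictArgs("spa", index, "c:/temp/saga", "localhost", 27017, "dictionary", "wiktionary")));
    }
}
```

Reusing the same index directory, database, and collection values across the runs avoids the second or third run pointing at a location the previous run did not write to.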
Once you have created a new dictionary file for your language, you'll need to add it to saga-library so it can be used by the Lemmatize Stage.
First, rename the generated file to "wiktionary-[3 letter language code here]", using the three-letter ISO language code. For example, if your new language is German, the file should be named 'wiktionary-DEU'.
Then add the file to the saga-library Git repository: https://source.digital.accenture.com/projects/ST/repos/saga-library/browse at the following path: \src\main\resources
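The renaming rule can be sketched in Java. The class and method names are hypothetical; the sketch relies on the standard java.util.Locale mapping from two-letter to three-letter ISO codes, and it uppercases the code only because the German example above does so.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical names): derives the dictionary file name for a language. */
final class DictionaryFileName {

    /** Maps a two-letter ISO 639-1 code to the three-letter code Java knows for it. */
    static String iso3(String iso2) {
        return Locale.forLanguageTag(iso2).getISO3Language();
    }

    /** Builds the "wiktionary-XXX" name, uppercased as in the German example. */
    static String forLanguage(String iso2) {
        String code = iso3(iso2);
        if (code.isEmpty()) {
            throw new IllegalArgumentException("No three-letter code for: " + iso2);
        }
        return "wiktionary-" + code.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(forLanguage("de"));
        System.out.println(iso3("es"));
    }
}
```

The same three-letter code, in lowercase, is what the 'lang' parameter of the dump parser expects (for example, lang=spa).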
In order to test SAGA Lemmatize stage using your new language dictionary, add new unit tests to the file: "TestLemmatizeStage":
Remember to use the 'languageISO3' parameter with your language code when testing. Otherwise, the English dictionary is used by default.