...
It's recommended that you read this guide before you start writing code, so you get a better understanding of the big picture.
Dump files are generated periodically as backups on the Wikimedia site. You can see a list of all wiki backups, including Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html
...
Once downloaded, just unzip the file and you are ready to go.
As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.
...
This means that before implementing a new language for the SAGA Lemmatize Stage, it is a good idea to get familiar with the specifics of the language.
The first step is to open Wiktionary in your browser: https://www.wiktionary.org/
...
sustantivo femenino|es}}
===. This section defines the word form (a noun in this case) and all the different senses the word has. For more information, see the Help pages on the Wiktionary site. Also look for your language-specific help page. Some examples:
...
Info |
---|
Your desired language should have a page similar to the Spanish template listing (#4 above). This listing is very useful when implementing a new language because it gives you a list of all templates. You can also search it and open the documentation for each template to learn how it works. |
Wiktionary was created to be understood by humans, which means its data is not available in a structured form that computers can easily consume.
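To make the point concrete, here is a minimal sketch of what "parsing" a piece of wiki markup involves, using a template call like the one in the Spanish example above. The class `TemplateSketch` is hypothetical and is not part of JWKTL or the dump parser tool. It only handles a single flat template, and real entries need far more care (nested templates, links containing `|`, named parameters).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not a JWKTL class. */
final class TemplateSketch {
    final String name;
    final List<String> params;

    private TemplateSketch(String name, List<String> params) {
        this.name = name;
        this.params = params;
    }

    /**
     * Parses a single flat template such as "{{sustantivo femenino|es}}".
     * Returns null when the text is not wrapped in "{{" and "}}",
     * contains a nested template, or has an empty name.
     */
    static TemplateSketch parse(String text) {
        String t = text.trim();
        if (t.length() < 4 || !t.startsWith("{{") || !t.endsWith("}}")) {
            return null;
        }
        String body = t.substring(2, t.length() - 2);
        if (body.contains("{{") || body.contains("}}")) {
            return null; // nested templates are out of scope for this sketch
        }
        // The first segment is the template name, the rest are parameters.
        String[] parts = body.split("\\|", -1);
        String name = parts[0].trim();
        if (name.isEmpty()) {
            return null;
        }
        List<String> params = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            params.add(parts[i].trim());
        }
        return new TemplateSketch(name, params);
    }

    public static void main(String[] args) {
        TemplateSketch t = parse("{{sustantivo femenino|es}}");
        System.out.println(t.name + " -> " + t.params); // sustantivo femenino -> [es]
    }
}
```

Even this small case shows why a dedicated parser is needed: the meaning of the template name and of each positional parameter is defined by the language's own Wiktionary conventions, not by the markup itself.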
...
The following image shows the structure of the JWKTL project. Notice there is a folder for each language. You'll need to add a new folder for your desired language. Each language folder has a set of handlers, which are the ones that actually parse Wiktionary:
...
Info |
---|
Handlers are registered in the WiktionaryEntryParser for each language. The registration order is important; for example, SenseHandler needs to be the last one. The recommendation is to follow the same order defined by the English parser for the handlers you are implementing. |
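The sketch below illustrates one way registration order can matter. It assumes a first-match dispatch scheme in which a broad handler would shadow more specific ones if it were registered first. That mechanism is an assumption made for illustration, and every class here (`HandlerOrderSketch`, its `Handler`, the predicates) is a hypothetical stand-in rather than JWKTL's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-ins for illustration only; these are not JWKTL classes. */
final class HandlerOrderSketch {

    /** A named handler that decides from the section heading whether it applies. */
    static final class Handler {
        final String name;
        final Predicate<String> accepts;

        Handler(String name, Predicate<String> accepts) {
            this.name = name;
            this.accepts = accepts;
        }
    }

    private final List<Handler> handlers = new ArrayList<>();

    /** Handlers are consulted in the order they were registered. */
    void register(Handler handler) {
        handlers.add(handler);
    }

    /** Returns the name of the first handler accepting the heading, or null if none does. */
    String dispatch(String heading) {
        for (Handler h : handlers) {
            if (h.accepts.test(heading)) {
                return h.name;
            }
        }
        return null;
    }

    /** Registers a specific handler and a broad fallback, in either order. */
    static HandlerOrderSketch build(boolean fallbackLast) {
        // Assumption: the fallback accepts any heading, which is why it must come last.
        Handler specific = new Handler("SynonymHandler", h -> h.equals("Synonyms"));
        Handler fallback = new Handler("SenseHandler", h -> true);
        HandlerOrderSketch parser = new HandlerOrderSketch();
        if (fallbackLast) {
            parser.register(specific);
            parser.register(fallback);
        } else {
            parser.register(fallback);
            parser.register(specific);
        }
        return parser;
    }

    public static void main(String[] args) {
        System.out.println(build(true).dispatch("Synonyms"));  // SynonymHandler
        System.out.println(build(false).dispatch("Synonyms")); // SenseHandler
    }
}
```

Whatever the real dispatch rule is, copying the order used by the English parser avoids having to reason about these interactions from scratch.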
...
2. Add your new language parser instantiation in the 'onSiteInfoComplete' method in the class: 'src/main/java/de/tudarmstadt/ukp/jwktl/parser/WiktionaryArticleParser.java'
The project has a lot of unit tests. Just grab some from an existing language and adjust them for your new language.
Even though JWKTL does a lot of parsing and gives us information in a structured way, there is still some parsing we need to do in the dump parser tool.
...
Therefore, the dump parser tool will create a plural type relationship between the word 'casas' and its root 'casa'. This relationship entry is something SAGA understands and will use to reduce the word 'casas' to 'casa' in the Lemmatize stage.
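As a rough mental model of the 'casas' to 'casa' reduction, the following sketch stores relation entries in a map and looks tokens up in it. The `Relation` class, its fields, and the lookup are hypothetical. The actual dictionary format produced by the tool and the way SAGA consumes it are not shown here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the real SAGA dictionary format or lookup. */
final class PluralRelationSketch {

    /** One relation entry: an inflected word, the relation type, and its root. */
    static final class Relation {
        final String word;
        final String type;
        final String root;

        Relation(String word, String type, String root) {
            this.word = word;
            this.type = type;
            this.root = root;
        }
    }

    private final Map<String, String> rootByWord = new HashMap<>();

    /** Indexes the relation by its inflected form. */
    void add(Relation relation) {
        rootByWord.put(relation.word, relation.root);
    }

    /** Returns the root for a known inflected form, or the token unchanged. */
    String lemmatize(String token) {
        return rootByWord.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        PluralRelationSketch dictionary = new PluralRelationSketch();
        dictionary.add(new Relation("casas", "plural", "casa"));
        System.out.println(dictionary.lemmatize("casas")); // casa
        System.out.println(dictionary.lemmatize("perro")); // perro (no entry, unchanged)
    }
}
```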
1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse
...
Info |
---|
Review the existing files for both English and Spanish to get an idea of how to implement these two files for your new language. The implementation will depend a lot on the specific templates used in your language's Wiktionary. |
3. Add page parser instantiation for your language in the method 'GetPageParser' in the class: '\src\main\java\com\searchtechnologies\wiktionary\WiktionaryParser.java':
...
3. Add normalizer instantiation for your language in the method 'GetNormalizer' in the class: '\src\main\java\com\searchtechnologies\wiktionary\RelationNormalization.java':
The tool is a command-line tool. If you run it without any parameters, it prints help information.
...
-dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary
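The options in the command above follow a `key=value` pattern. The sketch below shows one way such arguments could be split into a map. It is an illustration only: `KeyValueArgsSketch` is a hypothetical class, and the tool's own argument handling may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the dump parser tool's actual argument handling. */
final class KeyValueArgsSketch {

    /** Splits "key=value" tokens into a map; tokens without '=' (such as "-dict") are skipped. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                options.put(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        String line = "-dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga "
                + "host=localhost port=27017 db=dictionary collection=wiktionary";
        Map<String, String> options = parse(line.split("\\s+"));
        System.out.println(options.get("lang"));                   // spa
        System.out.println(Integer.parseInt(options.get("port"))); // 27017
    }
}
```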
Once you have created a new dictionary file for your language, you'll need to add it to Saga-Library so it can be used by the Lemmatize Stage.
...
Then add the file to the saga-library Git repository: https://source.digital.accenture.com/projects/ST/repos/saga-library/browse at the following path: \src\main\resources
To test the SAGA Lemmatize Stage using your new language dictionary, add new unit tests to the file "TestLemmatizeStage":
...
Info |
---|
Remember to use the 'languageISO3' parameter with your language code when testing. Otherwise, the English dictionary is used by default. |
...