...
Wikimedia periodically generates dump files as backups of its sites. You can see a list of all wiki backups, including the Wiktionary for each language, at this URL: https://dumps.wikimedia.org/backup-index.html
To find your language, use its two-letter ISO 639-1 language code followed by the word 'wiktionary'. For example:
For English: enwiktionary
For Spanish: eswiktionary
For German: dewiktionary
And so on...
Here is a useful list of language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php
For Spanish you should see something like this:
Once you know it exists, you can click on the link to load a landing page for the language, or even simpler, go directly to the file list page using this URL: https://dumps.wikimedia.org/eswiktionary/latest/
Note that the word 'eswiktionary' is used in the URL; you can replace it with the name for your desired language, for example 'dewiktionary' if you want the German one.
When the page loads, look for the file whose name ends in "-latest-pages-articles.xml.bz2" and download it. For example, for Spanish the file name is "eswiktionary-latest-pages-articles.xml.bz2".
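The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `dump_url` and `download_dump` are hypothetical, and the URL pattern is simply the one shown above for Spanish, assumed to hold for other languages.

```python
import urllib.request

# Base address of the Wikimedia dump site (taken from the URLs above).
DUMPS_BASE = "https://dumps.wikimedia.org"


def dump_url(lang_code):
    """Build the URL of the latest pages-articles dump for a language.

    Assumes the pattern shown above for Spanish applies to every language:
    <code>wiktionary/latest/<code>wiktionary-latest-pages-articles.xml.bz2
    """
    project = f"{lang_code.lower()}wiktionary"
    return f"{DUMPS_BASE}/{project}/latest/{project}-latest-pages-articles.xml.bz2"


def download_dump(lang_code, target_path):
    """Download the dump to target_path (hypothetical helper, may take a while)."""
    urllib.request.urlretrieve(dump_url(lang_code), target_path)
    return target_path
```

For example, `dump_url("de")` gives the address of the German file, and `download_dump("es", "eswiktionary-latest-pages-articles.xml.bz2")` would fetch the Spanish one.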
Once downloaded, decompress it (the file is a bzip2 archive, so use a tool that supports .bz2, such as bunzip2 or 7-Zip) and you are ready to go.
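If you prefer to decompress the file from a script, a minimal sketch using only the Python standard library could look like this. The function name `decompress_bz2` and the default chunk size are assumptions made for illustration.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None, chunk_size=1024 * 1024):
    """Decompress a .bz2 file in chunks so the whole dump never sits in memory.

    If dst is omitted, the output path is src without its ".bz2" suffix.
    """
    src = Path(src)
    if dst is None:
        if src.suffix != ".bz2":
            raise ValueError(f"expected a .bz2 file, got {src.name}")
        dst = src.with_suffix("")  # foo.xml.bz2 -> foo.xml
    else:
        dst = Path(dst)
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Calling `decompress_bz2("eswiktionary-latest-pages-articles.xml.bz2")` would write `eswiktionary-latest-pages-articles.xml` next to the downloaded file.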
As you may already know, Wiktionary is a multilingual project to create a free-content dictionary of all words in all languages.
Each word you find in Wiktionary is called a page, article, or entry, and it is added manually by volunteers around the globe.
All languages are different and each has its own community of contributors, so the format used to write these entries varies a lot between languages.
This means that before implementing a new language for SAGA Lemmatizer it is a good idea to get familiar with the specifics of the language.
The first step is to load Wiktionary in your browser: https://www.wiktionary.org/
Then choose your desired language. For this guide we'll use the Spanish one: https://es.wiktionary.org/wiki/Wikcionario:Portada
Then look for a word, for example 'casa' (Spanish for 'house' or 'home'). Once the word page is found and displayed, click on the Edit tab:
You should see the entry definition, like this:
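The same entry source can also be read from the downloaded dump instead of the browser. The sketch below is an assumption-laden illustration, not part of SAGA Lemmatizer: the function names are hypothetical, and it assumes the usual MediaWiki export layout in which each page element has a title child and the wikitext sits in a text element inside its revision.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def find_wikitext(dump_path, title):
    """Return the wikitext of the page called `title`, or None if not found.

    Accepts either the compressed .bz2 dump or the decompressed .xml file
    and streams through it, so the whole dump is never loaded at once.
    """
    opener = bz2.open if str(dump_path).endswith(".bz2") else open
    with opener(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            page_title = None
            text = None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    page_title = child.text
                elif name == "text":
                    text = child.text
            if page_title == title:
                return text or ""
            elem.clear()  # release the finished page before moving on
    return None
```

For example, `find_wikitext("eswiktionary-latest-pages-articles.xml.bz2", "casa")` should return the same markup that the Edit tab shows for 'casa', as of the date the dump was generated.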
In general and for all languages:
For more information
...