...

To produce the dictionary file, we use an internal tool called "Wiktionary Dump Parser". This tool relies on an open source library called "Java Wiktionary Library" (JWKTL), which makes parsing the Wiktionary dump file much easier.

The following diagram shows the components and the stages the tool uses to produce a dictionary file:

...

Step 3: Get familiar with JWKTL and add support for your language

Wiktionary was created to be read by humans, which means its data is not available in a structured form that computers can easily consume.
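To make the problem concrete, here is a minimal sketch of what pulling structure out of raw page markup involves. It assumes the conventions used on English Wiktionary pages (`==Language==` headings and `# ` sense lines). The class `WikitextSketch` and the sample page are made up for illustration and are not part of JWKTL, which handles far more cases than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a JWKTL class.
class WikitextSketch {

    /** Returns the "# ..." sense lines found under the "==language==" section. */
    static List<String> definitions(String wikitext, String language) {
        List<String> result = new ArrayList<>();
        boolean inLanguage = false;
        for (String raw : wikitext.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("==") && !line.startsWith("===")) {
                // Level-2 heading: a new language section starts here.
                String name = line.replace("=", "").trim();
                inLanguage = name.equals(language);
            } else if (inLanguage && line.startsWith("# ")) {
                // Sense line inside the requested language section.
                result.add(line.substring(2).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample page, not real Wiktionary content.
        String page = String.join("\n",
            "==English==",
            "===Noun===",
            "# First sample sense.",
            "# Second sample sense.",
            "==Spanish==",
            "===Noun===",
            "# A sense that belongs to another language.");
        System.out.println(definitions(page, "English"));
    }
}
```

Even this toy version has to track which language section it is in, which hints at why a dedicated library is worth using.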

Data from Wiktionary can be valuable in many projects, including Natural Language Processing (NLP). Companies do sell dictionaries, but prices are usually high. This is why it makes sense to parse Wiktionary data and create a dictionary that we can use in our projects.

Java Wiktionary Library (JWKTL) is an open source project that aims to ease the parsing of Wiktionary data. See the project site: https://dkpro.github.io/dkpro-jwktl/

If you download the source from their site, JWKTL currently supports only English, German and Russian.

At Accenture, we added Spanish support and you can get the source from Git: https://source.digital.accenture.com/projects/ST/repos/saga-jwktl/browse

For Spanish, we focused on the bare minimum SAGA needs to work. If you want to do the same, it may be a good idea to base your new language on the Spanish parser. If you want to implement a more complete version of the parser, the English parser is a better starting point.

The following image shows the structure of the JWKTL project. Notice there is a folder for each language. You'll need to add a new folder for your language and then add the supporting classes, using an existing language as a model.


[Image: structure of the JWKTL project, with one folder per language]
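The one-folder-per-language idea can be pictured with the sketch below. Every name in it (`LanguageRegistry`, `LanguageSupport`, `EnglishSupport`, `SpanishSupport`) and the heading-to-tag mappings are hypothetical and chosen only for illustration; JWKTL's real classes and packages differ, so treat this as a mental model and not as the library's API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: one entry per supported language.
class LanguageRegistry {
    private static final Map<String, LanguageSupport> BY_CODE = new HashMap<>();
    static {
        BY_CODE.put("en", new EnglishSupport());
        BY_CODE.put("es", new SpanishSupport());
        // A new language would be registered here with its own class.
    }

    static LanguageSupport forCode(String code) {
        LanguageSupport support = BY_CODE.get(code);
        if (support == null) {
            throw new IllegalArgumentException("No support registered for: " + code);
        }
        return support;
    }

    public static void main(String[] args) {
        System.out.println(forCode("es").partOfSpeech("Sustantivo"));
    }
}

// Hypothetical contract that each language-specific class fulfils.
interface LanguageSupport {
    /** Maps a heading label to a coarse part-of-speech tag, or null if unknown. */
    String partOfSpeech(String heading);
}

class EnglishSupport implements LanguageSupport {
    public String partOfSpeech(String heading) {
        switch (heading.trim().toLowerCase(Locale.ROOT)) {
            case "noun": return "NOUN";
            case "verb": return "VERB";
            default: return null;
        }
    }
}

// Illustrative labels only; a real Spanish parser must follow the actual page markup.
class SpanishSupport implements LanguageSupport {
    public String partOfSpeech(String heading) {
        switch (heading.trim().toLowerCase(Locale.ROOT)) {
            case "sustantivo": return "NOUN";
            case "verbo": return "VERB";
            default: return null;
        }
    }
}
```

Under this model, adding a language means writing one more class that follows the shared contract and registering it, which mirrors the advice to copy an existing language and adapt it.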









 


...