...
For example, when iterating over the senses of the word 'casas' (Spanish for 'houses'), we get this from the JWKTL entry: "{{f.s.p|casa}}"
...
Therefore, the dump parser tool creates a plural-type relationship between the word 'casas' and its root 'casa'. SAGA understands this relationship entry and uses it to reduce the word 'casas' to 'casa' in the Lemmatize stage.
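The step from template to root word can be sketched in Java as follows. This is only an illustration, not the tool's actual SenseParser code: the class name `PluralTemplateSketch` and the method `extractRoot` are hypothetical, and the sketch assumes the sense text contains the template exactly in the form shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of saga-wiktionary-dump-parser.
class PluralTemplateSketch {

    // Matches the plural-form template with one argument, e.g. {{f.s.p|casa}}.
    private static final Pattern PLURAL_TEMPLATE =
            Pattern.compile("\\{\\{\\s*f\\.s\\.p\\s*\\|\\s*([^|}]+?)\\s*\\}\\}");

    // Returns the root word named in the template, or empty if the template is absent.
    static Optional<String> extractRoot(String senseText) {
        Matcher m = PLURAL_TEMPLATE.matcher(senseText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // For the sense text of 'casas', the root word is the template argument.
        extractRoot("{{f.s.p|casa}}")
                .ifPresent(root -> System.out.println("plural: casas -> " + root));
    }
}
```

A real SenseParser will likely need to handle more templates and variations than this single pattern; the existing English and Spanish implementations in the repository are the reference.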
...
1. Get the code from Git: https://source.digital.accenture.com/projects/ST/repos/saga-wiktionary-dump-parser/browse
2. Add a new folder and its corresponding SenseParser and RelationNormalizer files for your language. As an example, for Spanish we have:
...
Info |
---|
Review the existing files for both English and Spanish to get an idea of how to implement these two files for your new language. The implementation will depend heavily on the specific templates used in your language's Wiktionary. |
3. Add the page parser instantiation for your language to the 'GetPageParser' method in the class '\src\main\java\com\searchtechnologies\wiktionary\WiktionaryParser.java':
...
The tool is a command-line tool. If you run it without any parameters, it prints help information.
Basically you need to run the tool 3 times:
1. Do the first run with the -parse option to parse the Wiktionary dump file and create an index:
...
Example: -parse file=c:/temp/wiktionary.xml output=c:/temp/index
2. Do a second run with the -mongo option to read the index and create entries in MongoDB (make sure you have a proper MongoDB server instance running):
...
-mongo lang=spa indexDir=c:/temp/index host=localhost port=27017 db=dictionary collection=wiktionary
3. Do a third and last run with the -dict option to read the MongoDB collection and produce a JSON file that SAGA will eventually use:
...
-dict lang=spa indexDir=c:/temp/index outputDir=c:/temp/saga host=localhost port=27017 db=dictionary collection=wiktionary
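The three runs share several values: the index directory written by the first run is read by the second and third, and the MongoDB settings of the second run are reused by the third. The Java sketch below composes the three argument lines from one set of values to make that dependency explicit. It is an illustration only: the class `DumpParserRunsSketch` and its methods are hypothetical and not part of the tool, and it only builds argument strings without invoking the tool.

```java
import java.util.List;

// Hypothetical helper, not part of saga-wiktionary-dump-parser.
class DumpParserRunsSketch {

    // Run 1: parse the Wiktionary dump file and create an index.
    static List<String> parseArgs(String dumpFile, String indexDir) {
        return List.of("-parse", "file=" + dumpFile, "output=" + indexDir);
    }

    // Run 2: read the index and create entries in MongoDB.
    static List<String> mongoArgs(String lang, String indexDir, String host,
                                  int port, String db, String collection) {
        return List.of("-mongo", "lang=" + lang, "indexDir=" + indexDir,
                "host=" + host, "port=" + port, "db=" + db, "collection=" + collection);
    }

    // Run 3: read the MongoDB collection and produce the JSON file for SAGA.
    static List<String> dictArgs(String lang, String indexDir, String outputDir,
                                 String host, int port, String db, String collection) {
        return List.of("-dict", "lang=" + lang, "indexDir=" + indexDir,
                "outputDir=" + outputDir, "host=" + host, "port=" + port,
                "db=" + db, "collection=" + collection);
    }

    public static void main(String[] args) {
        String indexDir = "c:/temp/index"; // shared by all three runs
        System.out.println(String.join(" ", parseArgs("c:/temp/wiktionary.xml", indexDir)));
        System.out.println(String.join(" ",
                mongoArgs("spa", indexDir, "localhost", 27017, "dictionary", "wiktionary")));
        System.out.println(String.join(" ",
                dictArgs("spa", indexDir, "c:/temp/saga", "localhost", 27017, "dictionary", "wiktionary")));
    }
}
```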
...
First, rename the created file to "wiktionary-XXX", where XXX is the ISO 3-letter language code. So if your new language is German, the name should be 'wiktionary-DEU'.
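The naming rule can be sketched in Java as follows. The class `DictionaryFileNameSketch` and its methods are hypothetical. The sketch assumes the code is upper-cased as in the German example and that the 3-letter code returned by Java's `Locale.getISO3Language()` is the one SAGA expects, so verify the result against your SAGA installation.

```java
import java.util.Locale;

// Hypothetical helper, not part of saga-wiktionary-dump-parser.
class DictionaryFileNameSketch {

    // Builds the "wiktionary-XXX" name from a 3-letter language code.
    static String dictionaryName(String iso3Code) {
        if (iso3Code == null || iso3Code.length() != 3) {
            throw new IllegalArgumentException("Expected a 3-letter language code: " + iso3Code);
        }
        return "wiktionary-" + iso3Code.toUpperCase(Locale.ROOT);
    }

    // Looks up the 3-letter code for a 2-letter language tag such as "de".
    static String iso3For(String languageTag) {
        return Locale.forLanguageTag(languageTag).getISO3Language();
    }

    public static void main(String[] args) {
        System.out.println(dictionaryName(iso3For("de")));
    }
}
```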
...