The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.
It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).
The model works better on longer texts of at least 2 sentences, so configure this stage early in the pipeline, before the text is tokenized.
Operates On: Lexical Items with TEXT_BLOCK flag.
Library: saga-lang-detector-stage
Git: https://source.digital.accenture.com/projects/ST/repos/saga-lang-detector-stage/browse
No configuration parameters are needed.
{ "type":"LangDetectorStage" }
Example Output
As you can see, the first sentence is tagged with "LANG_ENG" and the second with "LANG_SPA". In this case, a sentence breaker stage was configured before the language detector stage, so language identification happened at sentence level.
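The sentence-level tagging described above can be illustrated with a minimal Java sketch. It is not the stage's actual implementation: the class LangTagSketch, its methods, the regex-based sentence split, and the stub detector are all hypothetical stand-ins. The tag format ("LANG_" followed by the upper-cased ISO 639-3 code) is inferred from the "LANG_ENG" and "LANG_SPA" tags mentioned above. In a real pipeline the detector function would be backed by the OpenNLP language detector model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from the Saga library.
final class LangTagSketch {

    // Turns an ISO 639-3 code such as "eng" into a tag such as "LANG_ENG".
    // The tag format is an assumption inferred from the example tags in the text.
    static String toTag(String iso639_3) {
        return "LANG_" + iso639_3.toUpperCase(Locale.ROOT);
    }

    // Naive split on whitespace that follows '.', '!' or '?'.
    // A crude stand-in for a real sentence breaker stage.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Applies the detector to each sentence and returns one tag per sentence.
    // The detector is expected to return an ISO 639-3 code for the given text.
    static List<String> tagSentences(String text, Function<String, String> detector) {
        List<String> tags = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            tags.add(toTag(detector.apply(sentence)));
        }
        return tags;
    }

    public static void main(String[] args) {
        // Stub detector for the demo only; it does not perform real language detection.
        Function<String, String> stub = s -> s.startsWith("Hola") ? "spa" : "eng";
        System.out.println(tagSentences("Hello, how are you? Hola, como estas.", stub));
    }
}
```

Passing the detector in as a function keeps the sketch independent of any particular model, so the same flow applies whether detection runs per sentence or on a whole text block.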