The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.
It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).
Note that the model works better on longer texts of at least two sentences, so configure this stage early in the pipeline, before the text is tokenized.
Operates On: Lexical Items with TEXT_BLOCK flag.
No configuration parameters are needed.
{ "type":"LangDetectorStage" }
Example Output
In this example, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA". In this case, a sentence breaker stage was configured before the language detector stage, so language identification happens at the sentence level.
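The sentence-level tagging described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (LangTagSketch, toTag, toyDetect, tagSentences) are hypothetical and not part of the stage's API, the "LANG_" plus upper-cased ISO 639-3 code format is inferred from the "LANG_ENG" and "LANG_SPA" tags mentioned above, and a toy stopword counter stands in for the OpenNLP model so that the sketch runs without any model file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch: these names are illustrative, not the stage's real API.
class LangTagSketch {

    // Assumption: a flag is "LANG_" plus the upper-cased ISO 639-3 code.
    static String toTag(String iso6393) {
        return "LANG_" + iso6393.toUpperCase(Locale.ROOT);
    }

    // Counts how many of the given stopwords occur as whole words in the padded text.
    static int hits(String padded, String... stopwords) {
        int n = 0;
        for (String w : stopwords) {
            if (padded.contains(" " + w + " ")) n++;
        }
        return n;
    }

    // Toy stand-in for the OpenNLP model: picks English or Spanish by stopword hits.
    static String toyDetect(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ") + " ";
        int eng = hits(padded, "the", "is", "and", "of");
        int spa = hits(padded, "el", "la", "es", "y", "de");
        return spa > eng ? "spa" : "eng";
    }

    // Tags each sentence on its own, as when a sentence breaker runs first.
    static List<String> tagSentences(List<String> sentences, Function<String, String> detector) {
        List<String> tags = new ArrayList<>();
        for (String s : sentences) {
            tags.add(toTag(detector.apply(s)));
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
            "The weather is nice and warm today.",
            "El clima es agradable y el cielo es azul.");
        // prints [LANG_ENG, LANG_SPA]
        System.out.println(tagSentences(sentences, LangTagSketch::toyDetect));
    }
}
```

Because the detector is passed in as a function, a real detector could be swapped in for toyDetect. With OpenNLP on the classpath and the language detector model loaded, that would likely be something like text -> detector.predictLanguage(text).getLang() on a LanguageDetectorME instance.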