The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.
Operates On: Lexical Items with TEXT_BLOCK flag.
Library: saga-lang-detector-stage
It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).
The model works better with longer texts containing at least two sentences. Configure this stage early in the pipeline, before the text is tokenized.
Generic Configuration Parameters
Tokens to process must be inside two vertices marked with this flag (e.g., ["TEXT_BLOCK_SPLIT"]).
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
Tokens need to have all of the specified flags in order to be processed.
Tokens need to have at least one of the flags specified in this array in order to be processed.
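The three flag filters above can be read as one predicate applied to each token. The following is a minimal sketch of that logic, not the stage's actual implementation: the function name and the parameter names `skip_flags`, `required_flags`, and `at_least_one_flags` are hypothetical, and it assumes that an empty list disables the corresponding filter and that skip flags take precedence.

```python
def should_process(token_flags, skip_flags=(), required_flags=(), at_least_one_flags=()):
    """Return True if a token carrying `token_flags` passes all three filters.

    Hypothetical helper: parameter names and empty-list behavior are assumptions.
    """
    flags = set(token_flags)
    # A token carrying any skip flag is ignored outright.
    if flags & set(skip_flags):
        return False
    # Every required flag must be present on the token.
    if not set(required_flags) <= flags:
        return False
    # When this list is non-empty, the token must carry at least one of its flags.
    if at_least_one_flags and not (flags & set(at_least_one_flags)):
        return False
    return True


# Example: a text block that also carries a (hypothetical) "IGNORE" flag is skipped.
print(should_process({"TEXT_BLOCK"}, required_flags=["TEXT_BLOCK"]))
print(should_process({"TEXT_BLOCK", "IGNORE"}, skip_flags=["IGNORE"]))
```

The boundary flag from the first description is left out of the sketch because it depends on the token's position between vertices rather than on the token's own flags.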
No stage-specific configuration parameters are needed.
{ "type": "LangDetectorStage" }
In the example, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".
In this case, a sentence breaker stage was configured before the language detector stage. As a result, language identification can occur at the sentence level.
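The effect of running a sentence breaker first can be illustrated with a small sketch. It is not the stage's code. The stop-word scorer is a toy stand-in for the OpenNLP model, the regular-expression splitter is a naive stand-in for a sentence breaker stage, and the "LANG_" plus uppercase ISO 639-3 code naming is inferred from the two flags mentioned above. All function names are hypothetical.

```python
import re

# Toy stop-word lists keyed by ISO 639-3 code.
# Assumption: a stand-in for the OpenNLP model, which works very differently.
STOPWORDS = {
    "eng": {"the", "is", "and", "of", "this", "a", "in"},
    "spa": {"el", "la", "es", "y", "de", "esta", "un", "en"},
}


def detect_language(text):
    """Return the code whose stop-word list matches the most words in `text`."""
    words = re.findall(r"\w+", text.lower())
    scores = {code: sum(w in stop for w in words) for code, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)


def tag_sentences(text):
    """Split `text` into sentences, then tag each one with a LANG_<CODE> flag."""
    # Naive sentence breaker: split on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(s, "LANG_" + detect_language(s).upper()) for s in sentences]


for sentence, flag in tag_sentences(
    "This is the first sentence of the text. Esta es la segunda frase de un texto."
):
    print(flag, sentence)
```

Without the splitting step, the whole input would be scored as a single block and receive one language flag, which is why the position of the sentence breaker in the pipeline matters.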