The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.
Operates On: Lexical Items with TEXT_BLOCK flag.
Library: saga-lang-detector-stage
It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).
The model works better on longer texts of at least two sentences, so configure this stage early in the pipeline, before the text is tokenized.
Generic Configuration Parameters
Tokens to process must be inside two vertices marked with this flag (e.g., ["TEXT_BLOCK_SPLIT"]).
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
Tokens need to have all of the specified flags in order to be processed.
Tokens will need at least one of the flags specified in this array.
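The three flag filters described above can be read as a single predicate. The following is a minimal sketch in Python, not Saga's actual implementation: the names `should_process`, `skip_flags`, `required_flags`, and `at_least_one_flags` are hypothetical labels for the descriptions above, and the sketch assumes that an empty filter imposes no restriction and that carrying any one skip flag is enough to ignore a token.

```python
def should_process(token_flags, skip_flags=(), required_flags=(), at_least_one_flags=()):
    """Return True if a token carrying `token_flags` passes all three filters.

    All parameter names are hypothetical; they mirror the descriptions in the text.
    """
    flags = set(token_flags)
    # Skip filter: a token carrying any skip flag is ignored (assumption: "any").
    if flags & set(skip_flags):
        return False
    # Required filter: the token must carry every required flag.
    if not set(required_flags) <= flags:
        return False
    # At-least-one filter: if given, the token must carry one or more of these flags.
    if at_least_one_flags and not (flags & set(at_least_one_flags)):
        return False
    return True
```

For example, `should_process({"TEXT_BLOCK"}, required_flags=["TEXT_BLOCK"])` returns `True`, while adding a flag that also appears in `skip_flags` makes the same call return `False`.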
No stage-specific configuration parameters are needed.
{ "type": "LangDetectorStage" }
Example Output
As you can see, the first sentence is tagged with "LANG_ENG" and the second with "LANG_SPA". In this case, a sentence breaker stage was configured before the language detector stage, so language identification happens at the sentence level.
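The sentence-level tagging described above can be sketched as follows. This is an illustration only, not the stage's code: it assumes the flag name is "LANG_" followed by the upper-cased ISO 639-3 code (consistent with "LANG_ENG" and "LANG_SPA"), and `toy_detect` is a hypothetical stand-in for the OpenNLP model that only distinguishes two languages by a handful of Spanish stop words.

```python
def lang_flag(iso_code):
    """Build a flag name from an ISO 639-3 code (assumed naming scheme)."""
    return "LANG_" + iso_code.upper()


def tag_sentences(sentences, detect):
    """Tag each sentence with the flag for the language `detect` reports."""
    return [(sentence, lang_flag(detect(sentence))) for sentence in sentences]


def toy_detect(text):
    """Hypothetical stand-in for the real model: returns "spa" or "eng" only."""
    spanish_hints = {"el", "la", "es", "una", "esta"}
    words = set(text.lower().split())
    return "spa" if words & spanish_hints else "eng"


tagged = tag_sentences(["This is a sentence.", "Esta es una frase."], toy_detect)
print(tagged)
# [('This is a sentence.', 'LANG_ENG'), ('Esta es una frase.', 'LANG_SPA')]
```

Because detection runs once per sentence here, each sentence gets its own flag; without the sentence breaker the whole text block would be passed to the detector as one unit.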