The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.


Operates On: Lexical Items with the TEXT_BLOCK flag.

Tip

The model can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).

Note

The model works better with longer texts of at least 2 sentences, so configure this stage early in the pipeline, before the text is tokenized.

...

.
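The length guidance in the note can be illustrated with a minimal sketch. It assumes a hypothetical helper class, `DetectionGate`, which is not part of Saga or OpenNLP, and it uses the JDK's `BreakIterator` as a stand-in sentence splitter. The 2-sentence threshold comes from the note; everything else is an assumption made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Saga or OpenNLP. */
final class DetectionGate {
    // Threshold taken from the note: the model works better with at least 2 sentences.
    static final int MIN_SENTENCES = 2;

    private DetectionGate() {}

    /** Counts non-blank sentences using the JDK's locale-neutral boundary rules. */
    static int countSentences(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Skip segments that contain only whitespace.
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Returns true when the text meets the assumed minimum sentence count. */
    static boolean longEnoughForDetection(String text) {
        return countSentences(text) >= MIN_SENTENCES;
    }

    public static void main(String[] args) {
        String block = "Der Vertrag wurde gestern unterschrieben. Die Lieferung beginnt im Mai.";
        String token = "Vertrag";
        System.out.println(longEnoughForDetection(block));
        System.out.println(longEnoughForDetection(token));
    }
}
```

A whole text block passes the check while a single token does not, which is why the stage is best placed before tokenization: once the text has been split into tokens, each piece is too short for the model to classify reliably.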


Library: saga-lang-detector-stage

...