The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.


Operates On:  Lexical Items with TEXT_BLOCK flag.

Library: saga-lang-detector-stage

Recognizer: false

Tip

The model can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).

Note

The model works better with longer texts containing at least two sentences. It is important to configure this stage early in the pipeline, before the text is tokenized.



Configuration Parameters

No configuration parameters are required. The following parameter is optional:

  • langModel (default: langdetect-183.bin) - Path to the OpenNLP language model.
"atLeastOneFlag": [],
"boundaryFlags": ["SENTENCE_SPLIT", "TEXT_BLOCK_SPLIT"],
"confidenceAdjustment": 1,
"debug": false,
"langModel": "langdetect-183.bin",
"dontProcessFlags": [],
"requiredFlags": [],
"skipFlags": []
Example Configuration

{
  "type": "LangDetectorStage"
}


Example Output

As you can see, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".

In this case, a sentence breaker stage was configured before the language detector stage. As a result, language identification can occur at the sentence level.
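The effect of running a sentence breaker first can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_detect`, `split_sentences`, and `tag` are hypothetical helpers, not part of Saga or OpenNLP, and the tiny stopword lists stand in for the real language model.

```python
import re

# Toy stand-in for the real language model: a few stopwords per language.
# These lists are illustrative assumptions, not the OpenNLP model.
STOPWORDS = {
    "ENG": {"the", "is", "this", "a", "and", "of"},
    "SPA": {"el", "la", "es", "esta", "una", "y", "de"},
}


def toy_detect(text):
    """Return the language code whose stopwords occur most often in text."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)


def split_sentences(block):
    """Naive sentence breaker: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", block) if s.strip()]


def tag(units):
    """Attach one LANG_??? style flag to each unit of text."""
    return [(unit, "LANG_" + toy_detect(unit)) for unit in units]


block = "This is a short sentence. Esta es una frase corta."

# Without a sentence breaker, the whole block receives a single flag.
per_block = tag([block])

# With a sentence breaker first, each sentence receives its own flag.
per_sentence = tag(split_sentences(block))

for sentence, flag in per_sentence:
    print(flag, "<-", sentence)
```

When the text is split first, a mixed-language block yields one flag per sentence; when it is not, the detector has to choose a single language for the entire block.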


[Image: example output]

Output Flags

Lex-Item Flags

  • TEXT_BLOCK - Flags all text blocks produced by the SimpleReader.
  • LANG - Flags all tokens where language detection was applied, for easy detection.
  • LANG_??? - Flags all text blocks where a language was identified. 

    Note

    Notice the '???' at the end of the flag. This is replaced by an ISO three-letter language code. For example, if Spanish is detected, the three-letter code is SPA, and the flag will be "LANG_SPA".
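The naming rule for the LANG_??? flag can be sketched as a small helper. This is a minimal illustration, assuming the flag is simply the prefix "LANG_" followed by the upper-cased ISO 639-3 code; `language_flag` is a hypothetical name and not part of the Saga API.

```python
def language_flag(iso_639_3_code: str) -> str:
    """Build a LANG_??? style flag name from a three-letter ISO 639-3 code."""
    code = iso_639_3_code.strip()
    # ISO 639-3 codes are exactly three ASCII letters.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter language code: {iso_639_3_code!r}")
    return "LANG_" + code.upper()


print(language_flag("spa"))  # LANG_SPA
print(language_flag("eng"))  # LANG_ENG
```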


Vertex Flags

Info

No vertices are created in this stage.