Library: saga-lang-detector-stage

Recognizer: false

Tip

This stage can detect 103 languages and outputs ISO 639-3 language codes. (https://opennlp.apache.org/news/model-langdetect-183.html)

Note

The model works better on longer texts that contain at least two sentences. Configure this stage early in the pipeline, before the text is tokenized.


Generic Configuration Parameters

Configuration Parameters

No configuration parameters are required. The following parameter is optional:

  • langModel - Path to the OpenNLP language model. Default: langdetect-183.bin

"atLeastOneFlag": [],
"boundaryFlags": ["SENTENCE_SPLIT", "TEXT_BLOCK_SPLIT"],
"confidenceAdjustment": 1,
"debug": false,
"langModel": "langdetect-183.bin",
"dontProcessFlags": [],
"requiredFlags": [],
"skipFlags": []

Example Configuration

{
  "type": "LangDetectorStage"
}
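
To point the stage at a specific model file, the langModel parameter listed above can presumably be set in the same object. The sketch below builds such a configuration in Python and serializes it with the standard json module, which always emits strict JSON with no trailing commas. Placing "langModel" next to "type" in the stage object is an assumption, not something this page confirms.

```python
import json

# Assumption: "langModel" sits next to "type" inside the stage object.
stage_config = {
    "type": "LangDetectorStage",
    "langModel": "langdetect-183.bin",  # default model name from the parameter list
}

# json.dumps produces strict JSON, so the result can be pasted into a pipeline file.
stage_json = json.dumps(stage_config, indent=2)
print(stage_json)
```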


Example Output

In the example output, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".

In this case, a sentence breaker stage was configured before the language detector stage, so language identification occurs at the sentence level.



Output Flags

Lex-Item Flags

  • TEXT_BLOCK - Flags all text blocks produced by the SimpleReader.
  • LANG - Flags all tokens to which language detection was applied, so they are easy to find.
  • LANG_??? - Flags all text blocks where a language was identified. 

    Note

    The '???' at the end of the flag is replaced by the three-letter ISO 639-3 language code.

    For example, if Spanish is detected, the three-letter code is SPA and the flag is "LANG_SPA".
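
The naming rule above can be sketched in Python as follows. This is only an illustration: the helper names lang_flag and detected_language, and the representation of an item's flags as a set of strings, are hypothetical and not part of the Saga API.

```python
from typing import Iterable, Optional

LANG_PREFIX = "LANG_"


def lang_flag(iso_code: str) -> str:
    """Build the flag name for a three-letter language code, e.g. 'spa' -> 'LANG_SPA'."""
    if len(iso_code) != 3 or not iso_code.isalpha():
        raise ValueError(f"expected a three-letter code, got {iso_code!r}")
    return LANG_PREFIX + iso_code.upper()


def detected_language(flags: Iterable[str]) -> Optional[str]:
    """Return the lowercase code from the first LANG_??? flag, or None if there is none."""
    for flag in sorted(flags):
        code = flag[len(LANG_PREFIX):]
        # The bare "LANG" flag does not start with "LANG_", so it is skipped here.
        if flag.startswith(LANG_PREFIX) and len(code) == 3 and code.isalpha():
            return code.lower()
    return None


print(lang_flag("spa"))                                       # LANG_SPA
print(detected_language({"TEXT_BLOCK", "LANG", "LANG_SPA"}))  # spa
```

Matching on the "LANG_" prefix rather than on "LANG" keeps the generic LANG flag from being mistaken for a language-specific one.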


Vertex Flags

None

Info

No vertices are created in this stage.