Library: saga-lang-detector-stage

Saga_is_recognizer

Recognizer	false

Tip
This stage can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).

Note
The model works better with longer texts containing at least two sentences. Configure this stage early in the pipeline, before the text is tokenized.

Include Page

	Generic Configuration Parameters

Configuration Parameters

No configuration parameters are required. The following parameter is optional.

Parameter
name	langModel
summary	Path to the OpenNLP language model.
default	langdetect-183.bin

Code Block

language	js

"atLeastOneFlag": [],
"boundaryFlags": ["SENTENCE_SPLIT", "TEXT_BLOCK_SPLIT"],
"confidenceAdjustment": 1,
"debug": false,
"langModel": "langdetect-183.bin",
"dontProcessFlags": [],
"requiredFlags": [],
"skipFlags": []

Code Block

language	js
theme	Eclipse
title	Example Configuration

{
  "type": "LangDetectorStage"
}
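
Because langModel has a default value, it only needs to appear in the configuration when a different model file is used. The following is a minimal Python sketch of assembling the stage configuration programmatically. The helper name lang_detector_config is hypothetical and not part of Saga, and the sketch assumes that omitting langModel makes the stage fall back to its documented default.

```python
import json


def lang_detector_config(lang_model=None):
    """Build a LangDetectorStage config dict (hypothetical helper, not a Saga API).

    langModel is only written when a model path is passed explicitly;
    otherwise the stage is assumed to use its default, langdetect-183.bin.
    """
    config = {"type": "LangDetectorStage"}
    if lang_model is not None:
        config["langModel"] = lang_model
    return config


# Minimal config, equivalent to the example configuration on this page.
print(json.dumps(lang_detector_config(), indent=2))

# Config that names the model file explicitly.
print(json.dumps(lang_detector_config("langdetect-183.bin"), indent=2))
```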

Example Output

In this example, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".

In this case, a sentence breaker stage was configured before the language detector stage. As a result, language identification can occur at the sentence level.

[Image: example output]

Output Flags

Lex-Item Flags

TEXT_BLOCK - Flags all text blocks produced by the SimpleReader.
LANG - Flags all tokens to which language detection was applied, so they are easy to find.
LANG_??? - Flags all text blocks where a language was identified.
Note
The '???' at the end of the flag is replaced by the three-letter ISO 639-3 language code.
For example, if Spanish is detected, the three-letter code is SPA and the flag is "LANG_SPA".
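
The naming rule for LANG_??? flags can be sketched in Python as follows. This is only an illustration of the convention described above, not Saga code: the function names lang_flag and lang_from_flag are hypothetical, and the sketch assumes that the code portion is always three uppercase ASCII letters.

```python
import re

# "LANG_" followed by exactly three uppercase letters (assumed flag shape).
LANG_FLAG_PATTERN = re.compile(r"LANG_([A-Z]{3})")


def lang_flag(iso_code):
    """Return the flag name for a three-letter language code, e.g. 'spa'."""
    code = iso_code.strip().upper()
    if not re.fullmatch(r"[A-Z]{3}", code):
        raise ValueError(f"expected a three-letter code, got {iso_code!r}")
    return "LANG_" + code


def lang_from_flag(flag):
    """Return the lowercase language code in a LANG_??? flag, or None.

    The plain LANG flag and unrelated flags such as TEXT_BLOCK return None.
    """
    match = LANG_FLAG_PATTERN.fullmatch(flag)
    return match.group(1).lower() if match else None


print(lang_flag("spa"))            # LANG_SPA
print(lang_from_flag("LANG_ENG"))  # eng
print(lang_from_flag("LANG"))      # None
```

A helper like lang_from_flag is handy when downstream code needs to separate the generic LANG flag from the language-specific ones.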

Vertex Flags
none

Info
No vertices are created in this stage
