Excerpt
The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.

Operates On: Lexical Items with TEXT_BLOCK flag.

Library: saga-lang-detector-stage

Saga_is_recognizer

Recognizer	false

Tip
It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).

Note
The model works better with longer texts containing at least two sentences.

Configure this stage early in the pipeline, before the text is tokenized.
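
The ordering requirement can be checked mechanically. The Python sketch below assumes a pipeline is described as an ordered list of stage configurations, each with a "type" key. The tokenizer stage name ExampleTokenizerStage and the helper lang_detector_precedes are hypothetical and exist only for illustration; they are not part of Saga.

```python
def lang_detector_precedes(stages, detector_type="LangDetectorStage",
                           tokenizer_types=("ExampleTokenizerStage",)):
    """Return True if the detector appears before every tokenizer stage.

    `stages` is an ordered list of dicts with a "type" key. The tokenizer
    type names are placeholders, not real Saga stage names.
    """
    types = [stage["type"] for stage in stages]
    if detector_type not in types:
        return False
    detector_pos = types.index(detector_type)
    # Every tokenizer must come after the language detector.
    return all(pos > detector_pos
               for pos, name in enumerate(types)
               if name in tokenizer_types)


good = [{"type": "LangDetectorStage"}, {"type": "ExampleTokenizerStage"}]
bad = [{"type": "ExampleTokenizerStage"}, {"type": "LangDetectorStage"}]
print(lang_detector_precedes(good))  # True
print(lang_detector_precedes(bad))   # False
```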

Include Page

	Generic Configuration Parameters

Configuration Parameters

No configuration parameters are required. One optional parameter is available:

langModel - Path to the OpenNLP language model. Default: langdetect-183.bin

Example Configuration

{ "type": "LangDetectorStage" }

Saga_config_stage

requiredFlags	text_block
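
A configuration that overrides the model path might be generated as in the Python sketch below. It assumes that langModel is written as a top-level key next to "type", which this page does not confirm; the helper lang_detector_config is hypothetical.

```python
import json


def lang_detector_config(lang_model=None):
    """Build a stage configuration dict for the language detector.

    Assumption: `langModel` sits beside "type" in the stage object.
    When `lang_model` is None the key is omitted, so the stage default
    (langdetect-183.bin) applies.
    """
    config = {"type": "LangDetectorStage"}
    if lang_model is not None:
        config["langModel"] = lang_model
    return config


print(json.dumps(lang_detector_config("langdetect-183.bin")))
# {"type": "LangDetectorStage", "langModel": "langdetect-183.bin"}
```

Serializing through json.dumps keeps the output valid JSON, which avoids problems such as a stray trailing comma.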

Example Output

In this example, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".

A sentence breaker stage was configured before the language detector stage. As a result, language identification can occur at the sentence level.

Output Flags

Lex-Item Flags

TEXT_BLOCK - Flags all text blocks produced by the SimpleReader.
LANG - Flags all tokens to which language detection was applied, so they are easy to find.
LANG_??? - Flags all text blocks where a language was identified.
Note
Notice '???' at the end of the Flag. This is replaced by an ISO three-letter language code. For example, if Spanish is detected, the three-letter code is SPA, and the Flag will be "LANG_SPA".
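
The naming rule in the note can be sketched in Python as follows. Both helpers, lang_flag and detected_language, are hypothetical illustrations of the LANG_??? convention and are not part of the Saga API; the sketch also assumes flags are available as plain strings.

```python
def lang_flag(iso_code):
    """Build the flag name for a three-letter ISO 639-3 code."""
    code = iso_code.strip().upper()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError("expected a three-letter language code: %r" % iso_code)
    return "LANG_" + code


def detected_language(flags):
    """Return the lowercase language code from a LANG_??? flag, or None."""
    for flag in sorted(flags):
        # "LANG_" plus three letters is eight characters; plain "LANG" is skipped.
        if flag.startswith("LANG_") and len(flag) == 8:
            return flag[5:].lower()
    return None


print(lang_flag("spa"))                                       # LANG_SPA
print(detected_language({"TEXT_BLOCK", "LANG", "LANG_SPA"}))  # spa
```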

Vertex Flags
Info
No vertices are created in this stage.
