Language Detector Stage

The language detector stage uses OpenNLP (https://opennlp.apache.org/) and its language detector model to identify the language of a text block.

Operates On: Lexical Items with TEXT_BLOCK flag.

Library: saga-lang-detector-stage

It can detect 103 languages and outputs ISO 639-3 language codes (https://opennlp.apache.org/news/model-langdetect-183.html).

The model works better with longer texts containing at least two sentences. Configure this stage early in the pipeline, before the text is tokenized.

Generic Configuration Parameters

boundaryFlags ( type=string array | optional ) - List of vertex flags that indicate the beginning and end of a text block.
Tokens to process must lie between two vertices marked with this flag (e.g., ["TEXT_BLOCK_SPLIT"]).
skipFlags ( type=string array | optional ) - Flags to be skipped by this stage.
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
requiredFlags ( type=string array | optional ) - Lex-item flags that every token must have in order to be processed.
Tokens need to have all of the specified flags in order to be processed.
atLeastOneFlag ( type=string array | optional ) - Lex-item flags of which every token must have at least one in order to be processed.
Tokens need at least one of the flags specified in this array.
confidenceAdjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
0.0 to < 1.0 decreases confidence value
1.0 confidence value remains the same
> 1.0 to 2.0 increases confidence value
debug ( type=boolean | default=false | optional ) - Enable all debug log functionality for the stage, if any.
enable ( type=boolean | default=true | optional ) - Indicates whether the current stage should be considered by the Pipeline Manager.
Only applies for automatic pipeline building
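
The way skipFlags, requiredFlags and atLeastOneFlag combine can be sketched as follows. This is only an illustration of the rules described above, not the stage's actual code: the function name is_eligible is hypothetical, and the sketch assumes that skipFlags takes precedence over the other two lists and that an empty list imposes no constraint.

```python
from typing import Iterable, Set


def is_eligible(token_flags: Set[str],
                skip_flags: Iterable[str] = (),
                required_flags: Iterable[str] = (),
                at_least_one_flag: Iterable[str] = ()) -> bool:
    """Return True if a token with token_flags would be processed.

    Hypothetical helper; it mirrors the documented rules only.
    """
    skip = set(skip_flags)
    required = set(required_flags)
    any_of = set(at_least_one_flag)

    # skipFlags: a token carrying any of these flags is ignored.
    # (Assumption: skipping wins over the other two checks.)
    if token_flags & skip:
        return False

    # requiredFlags: the token must carry every listed flag.
    if not required <= token_flags:
        return False

    # atLeastOneFlag: the token must carry at least one listed flag.
    # (Assumption: an empty list means "no constraint".)
    if any_of and not (token_flags & any_of):
        return False

    return True


if __name__ == "__main__":
    flags = {"TEXT_BLOCK", "LANG_ENG"}
    print(is_eligible(flags, required_flags=["TEXT_BLOCK"]))
    print(is_eligible(flags, skip_flags=["LANG_ENG"]))
```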

Configuration Parameters

No configuration parameters are needed.

Example Configuration

{
 "type": "LangDetectorStage"
}

Example Output

In the example output, the first sentence is tagged with "LANG_ENG" and the second sentence with "LANG_SPA".

In this case, a sentence breaker stage was configured before the language detector stage. As a result, language identification can occur at the sentence level.

Output Flags

Lex-Item Flags

TEXT_BLOCK - Flags all text blocks produced by the SimpleReader.
LANG_??? - Flags all text blocks where a language was identified.
The '???' at the end of the flag is replaced by the three-letter ISO 639-3 language code.
For example, if Spanish is detected, the three-letter code is SPA and the flag is "LANG_SPA".
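
The naming rule for the LANG_??? flag can be illustrated with a minimal sketch. The helper language_flag is hypothetical and is not part of the stage. It assumes the detector reports a three-letter ISO 639-3 code in any letter case (for example "spa") and that the flag is simply the upper-cased code prefixed with "LANG_".

```python
def language_flag(iso_code: str) -> str:
    """Build a LANG_??? flag name from a three-letter ISO 639-3 code.

    Hypothetical helper; the real stage may derive the flag differently.
    """
    code = iso_code.strip().upper()
    # ISO 639-3 codes are exactly three ASCII letters.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter language code: {iso_code!r}")
    return "LANG_" + code


if __name__ == "__main__":
    print(language_flag("spa"))  # prints LANG_SPA
    print(language_flag("eng"))  # prints LANG_ENG
```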

Vertex Flags

none
