Breaks a text block into sentences using Java's BreakIterator. Sentence boundaries are detected from punctuation delimiters. Additional delimiter characters can be configured as "breakers".

Operates On:  Lexical Items with TEXT_BLOCK and possibly other flags as specified below.

Recognizer: false

Tip

A better alternative could be the Sentence Breaker Stage.



Generic Configuration Parameters

  • Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.
  • Tokens need to have all of the specified flags in order to be processed.
  • Enable all debug log functionality of the stage, if any.

Configuration Parameters

  • language - Selects the language for processing the text.
    • Use the Java Locales.
  • breakers - String containing the characters where the sentences will be split.
    • The character split is done after the sentence split is applied to the text.

Example Configuration

"language":"en",
"breakers":";"

Example Output

V------------[Lorem ipsum sit amet, consectetur adipiscing. Sed luctus lorem. Cras nec ultricies nulla. Maecenas porta cursus; massa non consectetur.]-------------V
^--[Lorem ipsum sit amet, consectetur adipiscing]--V--[Sed luctus lorem]--V--[Cras nec ultricies nulla]--V--[Maecenas porta cursus]--V--[massa non consectetur]--^
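
The two-step split described above (sentences first, then breaker characters) can be approximated with the standard Java BreakIterator. This is only a minimal sketch, not the stage's actual implementation: the class name TextBreakerSketch and the method breakText are hypothetical, and unlike the example output the sketch keeps the trailing sentence punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the stage: illustrates the split order only.
class TextBreakerSketch {

    /**
     * Splits text into sentences for the given language tag (for example "en"),
     * then splits every sentence on each character found in breakers.
     */
    static List<String> breakText(String text, String language, String breakers) {
        List<String> pieces = new ArrayList<>();
        BreakIterator sentences =
                BreakIterator.getSentenceInstance(Locale.forLanguageTag(language));
        sentences.setText(text);

        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            // The character split happens after the sentence split.
            StringTokenizer parts = new StringTokenizer(sentence, breakers);
            while (parts.hasMoreTokens()) {
                String piece = parts.nextToken().trim();
                if (!piece.isEmpty()) {
                    pieces.add(piece);
                }
            }
        }
        return pieces;
    }
}
```

Calling breakText with the example text, "en" and ";" should yield five pieces, with the last sentence divided at the semicolon.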

Output Flags

Lex-Item Flags:

  • TEXT_BLOCK - Flags all text blocks produced by the TextBreakerStage.

Vertex Flags:

  • The current maximum size of a text block is 64K characters.
  • Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
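
To make the size limit concrete, the following sketch cuts an oversized text into fixed-size blocks. It rests on assumptions: that "64K" means 65,536 characters and that the arbitrary split is a plain cut at the limit. The class OverflowSplitSketch and its members are hypothetical names, and the sketch does not model how the vertex is flagged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows one possible way an oversized block could be cut.
class OverflowSplitSketch {

    // Assumption: "64K characters" is read as 64 * 1024 Java chars.
    static final int MAX_BLOCK_CHARS = 64 * 1024;

    /** Cuts text into consecutive blocks of at most maxChars characters. */
    static List<String> splitOversized(String text, int maxChars) {
        List<String> blocks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            int end = Math.min(text.length(), start + maxChars);
            blocks.add(text.substring(start, end));
        }
        return blocks;
    }
}
```

A cut at a fixed offset can land in the middle of a word or a surrogate pair, which is one reason such a split is described as arbitrary.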

Resource Data

Description of resource.

Resource Format

The only file which is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Entity JSON Format

{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  
  . . . additional fields as needed go here . . . 
}

Notes

  1. Multiple entities can have the same pattern.
    1. If the pattern is matched, then it will be tagged with multiple (ambiguous) entity IDs.
  2. Additional fielded data can be added to the record
    1. As needed by downstream processes.

Fields

  • id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries).
    • Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
  • tags (required, array of string) - The list of semantic tags which will be added to the interpretation graph whenever any of the patterns are matched.
    • These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
    • Typically, multiple tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area}
  • patterns (required, array of string) - A list of patterns to match in the content.
    • Patterns will be tokenized and there may be multiple variations which can match.
      • NOTE:  Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase.
      • TODO:  This will need to be improved in the future, perhaps by specifying a pipeline to perform the tokenization and to allow for multiple variations.
  • confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
    • This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
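
The pattern handling described above (split on white-space and punctuation, lowercase, several entities allowed per pattern) can be sketched as a small in-memory index. This is an illustration under assumptions, not the toolkit's code: the class EntityPatternIndexSketch, its methods, and the regular expression used for tokenization are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical index: maps a tokenized pattern to every entity ID that declares it.
class EntityPatternIndexSketch {

    private final Map<List<String>, List<String>> index = new LinkedHashMap<>();

    /** Splits on white-space and ASCII punctuation, then lowercases the tokens. */
    static List<String> tokenize(String pattern) {
        List<String> tokens = new ArrayList<>();
        for (String token : pattern.toLowerCase(Locale.ROOT).split("[\\s\\p{Punct}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Registers every pattern of one entity record. */
    void add(String entityId, List<String> patterns) {
        for (String pattern : patterns) {
            index.computeIfAbsent(tokenize(pattern), key -> new ArrayList<>()).add(entityId);
        }
    }

    /** Returns all entity IDs whose pattern matches the text; several IDs mean an ambiguous match. */
    List<String> lookup(String text) {
        return index.getOrDefault(tokenize(text), Collections.emptyList());
    }
}
```

If two records both list the pattern "Lincoln", a lookup for "lincoln" returns both IDs, which corresponds to the ambiguous tagging mentioned in the notes.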

Other, Optional Fields

...

  • SENTENCE - Every text block processed by this breaker will be marked as SENTENCE.

Vertex Flags:

  • SENTENCE_SPLIT - Indicates the split (start/end) between sentences.
  • TEXT_BLOCK_SPLIT - Indicates the split of the text block.

Supported Languages

Lang | Language | Country
ar | Arabic | -
ar-AE | Arabic | United Arab Emirates
ar-BH | Arabic | Bahrain
ar-DZ | Arabic | Algeria
ar-EG | Arabic | Egypt
ar-IQ | Arabic | Iraq
ar-JO | Arabic | Jordan
ar-KW | Arabic | Kuwait
ar-LB | Arabic | Lebanon
ar-LY | Arabic | Libya
ar-MA | Arabic | Morocco
ar-OM | Arabic | Oman
ar-QA | Arabic | Qatar
ar-SA | Arabic | Saudi Arabia
ar-SD | Arabic | Sudan
ar-SY | Arabic | Syria
ar-TN | Arabic | Tunisia
ar-YE | Arabic | Yemen
be-BY | Belarusian | Belarus
bg-BG | Bulgarian | Bulgaria
ca-ES | Catalan | Spain
cs-CZ | Czech | Czech Republic
da-DK | Danish | Denmark
de | German | -
de-AT | German | Austria
de-CH | German | Switzerland
de-DE | German | Germany
de-GR | German | Greece
de-LU | German | Luxembourg
el | Greek | -
el-CY | Greek | Cyprus
el-GR | Greek | Greece
en | English | -
en-AU | English | Australia
en-CA | English | Canada
en-GB | English | United Kingdom
en-IE | English | Ireland
en-IN | English | India
en-MT | English | Malta
en-NZ | English | New Zealand
en-PH | English | Philippines
en-SG | English | Singapore
en-US | English | United States
en-ZA | English | South Africa
es | Spanish | -
es-AR | Spanish | Argentina
es-BO | Spanish | Bolivia
es-CL | Spanish | Chile
es-CO | Spanish | Colombia
es-CR | Spanish | Costa Rica
es-CU | Spanish | Cuba
es-DO | Spanish | Dominican Republic
es-EC | Spanish | Ecuador
es-ES | Spanish | Spain
es-GT | Spanish | Guatemala
es-HN | Spanish | Honduras
es-MX | Spanish | Mexico
es-NI | Spanish | Nicaragua
es-PA | Spanish | Panama
es-PE | Spanish | Peru
es-PR | Spanish | Puerto Rico
es-PY | Spanish | Paraguay
es-SV | Spanish | El Salvador
es-US | Spanish | United States
es-UY | Spanish | Uruguay
es-VE | Spanish | Venezuela
et-EE | Estonian | Estonia
fi-FI | Finnish | Finland
fr | French | -
fr-BE | French | Belgium
fr-CA | French | Canada
fr-CH | French | Switzerland
fr-FR | French | France
fr-LU | French | Luxembourg
ga-IE | Irish | Ireland
he-IL | Hebrew | Israel
hi-IN | Hindi | India
hr-HR | Croatian | Croatia
hu-HU | Hungarian | Hungary
id-ID | Indonesian | Indonesia
is-IS | Icelandic | Iceland
it | Italian | -
it-CH | Italian | Switzerland
it-IT | Italian | Italy
ja | Japanese | -
ja-JP | Japanese | Japan
ja-JP-u-ca-japanese-x-lvariant-JP | Japanese | Japan
ko-KR | Korean | South Korea
lt-LT | Lithuanian | Lithuania
lv-LV | Latvian | Latvia
mk-MK | Macedonian | Macedonia
ms-MY | Malay | Malaysia
mt-MT | Maltese | Malta
nl | Dutch | -
nl-BE | Dutch | Belgium
nl-NL | Dutch | Netherlands
nn-NO | Norwegian | Norway
no-NO | Norwegian | Norway
pl-PL | Polish | Poland
pt | Portuguese | -
pt-BR | Portuguese | Brazil
pt-PT | Portuguese | Portugal
ro-RO | Romanian | Romania
ru-RU | Russian | Russia
sk-SK | Slovak | Slovakia
sl-SI | Slovenian | Slovenia
sq-AL | Albanian | Albania
sr | Serbian | -
sr-BA | Serbian | Bosnia and Herzegovina
sr-CS | Serbian | Serbia and Montenegro
sr-Latn | Serbian | -
sr-Latn-BA | Serbian | Bosnia and Herzegovina
sr-Latn-ME | Serbian | Montenegro
sr-Latn-RS | Serbian | Serbia
sr-ME | Serbian | Montenegro
sr-RS | Serbian | Serbia
sv-SE | Swedish | Sweden
th | Thai | -
th-TH | Thai | Thailand
th-TH-u-nu-thai-x-lvariant-TH | Thai | Thailand
tr-TR | Turkish | Turkey
uk-UA | Ukrainian | Ukraine
vi-VN | Vietnamese | Vietnam
zh | Chinese | -
zh-CN | Chinese | China
zh-HK | Chinese | Hong Kong
zh-SG | Chinese | Singapore
zh-TW | Chinese | Taiwan
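
The Lang values above follow the language tag form that java.util.Locale understands. As a rough sketch, a configured language can be turned into a Locale and checked against the locales that the running JVM reports for BreakIterator. The class LanguageSupportSketch and its methods are hypothetical, and the set of available locales depends on the Java runtime, so it may not match the table exactly.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for inspecting a configured language value.
class LanguageSupportSketch {

    /** Converts a tag such as "en-US" or "sr-Latn-RS" into a Locale. */
    static Locale toLocale(String tag) {
        return Locale.forLanguageTag(tag);
    }

    /** Reports whether this JVM lists the locale among its BreakIterator locales. */
    static boolean isSupported(String tag) {
        Locale wanted = toLocale(tag);
        return Arrays.asList(BreakIterator.getAvailableLocales()).contains(wanted);
    }
}
```

For example, toLocale("sr-Latn-RS") produces a Locale with language "sr", script "Latn" and country "RS".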