Include Page

	Generic Configuration Parameters

Configuration Parameters

...

Parameter
summary Select the language for processing the text.
name language
- Use the Java Locales. A summarized list can be found in Supported Languages.

Parameter
summary String containing the characters where the sentences will be split.
name breakers
- The character split is done after the sentence split is applied to the text.

...

Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.

...

Tokens must have all of the specified flags in order to be processed.

...

Enable all debug log functionality of the stage, if any.
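
The two flag filters described above can be sketched as a single check. This is only an illustration, not the stage's implementation: the function name `should_process` and the parameter name `skip_flags` are hypothetical, and the sketch assumes that one matching skip flag is enough to ignore a token.

```python
def should_process(token_flags, skip_flags=(), required_flags=()):
    """Return True if a token carrying token_flags should be processed."""
    flags = set(token_flags)
    # Assumption: a single matching skip flag is enough to ignore the token.
    if flags & set(skip_flags):
        return False
    # Every required flag must be present on the token.
    return set(required_flags) <= flags


print(should_process({"TEXT_BLOCK"}, required_flags={"TEXT_BLOCK"}))
print(should_process({"TEXT_BLOCK", "SKIP"}, skip_flags={"SKIP"}))
```

With no skip flags and no required flags configured, every token passes the check.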

...

Code Block

language	js
theme	Eclipse
title	Example Configuration

requiredFlags	text_block

...

"language":"en",

...

"breakers":

...

";"

Example Output

...

Code Block
language text

...

------------[Lorem ipsum sit amet, consectetur adipiscing. Sed luctus lorem. Cras nec ultricies nulla. Maecenas porta cursus; massa non consectetur.]-------------V
^--[Lorem ipsum sit amet, consectetur adipiscing]--V--[Sed luctus lorem]--V--[Cras nec ultricies nulla]--V--[Maecenas porta cursus]--V--[massa non consectetur]--^
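
The example output can be approximated with a short sketch: split the text into sentences first, then split each sentence on the configured breaker characters. The function name `break_text` is hypothetical, and the regular-expression sentence split is only a stand-in for the stage's language-aware sentence splitting, which is not described here.

```python
import re


def break_text(text, breakers=";"):
    """Split text into sentences, then split each sentence on breaker characters."""
    # Naive sentence split (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    blocks = []
    for sentence in sentences:
        # Drop the closing punctuation so the blocks match the example output.
        body = sentence.rstrip(".!?")
        if breakers:
            # Treat every character of the breakers string as a split point.
            parts = re.split("[" + re.escape(breakers) + "]", body)
        else:
            parts = [body]
        blocks.extend(p.strip() for p in parts if p.strip())
    return blocks


text = ("Lorem ipsum sit amet, consectetur adipiscing. Sed luctus lorem. "
        "Cras nec ultricies nulla. Maecenas porta cursus; massa non consectetur.")
for block in break_text(text, breakers=";"):
    print("[" + block + "]")
```

The sentence split runs before the character split, matching the order stated for the breakers parameter.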

Output Flags

Lex-Item Flags:

TEXT_BLOCK - Flags all text blocks produced by the TextBreakerStage

...

Vertex Flags:

...

The current maximum size of a text block is 64K characters.
Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
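
As a rough sketch of the size limit, an oversized block could be cut into fixed-size pieces, with one OVERFLOW_SPLIT marker per cut. The name `split_oversized` is hypothetical, and reading "64K" as 65,536 characters is an assumption; the real stage may choose its split points differently.

```python
MAX_BLOCK_CHARS = 64 * 1024  # assumption: "64K" read as 65,536 characters


def split_oversized(text, max_chars=MAX_BLOCK_CHARS):
    """Cut text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    # One split point (vertex) sits between each pair of neighbouring pieces.
    split_flags = ["OVERFLOW_SPLIT"] * max(len(pieces) - 1, 0)
    return pieces, split_flags


pieces, flags = split_oversized("a" * 150000)
print([len(p) for p in pieces], flags)
```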

...

Resource Data

Description of resource.

Resource Format

The only file which is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Description of entity:
Entity JSON Format

Code Block

language	js
theme	Eclipse
title	Entity JSON Format

{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  
  . . . additional fields as needed go here . . . 
}

Notes

- Multiple entities can have the same pattern.
  - If the pattern is matched, then it will be tagged with multiple (ambiguous) entity IDs.
- Additional fielded data can be added to the record.
  - As needed by downstream processes.
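
The ambiguity noted above can be illustrated with a small lookup table that maps each pattern to every entity ID that lists it. This is a sketch only: `build_pattern_index` is a hypothetical helper, the second record is invented for the illustration, and patterns are matched here as whole lowercased strings rather than token sequences.

```python
from collections import defaultdict


def build_pattern_index(entities):
    """Map each lowercased pattern to the IDs of all entities that list it."""
    index = defaultdict(list)
    for entity in entities:
        for pattern in entity["patterns"]:
            key = pattern.lower()
            if entity["id"] not in index[key]:
                index[key].append(entity["id"])
    return dict(index)


entities = [
    {"id": "Q28260", "tags": ["{city}"], "patterns": ["Lincoln", "Lincoln, NE"]},
    # Hypothetical second record, invented for the illustration.
    {"id": "X-0001", "tags": ["{person}"], "patterns": ["Lincoln"]},
]
index = build_pattern_index(entities)
print(index["lincoln"])      # both entity IDs: the match is ambiguous
print(index["lincoln, ne"])  # a single entity ID
```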

Fields

id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries).
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
tags (required, array of string) - The list of semantic tags which will be added to the interpretation graph whenever any of the patterns are matched.
- These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
- Typically, multiple tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area}
patterns (required, array of string) - A list of patterns to match in the content.
- Patterns will be tokenized and there may be multiple variations which can match.
  - NOTE: Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase.
  - TODO: This will need to be improved in the future, perhaps by specifying a pipeline to perform the tokenization and to allow for multiple variations.
confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
- This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
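
The field rules and the simple tokenization described above can be sketched as follows. Both helper names are hypothetical, and the sketch assumes that punctuation is discarded rather than kept as separate tokens, which the description does not state.

```python
import re

REQUIRED_FIELDS = ("id", "tags", "patterns")


def tokenize_pattern(pattern):
    """Split on white-space and punctuation, then lowercase each token."""
    # Assumption: punctuation is dropped rather than kept as its own token.
    return [t.lower() for t in re.split(r"[\W_]+", pattern) if t]


def validate_entity(record):
    """Return a list of problems found in one entity record (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            problems.append(f"missing required field: {name}")
    if "id" in record and not isinstance(record["id"], str):
        problems.append("id must be a string")
    for name in ("tags", "patterns"):
        if name in record:
            value = record[name]
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                problems.append(f"{name} must be an array of strings")
    confidence = record.get("confidence")
    if confidence is not None and not isinstance(confidence, (int, float)):
        problems.append("confidence must be a number")
    return problems


record = {
    "id": "Q28260",
    "tags": ["{city}", "{administrative-area}", "{geography}"],
    "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
    "confidence": 0.95,
}
print(validate_entity(record))
print([tokenize_pattern(p) for p in record["patterns"]])
```

Additional fields are deliberately ignored by the check, since records may carry extra data for downstream processes.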

Other, Optional Fields

...

SENTENCE - Every text block processed by this breaker will be marked as SENTENCE.

Vertex Flags:

SENTENCE_SPLIT - Indicates the split (start/end) between sentences.
TEXT_BLOCK_SPLIT - Indicates the split of the text block.

Supported Languages

Lang	Language	Country
ar	Arabic	-
ar-AE	Arabic	United Arab Emirates
ar-BH	Arabic	Bahrain
ar-DZ	Arabic	Algeria
ar-EG	Arabic	Egypt
ar-IQ	Arabic	Iraq
ar-JO	Arabic	Jordan
ar-KW	Arabic	Kuwait
ar-LB	Arabic	Lebanon
ar-LY	Arabic	Libya
ar-MA	Arabic	Morocco
ar-OM	Arabic	Oman
ar-QA	Arabic	Qatar
ar-SA	Arabic	Saudi Arabia
ar-SD	Arabic	Sudan
ar-SY	Arabic	Syria
ar-TN	Arabic	Tunisia
ar-YE	Arabic	Yemen
be-BY	Belarusian	Belarus
bg-BG	Bulgarian	Bulgaria
ca-ES	Catalan	Spain
cs-CZ	Czech	Czech Republic
da-DK	Danish	Denmark
de	German	-
de-AT	German	Austria
de-CH	German	Switzerland
de-DE	German	Germany
de-GR	German	Greece
de-LU	German	Luxembourg
el	Greek	-
el-CY	Greek	Cyprus
el-GR	Greek	Greece
en	English	-
en-AU	English	Australia
en-CA	English	Canada
en-GB	English	United Kingdom
en-IE	English	Ireland
en-IN	English	India
en-MT	English	Malta
en-NZ	English	New Zealand
en-PH	English	Philippines
en-SG	English	Singapore
en-US	English	United States
en-ZA	English	South Africa
es	Spanish	-
es-AR	Spanish	Argentina
es-BO	Spanish	Bolivia
es-CL	Spanish	Chile
es-CO	Spanish	Colombia
es-CR	Spanish	Costa Rica
es-CU	Spanish	Cuba
es-DO	Spanish	Dominican Republic
es-EC	Spanish	Ecuador
es-ES	Spanish	Spain
es-GT	Spanish	Guatemala
es-HN	Spanish	Honduras
es-MX	Spanish	Mexico
es-NI	Spanish	Nicaragua
es-PA	Spanish	Panama
es-PE	Spanish	Peru
es-PR	Spanish	Puerto Rico
es-PY	Spanish	Paraguay
es-SV	Spanish	El Salvador
es-US	Spanish	United States
es-UY	Spanish	Uruguay
es-VE	Spanish	Venezuela
et-EE	Estonian	Estonia
fi-FI	Finnish	Finland
fr	French	-
fr-BE	French	Belgium
fr-CA	French	Canada
fr-CH	French	Switzerland
fr-FR	French	France
fr-LU	French	Luxembourg
ga-IE	Irish	Ireland
he-IL	Hebrew	Israel
hi-IN	Hindi	India
hr-HR	Croatian	Croatia
hu-HU	Hungarian	Hungary
id-ID	Indonesian	Indonesia
is-IS	Icelandic	Iceland
it	Italian	-
it-CH	Italian	Switzerland
it-IT	Italian	Italy
ja	Japanese	-
ja-JP	Japanese	Japan
ja-JP-u-ca-japanese-x-lvariant-JP	Japanese	Japan
ko-KR	Korean	South Korea
lt-LT	Lithuanian	Lithuania
lv-LV	Latvian	Latvia
mk-MK	Macedonian	Macedonia
ms-MY	Malay	Malaysia
mt-MT	Maltese	Malta
nl	Dutch	-
nl-BE	Dutch	Belgium
nl-NL	Dutch	Netherlands
nn-NO	Norwegian	Norway
no-NO	Norwegian	Norway
pl-PL	Polish	Poland
pt	Portuguese	-
pt-BR	Portuguese	Brazil
pt-PT	Portuguese	Portugal
ro-RO	Romanian	Romania
ru-RU	Russian	Russia
sk-SK	Slovak	Slovakia
sl-SI	Slovenian	Slovenia
sq-AL	Albanian	Albania
sr	Serbian	-
sr-BA	Serbian	Bosnia and Herzegovina
sr-CS	Serbian	Serbia and Montenegro
sr-Latn	Serbian	-
sr-Latn-BA	Serbian	Bosnia and Herzegovina
sr-Latn-ME	Serbian	Montenegro
sr-Latn-RS	Serbian	Serbia
sr-ME	Serbian	Montenegro
sr-RS	Serbian	Serbia
sv-SE	Swedish	Sweden
th	Thai	-
th-TH	Thai	Thailand
th-TH-u-nu-thai-x-lvariant-TH	Thai	Thailand
tr-TR	Turkish	Turkey
uk-UA	Ukrainian	Ukraine
vi-VN	Vietnamese	Vietnam
zh	Chinese	-
zh-CN	Chinese	China
zh-HK	Chinese	Hong Kong
zh-SG	Chinese	Singapore
zh-TW	Chinese	Taiwan
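
The values in the Lang column are hyphen-separated language tags of the kind Java's Locale class accepts. As a rough illustration of how such a tag breaks down, the hypothetical helper below pulls out the language, optional script and optional region, and ignores anything after an extension singleton such as "u" or "x". It is a simplified sketch, not a full language-tag parser.

```python
def parse_language_tag(tag):
    """Split a tag such as 'sr-Latn-BA' into language, script and region."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 1:
            break  # a singleton ('u', 'x', ...) starts an extension; stop here
        if (len(part) == 4 and part.isalpha()
                and result["script"] is None and result["region"] is None):
            result["script"] = part.title()
        elif len(part) == 2 and part.isalpha() and result["region"] is None:
            result["region"] = part.upper()
    return result


for tag in ("en", "sr-Latn-BA", "ja-JP-u-ca-japanese-x-lvariant-JP"):
    print(tag, parse_language_tag(tag))
```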
