Text Breaker Stage

Description

Operates On: Lexical Items with TOKEN and possibly other flags as specified below.

Configuration Parameters

language (string, optional) - Selects the language to use when processing the text.
- Uses Java locales; a summarized list can be found in Supported Languages.
breakers (string, optional) - String containing the characters at which sentences will be split.
- The character split is done after the sentence split has been performed on the text.
boundaryFlags (string array, optional)
- The tokens to process must lie between two vertices marked with these flags (e.g. ["TEXT_BLOCK_SPLIT"]).
skipFlags (string array, optional) - Flags to be skipped by this stage.
- Tokens marked with these flags will be ignored by this stage, and no processing will be performed on them.
requiredFlags (string array, optional)
- Tokens must have all of the specified flags in order to be processed.
debug (boolean, optional)
- Enables all debug log functionality of the stage, if any.
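
The interaction between requiredFlags and skipFlags can be sketched as below. This is only an illustration of the rules described above, not the stage's actual code: the function name should_process is hypothetical, and two details are assumptions, namely that a token is skipped when it carries any one of the skipFlags, and that the skip check runs before the required check.

```python
def should_process(token_flags, required_flags=(), skip_flags=()):
    """Return True when a token passes the requiredFlags/skipFlags checks.

    Hypothetical helper; the real stage may evaluate flags differently.
    """
    flags = set(token_flags)
    # skipFlags: assumed to mean "ignore the token if it has ANY of these".
    if flags & set(skip_flags):
        return False
    # requiredFlags: the token must carry ALL of the listed flags.
    return set(required_flags) <= flags


required = ["TOKEN", "ALL_LOWER_CASE"]
skip = ["SKIP"]

print(should_process(["TOKEN", "ALL_LOWER_CASE"], required, skip))          # True
print(should_process(["TOKEN", "ALL_LOWER_CASE", "SKIP"], required, skip))  # False
print(should_process(["TOKEN"], required, skip))                            # False
```

With both lists empty, every token passes, which matches both parameters being optional.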

Example Configuration

{
 "type":"XXX",
 "language":"en",
 "boundaryFlags":["TEXT_BLOCK_SPLIT"],
 "requiredFlags":["TOKEN", "ALL_LOWER_CASE"],
 "skipFlags": ["SKIP"],
 "debug": true
}

Example Output

Description

V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^

Output Flags

Lex-Item Flags:

TEXT_BLOCK - Flags all text blocks produced by the TextBreakerStage

Vertex Flags:

none
ALL_PUNCTUATION - Identifies the vertex as all punctuation.
- The default flag if no "splitFlag" is present.
<splitFlag> - Defines an alternative flag to ALL_PUNCTUATION, if desired (see above).
CHAR_CHANGE - Identifies the vertex as a change between character formats
TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
- The current maximum size of a text block is 64K characters.
- Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, new-lines, carriage returns, etc.)
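
As a rough illustration of the OVERFLOW_SPLIT and ALL_WHITESPACE flags described above, the sketch below cuts an over-long text into blocks and tests a span for whitespace. The function names and the exact limit of 64 * 1024 characters are assumptions made for the example; the stage's real buffering logic is not documented here.

```python
MAX_BLOCK_CHARS = 64 * 1024  # "64K characters"; the exact value is an assumption


def split_overflow(text, max_chars=MAX_BLOCK_CHARS):
    """Cut text into blocks of at most max_chars characters.

    Returns (block, flag) pairs. The flag is "OVERFLOW_SPLIT" when the block
    ends at a forced cut, and None for the final block.
    """
    blocks = []
    for start in range(0, len(text), max_chars):
        block = text[start:start + max_chars]
        forced_cut = start + max_chars < len(text)
        blocks.append((block, "OVERFLOW_SPLIT" if forced_cut else None))
    return blocks


def is_all_whitespace(span):
    """True when every character is whitespace (spaces, tabs, new-lines, ...)."""
    return span.isspace()  # str.isspace() is False for an empty string


print(split_overflow("a" * 10, max_chars=4))
print(is_all_whitespace(" \t\r\n"))
```

A small max_chars is used in the call only to keep the printed result short.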

Resource Data

Description of resource.

Resource Format

The only file which is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Description of entity:
Entity JSON Format

{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  
  . . . additional fields as needed go here . . . 
}

Notes

Multiple entities can have the same pattern.
- If the pattern is matched, then it will be tagged with multiple (ambiguous) entity IDs.
Additional fielded data can be added to the record.
- As needed by downstream processes.
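
To make the ambiguity note concrete, here is a minimal sketch of a pattern lookup in which one pattern maps to several entity IDs. It assumes one JSON record per line and lowercases whole patterns as a simplification; build_pattern_index and the second record (ID "E-0001") are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def build_pattern_index(records):
    """Map each lowercased pattern to the list of entity IDs that declare it."""
    index = defaultdict(list)
    for record in records:
        for pattern in record["patterns"]:
            index[pattern.lower()].append(record["id"])
    return dict(index)


# Two records sharing the pattern "Lincoln"; the second one is made up.
lines = [
    '{"id": "Q28260", "tags": ["{city}"], "patterns": ["Lincoln", "Lincoln, Nebraska"]}',
    '{"id": "E-0001", "tags": ["{person}"], "patterns": ["Lincoln", "Abraham Lincoln"]}',
]
records = [json.loads(line) for line in lines]
index = build_pattern_index(records)

print(index["lincoln"])            # ['Q28260', 'E-0001']
print(index["lincoln, nebraska"])  # ['Q28260']
```

A match on "lincoln" would therefore carry both IDs, leaving disambiguation to later processing.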

Fields

id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries).
- Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
tags (required, array of string) - The list of semantic tags which will be added to the interpretation graph whenever any of the patterns are matched.
- These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
- Typically, multiple tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geographical-area}
patterns (required, array of string) - A list of patterns to match in the content.
- Patterns will be tokenized and there may be multiple variations which can match.
  - NOTE: Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase.
  - TODO: This will need to be improved in the future, perhaps by specifying a pipeline to perform the tokenization and to allow for multiple variations.
confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
- This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
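
The pattern tokenization mentioned in the NOTE above (split on white-space and punctuation, then lowercase) might look roughly like this. It is a sketch under assumptions: the name tokenize_pattern is hypothetical, punctuation is assumed to be dropped rather than kept as separate tokens, and the regular expression is only one plausible reading of "simple white-space and punctuation".

```python
import re


def tokenize_pattern(pattern):
    """Split a pattern on runs of non-alphanumeric characters and lowercase it."""
    # [\W_]+ matches one or more characters that are not letters or digits,
    # which covers white-space and punctuation; empty pieces are discarded.
    return [piece.lower() for piece in re.split(r"[\W_]+", pattern) if piece]


print(tokenize_pattern("Lincoln, Nebraska"))  # ['lincoln', 'nebraska']
print(tokenize_pattern("Lincoln, NE"))        # ['lincoln', 'ne']
```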

Other, Optional Fields

display (optional, string) - What to show the user when browsing this entity.
context (optional, object) - A context vector which can help disambiguate this entity from others with the same pattern.
- Format TBD, but probably a list of weighted words, phrases and tags.

Supported Languages

Lang	Language	Country
ar	Arabic	-
ar-AE	Arabic	United Arab Emirates
ar-BH	Arabic	Bahrain
ar-DZ	Arabic	Algeria
ar-EG	Arabic	Egypt
ar-IQ	Arabic	Iraq
ar-JO	Arabic	Jordan
ar-KW	Arabic	Kuwait
ar-LB	Arabic	Lebanon
ar-LY	Arabic	Libya
ar-MA	Arabic	Morocco
ar-OM	Arabic	Oman
ar-QA	Arabic	Qatar
ar-SA	Arabic	Saudi Arabia
ar-SD	Arabic	Sudan
ar-SY	Arabic	Syria
ar-TN	Arabic	Tunisia
ar-YE	Arabic	Yemen
be-BY	Belarusian	Belarus
bg-BG	Bulgarian	Bulgaria
ca-ES	Catalan	Spain
cs-CZ	Czech	Czech Republic
da-DK	Danish	Denmark
de	German	-
de-AT	German	Austria
de-CH	German	Switzerland
de-DE	German	Germany
de-GR	German	Greece
de-LU	German	Luxembourg
el	Greek	-
el-CY	Greek	Cyprus
el-GR	Greek	Greece
en	English	-
en-AU	English	Australia
en-CA	English	Canada
en-GB	English	United Kingdom
en-IE	English	Ireland
en-IN	English	India
en-MT	English	Malta
en-NZ	English	New Zealand
en-PH	English	Philippines
en-SG	English	Singapore
en-US	English	United States
en-ZA	English	South Africa
es	Spanish	-
es-AR	Spanish	Argentina
es-BO	Spanish	Bolivia
es-CL	Spanish	Chile
es-CO	Spanish	Colombia
es-CR	Spanish	Costa Rica
es-CU	Spanish	Cuba
es-DO	Spanish	Dominican Republic
es-EC	Spanish	Ecuador
es-ES	Spanish	Spain
es-GT	Spanish	Guatemala
es-HN	Spanish	Honduras
es-MX	Spanish	Mexico
es-NI	Spanish	Nicaragua
es-PA	Spanish	Panama
es-PE	Spanish	Peru
es-PR	Spanish	Puerto Rico
es-PY	Spanish	Paraguay
es-SV	Spanish	El Salvador
es-US	Spanish	United States
es-UY	Spanish	Uruguay
es-VE	Spanish	Venezuela
et-EE	Estonian	Estonia
fi-FI	Finnish	Finland
fr	French	-
fr-BE	French	Belgium
fr-CA	French	Canada
fr-CH	French	Switzerland
fr-FR	French	France
fr-LU	French	Luxembourg
ga-IE	Irish	Ireland
he-IL	Hebrew	Israel
hi-IN	Hindi	India
hr-HR	Croatian	Croatia
hu-HU	Hungarian	Hungary
id-ID	Indonesian	Indonesia
is-IS	Icelandic	Iceland
it	Italian	-
it-CH	Italian	Switzerland
it-IT	Italian	Italy
ja	Japanese	-
ja-JP	Japanese	Japan
ja-JP-u-ca-japanese-x-lvariant-JP	Japanese	Japan
ko-KR	Korean	South Korea
lt-LT	Lithuanian	Lithuania
lv-LV	Latvian	Latvia
mk-MK	Macedonian	Macedonia
ms-MY	Malay	Malaysia
mt-MT	Maltese	Malta
nl	Dutch	-
nl-BE	Dutch	Belgium
nl-NL	Dutch	Netherlands
nn-NO	Norwegian	Norway
no-NO	Norwegian	Norway
pl-PL	Polish	Poland
pt	Portuguese	-
pt-BR	Portuguese	Brazil
pt-PT	Portuguese	Portugal
ro-RO	Romanian	Romania
ru-RU	Russian	Russia
sk-SK	Slovak	Slovakia
sl-SI	Slovenian	Slovenia
sq-AL	Albanian	Albania
sr	Serbian	-
sr-BA	Serbian	Bosnia and Herzegovina
sr-CS	Serbian	Serbia and Montenegro
sr-Latn	Serbian	-
sr-Latn-BA	Serbian	Bosnia and Herzegovina
sr-Latn-ME	Serbian	Montenegro
sr-Latn-RS	Serbian	Serbia
sr-ME	Serbian	Montenegro
sr-RS	Serbian	Serbia
sv-SE	Swedish	Sweden
th	Thai	-
th-TH	Thai	Thailand
th-TH-u-nu-thai-x-lvariant-TH	Thai	Thailand
tr-TR	Turkish	Turkey
uk-UA	Ukrainian	Ukraine
vi-VN	Vietnamese	Vietnam
zh	Chinese	-
zh-CN	Chinese	China
zh-HK	Chinese	Hong Kong
zh-SG	Chinese	Singapore
zh-TW	Chinese	Taiwan
