Description

Operates On:  Lexical Items with TOKEN and possibly other flags as specified below.

Configuration Parameters

  • language (string, optional) - Select the language to use for processing the text.
  • breakers (string, optional) - String containing the characters where the sentences will be split 
    • The character split is performed after the sentence split has been applied to the text.
  • boundaryFlags (string array, optional)
    • Tokens to process must lie between two vertices marked with these flags (e.g., ["TEXT_BLOCK_SPLIT"]).
  • skipFlags (string array, optional) - Flags to be skipped by this stage
    • Tokens marked with any of these flags are ignored by this stage; no processing is performed on them.
  • requiredFlags (string array, optional)
    • Tokens must carry all of the specified flags in order to be processed.
  • debug (boolean, optional)
    • Enable all debug log functionality of the stage, if any.
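
The interaction of requiredFlags and skipFlags can be sketched as follows. This is only an illustration, not the stage's implementation: the function name should_process is hypothetical, and checking skipFlags before requiredFlags is an assumption.

```python
def should_process(token_flags, required_flags=(), skip_flags=()):
    """Return True if a token passes the requiredFlags / skipFlags checks."""
    flags = set(token_flags)
    # Assumption: a single skip flag is enough for the stage to ignore the token.
    if flags & set(skip_flags):
        return False
    # Every required flag must be present on the token.
    return set(required_flags) <= flags


config = {"requiredFlags": ["TOKEN", "ALL_LOWER_CASE"], "skipFlags": ["SKIP"]}

tokens = [
    ("lincoln", ["TOKEN", "ALL_LOWER_CASE"]),
    ("Lincoln", ["TOKEN"]),                             # missing ALL_LOWER_CASE
    ("macaroni", ["TOKEN", "ALL_LOWER_CASE", "SKIP"]),  # carries a skip flag
]

processed = [
    text
    for text, flags in tokens
    if should_process(flags, config["requiredFlags"], config["skipFlags"])
]
print(processed)
```

With the flags shown, only the first token satisfies both conditions.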


Example Configuration
{
 "type":"XXX",
 "language":"en",
 "boundaryFlags":["TEXT_BLOCK_SPLIT"],
 "requiredFlags":["TOKEN", "ALL_LOWER_CASE"],
 "skipFlags": ["SKIP"],
 "debug": true
}



Example Output

Description

V--------------[abraham lincoln likes macaroni and cheese]--------------------V
^--[abraham]--V--[lincoln]--V--[likes]--V--[macaroni]--V--[and]--V--[cheese]--^
              ^---{place}---^           ^----{food}----^         ^---{food}---^
^----------{person}---------^           ^-----------------{food}--------------^


Output Flags

Lex-Item Flags:

  • TEXT_BLOCK - Flags all text blocks produced by the TextBreakerStage

Vertex Flags:

  • none
  • ALL_PUNCTUATION - Identifies that the characters spanned by the vertex are all punctuation
    • The default flag if no "splitFlag" is present.
  • <splitFlag> - Defines an alternative flag to ALL_PUNCTUATION, if desired (see above)
  • CHAR_CHANGE -  Identifies the vertex as a change between character formats
  • TEXT_BLOCK_SPLIT - Identifies the vertex as a split between text blocks.
  • OVERFLOW_SPLIT - Identifies that an entire buffer was read without finding a split between text blocks.
    • The current maximum size of a text block is 64K characters.
    • Text blocks larger than this will be arbitrarily split, and the vertex will be marked with "OVERFLOW_SPLIT".
  • ALL_WHITESPACE - Identifies that the characters spanned by the vertex are all whitespace characters (spaces, tabs, new-lines, carriage returns, etc.)
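
The overflow behavior described for OVERFLOW_SPLIT can be sketched roughly as below. This is a simplified illustration under stated assumptions: "64K" is read as 65,536 characters, the forced cut is made at a fixed offset, the final vertex is assumed to carry TEXT_BLOCK_SPLIT, and the name split_overflow is hypothetical.

```python
MAX_BLOCK_CHARS = 64 * 1024  # assumption: "64K characters" read as 65,536


def split_overflow(block, max_chars=MAX_BLOCK_CHARS):
    """Cut one text block into pieces of at most max_chars characters.

    Returns (pieces, flags), where flags[i] is the flag on the vertex that
    follows pieces[i]: "OVERFLOW_SPLIT" for a forced cut, and (by assumption)
    "TEXT_BLOCK_SPLIT" for the natural end of the block.
    """
    pieces = [block[i:i + max_chars] for i in range(0, len(block), max_chars)] or [""]
    flags = ["OVERFLOW_SPLIT"] * (len(pieces) - 1) + ["TEXT_BLOCK_SPLIT"]
    return pieces, flags


# Small limit used here only to keep the demonstration readable.
pieces, flags = split_overflow("abcdefghij", max_chars=4)
print(pieces)
print(flags)
```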

Resource Data

Description of resource.

Resource Format

The only file which is absolutely required is the entity dictionary. It is a series of JSON records, typically indexed by entity ID.

Description of entity:
Entity JSON Format
{
  "id":"Q28260",
  "tags":["{city}", "{administrative-area}", "{geography}"],
  "patterns":[
    "Lincoln", "Lincoln, Nebraska", "Lincoln, NE"
  ],
  "confidence":0.95
  
  . . . additional fields as needed go here . . . 
}


Notes

  1. Multiple entities can have the same pattern.
    1. If the pattern is matched, then it will be tagged with multiple (ambiguous) entity IDs.
  2. Additional fielded data can be added to the record
    1. As needed by downstream processes.
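
The first note, that one pattern may map to several entity IDs, can be illustrated with a small lookup table. This is a sketch only: the function build_pattern_index and the second record (ID "X-0001") are hypothetical, and keys are simply lowercased rather than fully tokenized.

```python
from collections import defaultdict

records = [
    {"id": "Q28260", "tags": ["{city}"], "patterns": ["Lincoln", "Lincoln, Nebraska"]},
    # Hypothetical second entity that shares the pattern "Lincoln".
    {"id": "X-0001", "tags": ["{person}"], "patterns": ["Lincoln", "Abraham Lincoln"]},
]


def build_pattern_index(entities):
    """Map each lowercased pattern to the IDs of all entities that list it."""
    index = defaultdict(list)
    for entity in entities:
        for pattern in entity["patterns"]:
            key = pattern.lower()
            if entity["id"] not in index[key]:
                index[key].append(entity["id"])
    return dict(index)


index = build_pattern_index(records)
print(index["lincoln"])          # ambiguous: two entity IDs
print(index["abraham lincoln"])  # unambiguous: one entity ID
```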

Fields

  • id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entities (across all dictionaries).
    • Typically this is an identifier with meaning to the larger application which is using the Language Processing Toolkit.
  • tags (required, array of string) - The list of semantic tags which will be added to the interpretation graph whenever any of the patterns are matched.
    • These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
    • Typically, multiple tags are hierarchical representations of the same intent. For example, {city} → {administrative-area} → {geography}
  • patterns (required, array of string) - A list of patterns to match in the content.
    • Patterns will be tokenized and there may be multiple variations which can match.
      • NOTE:  Currently, tokens are separated on simple white-space and punctuation, and then reduced to lowercase.
      • TODO:  This will need to be improved in the future, perhaps by specifying a pipeline to perform the tokenization and to allow for multiple variations.
  • confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
    • This is the confidence of the entity, in comparison to all of the other entities. Essentially, the likelihood that this entity will be randomly encountered.
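
The simple pattern tokenization mentioned in the NOTE above (split on white-space and punctuation, then lowercase) might look like the following. This is an approximation, not the toolkit's tokenizer: the name tokenize_pattern is hypothetical, and the regular expression treats underscores as part of a token, which may differ from the real behavior.

```python
import re


def tokenize_pattern(pattern):
    """Split a pattern on white-space and punctuation, then lowercase it."""
    # \w+ matches runs of letters, digits and underscores; any other
    # character (spaces, commas, periods, ...) acts as a separator.
    return [token.lower() for token in re.findall(r"\w+", pattern)]


for pattern in ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"]:
    print(pattern, "->", tokenize_pattern(pattern))
```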

Other, Optional Fields

  • display (optional, string) - What to show the user when browsing this entity.
  • context (optional, object) - A context vector which can help disambiguate this entity from others with the same pattern.
    • Format TBD, but probably a list of weighted words, phrases and tags.
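
A record can be checked against the required and optional fields listed above before it is loaded. The sketch below is an assumption about how such a check could be written; validate_entity is a hypothetical helper, and it does not test uniqueness of id across dictionaries.

```python
def validate_entity(record):
    """Return a list of problems found in one entity record (empty if none)."""
    problems = []
    if not isinstance(record.get("id"), str) or not record["id"]:
        problems.append("id must be a non-empty string")
    for field in ("tags", "patterns"):
        value = record.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be an array of strings")
    confidence = record.get("confidence")
    # bool is excluded explicitly because it is a subclass of int in Python.
    if confidence is not None and (
        isinstance(confidence, bool) or not isinstance(confidence, (int, float))
    ):
        problems.append("confidence must be a number")
    if "display" in record and not isinstance(record["display"], str):
        problems.append("display must be a string")
    if "context" in record and not isinstance(record["context"], dict):
        problems.append("context must be an object")
    return problems


good = {
    "id": "Q28260",
    "tags": ["{city}", "{administrative-area}", "{geography}"],
    "patterns": ["Lincoln", "Lincoln, Nebraska", "Lincoln, NE"],
    "confidence": 0.95,
}
print(validate_entity(good))
print(validate_entity({"id": "Q1"}))
```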

Supported Languages


Lang                               Language    Country
ar                                 Arabic      -
ar-AE                              Arabic      United Arab Emirates
ar-BH                              Arabic      Bahrain
ar-DZ                              Arabic      Algeria
ar-EG                              Arabic      Egypt
ar-IQ                              Arabic      Iraq
ar-JO                              Arabic      Jordan
ar-KW                              Arabic      Kuwait
ar-LB                              Arabic      Lebanon
ar-LY                              Arabic      Libya
ar-MA                              Arabic      Morocco
ar-OM                              Arabic      Oman
ar-QA                              Arabic      Qatar
ar-SA                              Arabic      Saudi Arabia
ar-SD                              Arabic      Sudan
ar-SY                              Arabic      Syria
ar-TN                              Arabic      Tunisia
ar-YE                              Arabic      Yemen
be-BY                              Belarusian  Belarus
bg-BG                              Bulgarian   Bulgaria
ca-ES                              Catalan     Spain
cs-CZ                              Czech       Czech Republic
da-DK                              Danish      Denmark
de                                 German      -
de-AT                              German      Austria
de-CH                              German      Switzerland
de-DE                              German      Germany
de-GR                              German      Greece
de-LU                              German      Luxembourg
el                                 Greek       -
el-CY                              Greek       Cyprus
el-GR                              Greek       Greece
en                                 English     -
en-AU                              English     Australia
en-CA                              English     Canada
en-GB                              English     United Kingdom
en-IE                              English     Ireland
en-IN                              English     India
en-MT                              English     Malta
en-NZ                              English     New Zealand
en-PH                              English     Philippines
en-SG                              English     Singapore
en-US                              English     United States
en-ZA                              English     South Africa
es                                 Spanish     -
es-AR                              Spanish     Argentina
es-BO                              Spanish     Bolivia
es-CL                              Spanish     Chile
es-CO                              Spanish     Colombia
es-CR                              Spanish     Costa Rica
es-CU                              Spanish     Cuba
es-DO                              Spanish     Dominican Republic
es-EC                              Spanish     Ecuador
es-ES                              Spanish     Spain
es-GT                              Spanish     Guatemala
es-HN                              Spanish     Honduras
es-MX                              Spanish     Mexico
es-NI                              Spanish     Nicaragua
es-PA                              Spanish     Panama
es-PE                              Spanish     Peru
es-PR                              Spanish     Puerto Rico
es-PY                              Spanish     Paraguay
es-SV                              Spanish     El Salvador
es-US                              Spanish     United States
es-UY                              Spanish     Uruguay
es-VE                              Spanish     Venezuela
et-EE                              Estonian    Estonia
fi-FI                              Finnish     Finland
fr                                 French      -
fr-BE                              French      Belgium
fr-CA                              French      Canada
fr-CH                              French      Switzerland
fr-FR                              French      France
fr-LU                              French      Luxembourg
ga-IE                              Irish       Ireland
he-IL                              Hebrew      Israel
hi-IN                              Hindi       India
hr-HR                              Croatian    Croatia
hu-HU                              Hungarian   Hungary
id-ID                              Indonesian  Indonesia
is-IS                              Icelandic   Iceland
it                                 Italian     -
it-CH                              Italian     Switzerland
it-IT                              Italian     Italy
ja                                 Japanese    -
ja-JP                              Japanese    Japan
ja-JP-u-ca-japanese-x-lvariant-JP  Japanese    Japan
ko-KR                              Korean      South Korea
lt-LT                              Lithuanian  Lithuania
lv-LV                              Latvian     Latvia
mk-MK                              Macedonian  Macedonia
ms-MY                              Malay       Malaysia
mt-MT                              Maltese     Malta
nl                                 Dutch       -
nl-BE                              Dutch       Belgium
nl-NL                              Dutch       Netherlands
nn-NO                              Norwegian   Norway
no-NO                              Norwegian   Norway
pl-PL                              Polish      Poland
pt                                 Portuguese  -
pt-BR                              Portuguese  Brazil
pt-PT                              Portuguese  Portugal
ro-RO                              Romanian    Romania
ru-RU                              Russian     Russia
sk-SK                              Slovak      Slovakia
sl-SI                              Slovenian   Slovenia
sq-AL                              Albanian    Albania
sr                                 Serbian     -
sr-BA                              Serbian     Bosnia and Herzegovina
sr-CS                              Serbian     Serbia and Montenegro
sr-Latn                            Serbian     -
sr-Latn-BA                         Serbian     Bosnia and Herzegovina
sr-Latn-ME                         Serbian     Montenegro
sr-Latn-RS                         Serbian     Serbia
sr-ME                              Serbian     Montenegro
sr-RS                              Serbian     Serbia
sv-SE                              Swedish     Sweden
th                                 Thai        -
th-TH                              Thai        Thailand
th-TH-u-nu-thai-x-lvariant-TH      Thai        Thailand
tr-TR                              Turkish     Turkey
uk-UA                              Ukrainian   Ukraine
vi-VN                              Vietnamese  Vietnam
zh                                 Chinese     -
zh-CN                              Chinese     China
zh-HK                              Chinese     Hong Kong
zh-SG                              Chinese     Singapore
zh-TW                              Chinese     Taiwan

