Part-of-speech (POS) tagging marks each word in a text (corpus) as corresponding to a particular part of speech, such as noun, verb, or adjective, based on both its definition and its context. This stage uses OpenNLP (https://opennlp.apache.org/) and its POS Tagger. Each token is tagged with flags, meaning that no semantic tag is created by this stage.

Operates On:  Lexical Items with TOKEN and possibly other flags as specified below.

Library: saga-parts-of-speech-stage

This stage is a Recognizer for Saga Solution, and can also be used as part of a manual pipeline or a base pipeline.

Currently, only English is supported.

Generic Configuration Parameters

  • boundaryFlags ( type=string array | optional ) - List of vertex flags that indicate the beginning and end of a text block.
    Tokens to process must be between two vertices marked with this flag (e.g., ["TEXT_BLOCK_SPLIT"]).
  • skipFlags ( type=string array | optional ) - Flags to be skipped by this stage.
    Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
  • requiredFlags ( type=string array | optional ) - Lex item flags that every token must have in order to be processed.
    Tokens need to have all of the specified flags in order to be processed.
  • atLeastOneFlag ( type=string array | optional ) - Lex item flags of which every token must have at least one in order to be processed.
    Tokens need at least one of the flags specified in this array.
  • confidenceAdjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, applied to the confidence value (applies to every pattern match).
    • 0.0 to < 1.0 decreases the confidence value
    • 1.0 leaves the confidence value unchanged
    • > 1.0 to 2.0 increases the confidence value
  • debug ( type=boolean | default=false | optional ) - Enable all debug log functionality for the stage, if any.
  • enable ( type=boolean | default=true | optional ) - Indicates whether the current stage should be considered by the Pipeline Manager.
    • Only applies to automatic pipeline building
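
The three flag filters above (skipFlags, requiredFlags, atLeastOneFlag) can be read as a per-token check. The following Python sketch shows one plausible reading of that check. The function name `should_process` is hypothetical, and two points are assumptions rather than documented behavior: that skipFlags takes precedence over the other filters, and that an omitted or empty list imposes no constraint. boundaryFlags is not modeled here.

```python
from typing import Iterable, Optional, Set


def should_process(
    token_flags: Iterable[str],
    skip_flags: Optional[Iterable[str]] = None,
    required_flags: Optional[Iterable[str]] = None,
    at_least_one_flag: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical helper: decide whether a token passes the flag filters."""
    flags: Set[str] = set(token_flags)
    skip = set(skip_flags or [])
    required = set(required_flags or [])
    any_of = set(at_least_one_flag or [])

    # skipFlags: a token carrying any of these flags is ignored
    # (assumed to take precedence over the other filters).
    if flags & skip:
        return False
    # requiredFlags: the token must carry all of these flags.
    if not required <= flags:
        return False
    # atLeastOneFlag: the token must carry at least one of these flags.
    if any_of and not (flags & any_of):
        return False
    return True
```

For instance, a token flagged [TOKEN, ORIGINAL] passes when requiredFlags is ["TOKEN"], but is rejected when requiredFlags is ["TOKEN", "ALL_DIGITS"].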

Configuration Parameters

  • prob ( type=double | default=0.7 | optional ) - Probability threshold that a predicted part of speech must reach in order to be accepted.
  • language ( type=string | default=en | optional ) - Language prefix of the part-of-speech model to use. Currently, only English is supported.
  • modelPath ( type=string | optional ) - Path to the folder where the models are stored.
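
As an illustration of how the prob threshold could interact with the flags this stage produces, here is a minimal Python sketch. The helper `pos_flags` is hypothetical, and the comparison is an assumption: the sketch accepts a tag when its probability is greater than or equal to prob, which the documentation does not spell out.

```python
from typing import List


def pos_flags(tag: str, probability: float, prob: float = 0.7) -> List[str]:
    """Hypothetical helper: flags to add to a token for a predicted tag."""
    # Assumption: a tag is accepted when its probability meets the threshold.
    if probability < prob:
        return []
    # Accepted tags yield POS_TOKEN plus a POS_<tag> flag, e.g. POS_VB.
    return ["POS_TOKEN", "POS_" + tag]
```

With the default threshold, `pos_flags("VB", 0.93)` returns `["POS_TOKEN", "POS_VB"]`, while `pos_flags("VB", 0.4)` returns an empty list.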


Example Configuration
{
	"prob": 0.7,
	"language": "en",
	"modelPath": null
}
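
A configuration like the one above can omit any optional parameter. The Python sketch below shows one way a caller might fill in the documented defaults before handing the configuration to the stage. The helper `load_stage_config` is hypothetical, and the range check on prob is an assumption based on it being a probability.

```python
import json
from typing import Any, Dict

# Defaults taken from the parameter list; modelPath has no documented default.
DEFAULTS: Dict[str, Any] = {"prob": 0.7, "language": "en", "modelPath": None}


def load_stage_config(raw: str) -> Dict[str, Any]:
    """Hypothetical helper: parse a JSON config and fill in documented defaults."""
    config = {**DEFAULTS, **json.loads(raw)}
    # Assumption: prob is a probability and therefore lies in [0, 1].
    if not 0.0 <= config["prob"] <= 1.0:
        raise ValueError("prob must be between 0.0 and 1.0")
    return config
```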


Example Output


V-----------------------------------[Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .]-----------------------------------V 
^-[Pierre]-V-[Vinken]-V-[,]-V-[61]-V-[years]-V-[old]-V-[,]-V-[will]-V-[join]-V-[the]-V-[board]-V-[as]-V-[a]-V-[nonexecutive]-V-[director]-V-[Nov.]-V-[29]-V-[.]-^ 

Item [as] - [TOKEN,ORIGINAL,POS_TOKEN,POS_IN]
Item [years] - [TOKEN,ORIGINAL,POS_TOKEN,POS_NNS]
Item [old] - [TOKEN,ORIGINAL,POS_TOKEN,POS_JJ]
Item [director] - [TOKEN,ORIGINAL,POS_TOKEN,POS_NN]
Item [Pierre] - [TOKEN,ORIGINAL,POS_NNP,POS_TOKEN]
Item [the] - [TOKEN,ORIGINAL,POS_TOKEN,POS_DT]
Item [Nov.] - [TOKEN,ORIGINAL,POS_NNP,POS_TOKEN,HAS_PUNCTUATION]
Item [61] - [ALL_DIGITS,TOKEN,ORIGINAL,POS_TOKEN,POS_CD]
Item [29] - [ALL_DIGITS,TOKEN,ORIGINAL,POS_TOKEN,POS_CD]
Item [will] - [TOKEN,ORIGINAL,POS_TOKEN,POS_MD]
Item [,] - [TOKEN,ORIGINAL,ALL_PUNCTUATION,POS_TOKEN,POS_,]
Item [,] - [TOKEN,ORIGINAL,ALL_PUNCTUATION,POS_TOKEN,POS_,]
Item [join] - [TOKEN,ORIGINAL,POS_TOKEN,POS_VB]
Item [board] - [TOKEN,ORIGINAL,POS_TOKEN,POS_NN]
Item [.] - [TOKEN,ORIGINAL,ALL_PUNCTUATION,POS_TOKEN,POS_.]
Item [nonexecutive] - [TOKEN,ORIGINAL,POS_TOKEN,POS_JJ]
Item [a] - [TOKEN,ORIGINAL,POS_TOKEN,POS_DT]
Item [Vinken] - [TOKEN,ORIGINAL,POS_NNP,POS_TOKEN]


Output Flags

Lex-Item Flags:

  • TOKEN - All tokens produced are tagged as TOKEN 
  • POS_TOKEN - Marks the token as having a recognized part of speech
  • POS_??? - Flags all TOKENs where a part of speech was recognized. 

    Notice the '???' at the end of the flag. It is replaced by the acronym of the part of speech identified.

    For example, if a base form verb is detected, the acronym is VB, and the Flag will be "POS_VB"

Vertex Flags:

No vertices are created in this stage

Flag - Definition

  • POS_CC - Coordinating conjunction
  • POS_CD - Cardinal number
  • POS_DT - Determiner
  • POS_EX - Existential there
  • POS_FW - Foreign word
  • POS_IN - Preposition or subordinating conjunction
  • POS_JJ - Adjective
  • POS_JJR - Adjective, comparative
  • POS_JJS - Adjective, superlative
  • POS_LS - List item marker
  • POS_MD - Modal
  • POS_NN - Noun, singular or mass
  • POS_NNS - Noun, plural
  • POS_NNP - Proper noun, singular
  • POS_NNPS - Proper noun, plural
  • POS_PDT - Predeterminer
  • POS_POS - Possessive ending
  • POS_PRP - Personal pronoun
  • POS_PRP$ - Possessive pronoun
  • POS_RB - Adverb
  • POS_RBR - Adverb, comparative
  • POS_RBS - Adverb, superlative
  • POS_RP - Particle
  • POS_SYM - Symbol
  • POS_TO - to
  • POS_UH - Interjection
  • POS_VB - Verb, base form
  • POS_VBD - Verb, past tense
  • POS_VBG - Verb, gerund or present participle
  • POS_VBN - Verb, past participle
  • POS_VBP - Verb, non-3rd person singular present
  • POS_VBZ - Verb, 3rd person singular present
  • POS_WDT - Wh-determiner
  • POS_WP - Wh-pronoun
  • POS_WP$ - Possessive wh-pronoun
  • POS_WRB - Wh-adverb
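
Because the flags listed above share prefixes within a word class (for example, POS_NN, POS_NNS, POS_NNP, and POS_NNPS are all nouns), downstream code can group them coarsely by prefix. The Python sketch below is an illustration only: `coarse_category` and the four category names are hypothetical and are not part of the stage.

```python
from typing import Iterable, Optional

# Prefixes derived from the flag list: NN* nouns, VB* verbs,
# JJ* adjectives, RB* adverbs. Other flags are left uncategorized.
COARSE_PREFIXES = {
    "POS_NN": "noun",
    "POS_VB": "verb",
    "POS_JJ": "adjective",
    "POS_RB": "adverb",
}


def coarse_category(flags: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: map a token's flags to a coarse word class."""
    for flag in flags:
        for prefix, category in COARSE_PREFIXES.items():
            if flag.startswith(prefix):
                return category
    return None
```

A token flagged [TOKEN, ORIGINAL, POS_TOKEN, POS_NNS] maps to "noun", while one flagged with POS_DT maps to no category.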