Stop Words Stage

This stage flags tokens that match stop words. Flagged tokens are skipped by subsequent stages if the configuration indicates so.

Operates On: Lexical Items with TOKEN and possibly other flags as specified below.

This stage can only be used as part of a manual pipeline or a base pipeline.

Generic Configuration Parameters

boundaryFlags ( type=string array | optional ) - List of vertex flags that indicate the beginning and end of a text block.
Tokens to process must be between two vertices marked with one of these flags (e.g., ["TEXT_BLOCK_SPLIT"]).
skipFlags ( type=string array | optional ) - Flags to be skipped by this stage.
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
requiredFlags ( type=string array | optional ) - Lex-item flags required on every token to be processed.
Tokens must have all of the specified flags in order to be processed.
atLeastOneFlag ( type=string array | optional ) - Lex-item flags of which every token to be processed needs at least one.
Tokens must have at least one of the flags specified in this array.
confidenceAdjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
- 0.0 to < 1.0 decreases confidence value
- 1.0 confidence value remains the same
- > 1.0 to 2.0 increases confidence value
debug ( type=boolean | default=false | optional ) - Enable all debug log functionality for the stage, if any.
enable ( type=boolean | default=true | optional ) - Indicates whether the current stage should be considered by the Pipeline Manager.
- Only applies for automatic pipeline building
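
The three flag filters above can be read as a single eligibility test per token. The following Python sketch illustrates that reading only; it is not the stage's actual implementation. The function name is_eligible is hypothetical, and the order in which the filters are checked is an assumption.

```python
def is_eligible(token_flags, skip_flags=(), required_flags=(), at_least_one_flag=()):
    """Return True if a token with the given flags would be processed.

    Hypothetical helper: mirrors the documented meaning of skipFlags,
    requiredFlags and atLeastOneFlag, not the engine's real code.
    """
    flags = set(token_flags)
    # skipFlags: any match means the token is ignored.
    if flags & set(skip_flags):
        return False
    # requiredFlags: the token must carry all of them.
    if not set(required_flags) <= flags:
        return False
    # atLeastOneFlag: if configured, the token must carry at least one.
    if at_least_one_flag and not (flags & set(at_least_one_flag)):
        return False
    return True
```

With no filters configured, every token is eligible in this sketch.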

Configuration Parameters

caseInsensitive ( type=boolean | default=true | optional ) - If true, all stop words and tokens will be processed as case insensitive.
stopWords ( type=string | optional ) - Either the resource containing the list of stop words or the list of stop words itself.
- See below for the format. If neither a resource nor a list is provided, the stage uses the default list of stop words.

Default list of stop words

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

Config as Resource

"caseInsensitive" : true,
"stopWords" : "words-provider:stop_words"

Config as List

"caseInsensitive" : true,
"stopWords" : ["a", "about", "above", "after", "again", "all",
  "am", "an", "and", "the", "i", "who", ...]

Example Output

V--------------[A test to be skipped]--------------V  
^--[A]--V--[test]--V--[to]--V--[be]--V--[skipped]--^  
^--[a]--^  


Item [A] - [TOKEN, STOP_WORD]
Item [to] - [TOKEN, STOP_WORD]
Item [be] - [TOKEN, STOP_WORD]
Item [a] - [TOKEN, STOP_WORD]
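
The flagging shown in this example can be approximated with a short Python sketch. It is a simplified illustration, assuming tokens are plain strings and flags are lists of names; the function flag_stop_words and the constant DEFAULT_STOP_WORDS are hypothetical names, and the real stage works on the interpretation graph rather than on a flat list.

```python
# Default stop words as listed in this document.
DEFAULT_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def flag_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS, case_insensitive=True):
    """Return (text, flags) pairs, adding STOP_WORD to matching tokens."""
    if case_insensitive:
        lookup = {word.lower() for word in stop_words}
    else:
        lookup = set(stop_words)

    result = []
    for text in tokens:
        key = text.lower() if case_insensitive else text
        flags = ["TOKEN"]
        if key in lookup:
            flags.append("STOP_WORD")
        result.append((text, flags))
    return result


if __name__ == "__main__":
    # Tokens from the example above, including the lowercase variant [a].
    for text, flags in flag_stop_words(["A", "test", "to", "be", "skipped", "a"]):
        print(f"Item [{text}] - {flags}")
```

In this sketch, setting case_insensitive to False leaves [A] unflagged, because only the lowercase form appears in the default list.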

Output Flags

Lex-Item Flags

STOP_WORD - All matched stop words will be marked as STOP_WORD.

Vertex Flags

No vertices are created in this stage.

Resource Data

The resource data is a JSON file with an array of words in a field named stopWords.

"stopWords": ["a", "about", "above", "after", "again", "all", "am", "an", "and", "the", "i", "who", ...]

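As a rough illustration of this format, the Python sketch below parses such a resource and returns its words. It assumes the file is a JSON object that wraps the stopWords field in braces, which the fragment above does not show; load_stop_words is a hypothetical helper, not part of the product.

```python
import json


def load_stop_words(resource_text):
    """Parse resource JSON and return the list under the stopWords field.

    Assumes the document is a JSON object such as {"stopWords": [...]}.
    """
    data = json.loads(resource_text)
    words = data.get("stopWords", [])
    if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
        raise ValueError("stopWords must be an array of strings")
    return words


if __name__ == "__main__":
    sample = '{"stopWords": ["a", "about", "above", "after"]}'
    print(load_stop_words(sample))
```
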