Sentence Filter Stage

This stage flags vertices with “SKIP_SENTENCE”. The flagged vertex marks the start of the sentence, so a later stage can use the flag to ignore the complete sentence. The conditions evaluated by the processor are:

Sentence length, measured as the number of tokens (not vertices).
A list of tags that act as exceptions to the length check: if one of these tags is found within the sentence, the token count is ignored and the sentence is not flagged (allow listing).
A list of tags that cause the sentence to be flagged whenever one of them is found in it (deny listing).

Deny listing a tag always takes precedence over the other conditions, so any sentence containing a deny listed tag is always flagged as “SKIP_SENTENCE”. Allow listed tags take precedence over the token limit restriction. The token limit restriction is applied last, only when neither list matches.
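
The precedence described above can be sketched in a few lines of Python. This is only an illustration of the ordering (deny list, then allow list, then token limit), not the stage's actual implementation; the function name and its arguments are hypothetical.

```python
from typing import Iterable, Set


def should_flag(token_count: int,
                tags: Iterable[str],
                deny_tags: Set[str],
                allow_tags: Set[str],
                token_limit: int = 3) -> bool:
    """Return True if the sentence start vertex would get SKIP_SENTENCE.

    Hypothetical helper: it only mirrors the documented precedence.
    """
    found = set(tags)
    # 1. Deny listed tags always win.
    if found & deny_tags:
        return True
    # 2. Allow listed tags override the token limit.
    if found & allow_tags:
        return False
    # 3. Otherwise the (inclusive) token limit decides.
    return token_count <= token_limit
```

A sentence that contains both an allow listed and a deny listed tag is flagged in this sketch, because the deny list is checked first.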

Operates On: Lexical Items with VERTEX and possibly other flags as specified below.

This stage can only be used as part of a manual pipeline or a base pipeline.

At this moment only the Python Model Recognizer Stage is capable of using this flag.

Generic Configuration Parameters

boundaryFlags ( type=string array | optional ) - List of vertex flags that indicate the beginning and end of a text block.
Tokens to process must be between two vertices marked with this flag (e.g., ["TEXT_BLOCK_SPLIT"]).
skipFlags ( type=string array | optional ) - Flags to be skipped by this stage.
Tokens marked with this flag will be ignored by this stage, and no processing will be performed.
requiredFlags ( type=string array | optional ) - Lex item flags required on every token to be processed.
Tokens need to have all of the specified flags in order to be processed.
atLeastOneFlag ( type=string array | optional ) - Lex item flags of which every token to be processed needs at least one.
Tokens need at least one of the flags specified in this array.
confidenceAdjustment ( type=double | default=1 | required ) - Adjustment factor, from 0.0 to 2.0, to apply to the confidence value (applies to every pattern match).
- 0.0 to < 1.0 decreases confidence value
- 1.0 confidence value remains the same
- > 1.0 to 2.0 increases confidence value
debug ( type=boolean | default=false | optional ) - Enable all debug log functionality for the stage, if any.
enable ( type=boolean | default=true | optional ) - Indicates whether the current stage should be considered by the Pipeline Manager.
- Only applies for automatic pipeline building
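
As a rough illustration of how skipFlags, requiredFlags and atLeastOneFlag combine, the following Python sketch decides whether a single token is eligible for processing. The function is hypothetical and assumes that an empty or missing list places no restriction; the real stage may treat edge cases differently.

```python
from typing import Iterable


def is_eligible(token_flags: Iterable[str],
                skip_flags: Iterable[str] = (),
                required_flags: Iterable[str] = (),
                at_least_one_flag: Iterable[str] = ()) -> bool:
    """Hypothetical check mirroring the three generic flag filters."""
    flags = set(token_flags)
    any_of = set(at_least_one_flag)
    # skipFlags: any match means the token is ignored.
    if flags & set(skip_flags):
        return False
    # requiredFlags: every listed flag must be present.
    if not set(required_flags) <= flags:
        return False
    # atLeastOneFlag: if configured, at least one listed flag must be present.
    if any_of and not (flags & any_of):
        return False
    return True
```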

Configuration Parameters

removeSimpleSentence ( type=boolean | default=true | optional ) - Enables marking of the sentence by length limit.
- When this parameter is enabled, the minTokensOnSentence parameter is taken into account.
minTokensOnSentence ( type=integer | default=3 | optional ) - Sentences with this number of tokens or fewer are flagged.
- This parameter is inclusive, meaning that sentences up to 3 (by default) tokens long will be flagged.
keepSemanticTags ( type=boolean | default=false | optional ) - Enables the list of tag exceptions to the length limit.
- If the sentence length is within the minimum tokens parameter value but the sentence contains a tag (flagged as SEMANTIC_TAG) that appears in the list of "keep" tags, the sentence vertex is not flagged.
tagsList ( type=string | optional ) - List of tags (comma separated) used as exceptions to the flagging of the vertex.
- At least one of the tags must be present in the sentence for it not to be flagged.
markTagsList ( type=string | optional ) - List of tags used to mark the sentence (flag the vertex).
- At least one of the tags must be present in the sentence for it to be flagged.

"removeSimpleSentence": true,
"minTokensOnSentence": 3,
"keepSemanticTags": true,
"tagsList": ["works"],
"markTagsList": ["filtered"]

Example Output

V----------------------[This is short.  This is a longer sentence.  This {works}. This is a {filtered}]-----------------------V
^-[This]-V-[is]-V-[short]-V-[This]-V-[is]-V-[a]-V-[longer]-V-[sentence]-V-[This]-V-{works}-V-[This]-V-[is]-V-[a]-V-{filtered}-^
1                         2                                             3                  4

Vertex 1: SKIP_SENTENCE (3 or fewer tokens)
Vertex 2: (more than 3 tokens, not flagged)
Vertex 3: (tag {works} found, not flagged)
Vertex 4: SKIP_SENTENCE (tag {filtered} found, flagged)
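
The example can be reproduced with a small Python sketch that applies the sample configuration to the four sentences. It is an approximation under stated assumptions, not the stage's code: sentences are given as pre-split token lists, semantic tags are written as {tag}, tagsList and markTagsList are taken as lists as in the sample configuration, and flag_sentence is a hypothetical helper.

```python
config = {
    "removeSimpleSentence": True,
    "minTokensOnSentence": 3,
    "keepSemanticTags": True,
    "tagsList": ["works"],
    "markTagsList": ["filtered"],
}


def flag_sentence(tokens, cfg):
    """Return "SKIP_SENTENCE" or None for one sentence (hypothetical helper)."""
    # Assumption: a token written as {name} stands for a semantic tag.
    tags = {t[1:-1] for t in tokens if t.startswith("{") and t.endswith("}")}
    # Deny list first.
    if tags & set(cfg.get("markTagsList", [])):
        return "SKIP_SENTENCE"
    # Allow list next, only when keepSemanticTags is enabled.
    if cfg.get("keepSemanticTags", False) and tags & set(cfg.get("tagsList", [])):
        return None
    # Length limit last, only when removeSimpleSentence is enabled.
    if cfg.get("removeSimpleSentence", True) and \
            len(tokens) <= cfg.get("minTokensOnSentence", 3):
        return "SKIP_SENTENCE"
    return None


sentences = [
    ["This", "is", "short"],
    ["This", "is", "a", "longer", "sentence"],
    ["This", "{works}"],
    ["This", "is", "a", "{filtered}"],
]
results = [flag_sentence(s, config) for s in sentences]
```

In this sketch only the first and last sentences receive SKIP_SENTENCE, matching the flags listed for the example.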

Output Flags

Lex-Item Flags:

ALL_DIGITS - All of the characters in the token are digits (0-9)
HAS_DIGIT - Tokens produced with at least one digit character are tagged as HAS_DIGIT
HAS_PUNCTUATION - Tokens produced with at least one punctuation character are tagged as HAS_PUNCTUATION. (ALL_PUNCTUATION will not be tagged as HAS_PUNCTUATION).
ALL_PUNCTUATION - Tokens processed or produced composed only of punctuation characters are tagged as ALL_PUNCTUATION.
SEMANTIC_TAG - Identifies all lexical items which are semantic tags.
SKIP_SENTENCE - Identifies all lexical items which should be skipped.

Vertex Flags:

No vertices are created in this stage. The flags below are set on existing vertices.

SKIP_SENTENCE - Identifies the vertex as the start of a sentence that should be skipped.
ALL_PUNCTUATION - Identifies the vertex as all token
- The default flag if no other flag is present.
