Looks up matches to regular expressions in a dictionary across multiple tokens and then tags each match with one or more semantic tags as alternative representations. For a simple regular expression where a match only needs to occur against a single token, the Simple Regex Stage is recommended.

Operates On: Lexical Items with TOKEN flag and possibly other flags as specified below.

Note
All possibilities are tagged (including overlaps and sub-patterns), with the expectation that later disambiguation stages will choose which tags are the correct interpretation.

Warning

This stage requires a lot of processing time. Please follow these recommendations:

Keep the number of regex patterns to a minimum.
Use non-greedy regex quantifiers where possible.
Set the maxLength to the bare minimum necessary for the expected matches.
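
The non-greedy recommendation above can be illustrated with a minimal sketch. Python's re module is used here only for illustration; the regex engine used by the stage may differ in details, and the sample text is made up.

```python
import re

# Illustrative input, not taken from any real pattern dictionary.
text = '"alpha" and "beta"'

# Greedy: .* runs to the last quote, so a single overly long match is returned.
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the first closing quote, giving two short matches.
lazy = re.findall(r'".*?"', text)

print(greedy)
print(lazy)
```

A greedy quantifier first consumes as much text as it can and then backtracks, which tends to cost more on long inputs and can swallow neighbouring tokens into one match.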

Generic Configuration Parameters

boundaryFlags (string array, optional) - The tokens to process must lie between two vertices marked with these flags (e.g. ["TEXT_BLOCK_SPLIT"]).
skipFlags (string array, optional) - Flags to be skipped by this stage.
- Tokens marked with these flags are ignored by this stage, and no processing is performed on them.
requiredFlags (string array, optional) - Tokens must have all of the specified flags in order to be processed.
debug (boolean, optional) - Enables all debug log functionality of the stage, if any.

Configuration Parameters

patterns (string, required) - The resource that contains the pattern database.
- See below for the format.
maxLength (integer, optional) - The maximum length of text to test against the regex patterns. The default is 25 characters.
- For each token, the stage grows the candidate text by adding tokens before and after it until a match is found or the maxLength limit is reached.
caseInsensitive (boolean, optional) - If true, all regex patterns are processed as case insensitive. The default is true.
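
The effect of maxLength on processing cost can be sketched as follows. This is one simple reading of the token expansion, not the stage's actual implementation: the function name candidate_windows is hypothetical, tokens are assumed to be joined by single spaces, and every run of adjacent tokens that fits within the limit is treated as a candidate to test.

```python
def candidate_windows(tokens, max_length):
    """Return every run of adjacent tokens whose space-joined text fits max_length."""
    windows = []
    for start in range(len(tokens)):
        for end in range(start, len(tokens)):
            text = " ".join(tokens[start:end + 1])
            if len(text) > max_length:
                break  # adding more tokens only makes the text longer
            windows.append(text)
    return windows


tokens = ["What's", "your", "name", "12", "@#$", "25", "63.3"]

# Each candidate window is tested against every pattern, so the number of
# windows is a rough proxy for the number of regex evaluations per pattern.
for limit in (6, 25):
    print(limit, len(candidate_windows(tokens, limit)))
```

Under these assumptions a larger maxLength produces more candidate windows for the same input, which is why the limit should stay as small as the expected matches allow.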

Example Configuration

{
  "type": "RegexPattern",
  "patterns": "regex-provider:patterns",
  "maxLength": 25,
  "caseInsensitive": true
}

Example Output

In the following example, "What's your name" is in the dictionary as a regex for self-name, and there are also two regexes for numbers, "[0-9]+" and "[0-9]+\\.[0-9]+":

 V--------------------------------------[What's your name 12 @#$ 25 63.3]---------------------------------------V  
  ^-----[What's]-----V--[your]--V--[name]--V-----[12]-----V--[@#$]--V-----[25]-----V-----------[63.3]------------^  
  ^--[What]--V--[s]--^                     ^--[{number}]--^         ^--[{number}]--^-----[63]-----V-----[3]------^  
  ^-----[what's]-----^                                                             ^---------[{number}]----------^  
  ^--[what]--^                                                                     ^--[{number}]--^--[{number}]--^  
  ^-------------[{self-name}]--------------^                                       ^---------[{number}]----------^  
  ^-------------[{self-name}]--------------^  
  ^-------------[{self-name}]--------------^  
  ^-------------[{self-name}]--------------^
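
The tagging shown in the graph above can be approximated with a short sketch. Everything here is an assumption made for illustration: the tokenizer, the in-memory entries list, and the rule that a pattern must match a whole window of adjacent tokens. It does not reproduce the stage's internal graph; it only shows that overlapping spans such as "63", "3" and "63.3" are all tagged.

```python
import re

text = "What's your name 12 @#$ 25 63.3"
max_length = 25

# Hypothetical pattern entries mirroring the example above.
entries = [
    {"tag": "self-name", "patterns": ["what's your name"]},
    {"tag": "number", "patterns": ["[0-9]+", "[0-9]+\\.[0-9]+"]},
]

# Assumed tokenizer: runs of digits, or runs of other non-space, non-dot characters.
tokens = [m.span() for m in re.finditer(r"[0-9]+|[^\s0-9.]+", text)]

tagged = []
for i, (start, _) in enumerate(tokens):
    for _, end in tokens[i:]:
        window = text[start:end]
        if len(window) > max_length:
            break  # longer windows from this start cannot fit either
        for entry in entries:
            # Case-insensitive whole-window match, as with caseInsensitive = true.
            if any(re.fullmatch(p, window, re.IGNORECASE) for p in entry["patterns"]):
                tagged.append((window, "{" + entry["tag"] + "}"))

for window, tag in tagged:
    print(f"{window!r} -> {tag}")
```

No attempt is made to pick a winner among the overlapping tags, in line with the expectation that later disambiguation stages choose the correct interpretation.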

Output Flags

Lex-Item Flags:

SEMANTIC_TAG - Identifies all lexical items that are semantic tags.

Vertex Flags:

Info
No vertices are created in this stage

Resource Data

The regex pattern stage requires a "pattern dictionary" (a string-to-JSON map), which is a list of JSON records indexed by entity ID. In addition, there may also be a pattern map and a token index.

Pattern (Regex) Dictionary Format

The only file that is absolutely required is the pattern dictionary. It is a series of JSON records, typically indexed by entity ID.

...

Multiple patterns can have the same entry.
Additional fielded data can be added to the record.
- As needed by downstream processes.

Fields

id (required, string) - Identifies the entity by unique ID. This identifier must be unique across all entries (across all dictionaries).
- Typically, this is an identifier that has meaning to the larger application that is using Saga.
tags (required, array of string) - The list of semantic tags to add to the interpretation graph whenever any of the patterns are matched.
- These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
patterns (required, array of string) - A list of patterns to match in the content.
splitMatch (optional, boolean) - Indicates whether a partial match will create a regex tag even if a full match was not met. The default is false.
confidence (optional, float) - Specifies the confidence level of the entity, independent of any patterns matched.
- This is the confidence of the entry in comparison to all of the other entries. Essentially, the likelihood that this entry will be randomly encountered.

Other, Optional Fields

display (optional, string) - What to show the user when browsing the entity.
context (optional, object) - A context vector that can help disambiguate the entity from others with the same pattern.
- Format TBD, but probably a list of weighted words, phrases and tags.
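
The fields described above can be illustrated with a single hypothetical record and a small check of the required fields. The id, tags, patterns and other values below are made up, and the helper missing_or_invalid is not part of Saga. The exact file layout of the dictionary is not shown in this section, so the sketch covers one record only.

```python
import json

# Hypothetical record; every value here is illustrative only.
record_json = """
{
  "id": "Q-001",
  "tags": ["self-name"],
  "patterns": ["what's your name", "who are you"],
  "splitMatch": false,
  "confidence": 0.9,
  "display": "Self name question"
}
"""

# Required fields and the JSON types the field list above gives for them.
REQUIRED = {"id": str, "tags": list, "patterns": list}


def missing_or_invalid(record):
    """Return the names of required fields that are absent or of the wrong type."""
    return [name for name, kind in REQUIRED.items()
            if not isinstance(record.get(name), kind)]


record = json.loads(record_json)
print(missing_or_invalid(record))
print(missing_or_invalid({"id": "Q-002"}))
```

A check of this kind, run before loading a dictionary, may catch records that omit tags or patterns.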

...
