Excerpt
Looks up matches to regular expressions in a dictionary across multiple tokens and then tags each match with one or more semantic tags as an alternative representation. For simple regular expressions, where a match only needs to occur against a single token, the Simple Regex Stage is recommended.

Operates On: Lexical Items with TOKEN and possibly other flags as specified below.

Saga_is_recognizer

Note
All possibilities are tagged (including overlaps and sub-patterns) with the expectation that later disambiguation stages will choose which tags are the correct interpretation.

Warning

This stage requires a lot of processing time. Please follow these recommendations:

Keep the number of regex patterns to a minimum.
Prefer non-greedy (lazy) quantifiers in regex patterns.
Set maxLength to the bare minimum necessary for the expected matches.
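
To illustrate the non-greedy recommendation, here is a minimal Python sketch. The sample text and patterns are made up for illustration, and the stage itself may use a different regex engine, although lazy quantifiers behave the same way in most engines. A greedy quantifier consumes as much text as it can and then backtracks, while the lazy form stops at the first possible end:

```python
import re

text = 'say "hello" and "bye"'

# Greedy: .+ runs to the last quote, swallowing both quoted strings.
greedy = re.findall(r'"(.+)"', text)

# Non-greedy: .+? stops at the nearest closing quote.
lazy = re.findall(r'"(.+?)"', text)

print(greedy)  # ['hello" and "bye']
print(lazy)    # ['hello', 'bye']
```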

Include Page

	Generic Configuration Parameters

Configuration Parameters

patterns
- The resource that contains the pattern database. See "Pattern (Regex) Dictionary Format" below for the format.
maxLength (integer, default 25)
- The maximum length, in characters, of the text tested against each regex.
- For each token, the stage grows the candidate text by adding tokens before and after it until a match is found or the maxLength limit is reached.
caseInsensitive (boolean, default true)
- If true, all regexes are processed as case insensitive.
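
The interaction between maxLength and caseInsensitive can be sketched roughly as follows. This is not the stage's actual implementation: the function names `token_spans` and `match_spans` are hypothetical, tokens are assumed to be whitespace-separated, and the sketch simply enumerates every token-aligned window that fits within the limit rather than reproducing the stage's real expansion order.

```python
import re

def token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def match_spans(text, pattern, max_length=25, case_insensitive=True):
    """Return every token-aligned substring of at most max_length
    characters that the pattern matches in full."""
    flags = re.IGNORECASE if case_insensitive else 0
    regex = re.compile(pattern, flags)
    spans = token_spans(text)
    hits = []
    for i, (start, _) in enumerate(spans):
        for _, end in spans[i:]:
            if end - start > max_length:
                break  # window exceeds the limit; stop growing it
            candidate = text[start:end]
            if regex.fullmatch(candidate):
                hits.append(candidate)
    return hits

text = "What's your name 12 @#$ 25 63.3"
print(match_spans(text, r"what's your name"))                 # ["What's your name"]
print(match_spans(text, r"what's your name", max_length=10))  # []
print(match_spans(text, r"[0-9]+"))                           # ['12', '25']
```

In this sketch, a smaller limit means fewer candidate windows are tested per token, which is why keeping maxLength as low as the expected matches allow reduces processing time.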

Code Block

language	js
theme	Eclipse
title	Example Configuration

{
 "type":"RegexPattern",
 "patterns":"regex-provider:patterns",
 "maxLength": 25,
 "caseInsensitive": true
}

Example Output

In the following example, "What's your name" is in the dictionary as a regex for self-name, and there are also regexes for numbers, "[0-9]+" and "[0-9]+\\.[0-9]+":

Code Block

language	text
theme	FadeToGrey

 V--------------------------------------[What's your name 12 @#$ 25 63.3]---------------------------------------V  
  ^-----[What's]-----V--[your]--V--[name]--V-----[12]-----V--[@#$]--V-----[25]-----V-----------[63.3]------------^  
  ^--[What]--V--[s]--^                     ^--[{number}]--^         ^--[{number}]--^-----[63]-----V-----[3]------^  
  ^-----[what's]-----^                                                             ^---------[{number}]----------^  
  ^--[what]--^                                                                     ^--[{number}]--^--[{number}]--^  
  ^-------------[{self-name}]--------------^

Output Flags

Lex-Item Flags

SEMANTIC_TAG - Identifies all lexical items that are semantic tags.
PROCESSED - Placed on all tokens composing the semantic tag.

Vertex Flags:

Info
No vertices are created in this stage

Resource Data

The regex pattern stage requires a "pattern dictionary" (a string-to-JSON map), which is a list of JSON records indexed by entity ID. In addition, there may also be a pattern map and a token index.

Pattern (Regex) Dictionary Format

The only required file is the pattern dictionary. It is a series of JSON records, typically indexed by entity ID.

Each JSON record represents an entity. The format is as follows:

Code Block

language	js
theme	Eclipse
title	Entity JSON Format

{
    "_id" : "ca84",
    "tags" : [ 
        "number"
    ],
    "patterns" : [ 
        "[0-9]+", 
        "[0-9]+\\.[0-9]+"
    ],
    "confidence" : 0.95
  . . . additional fields as needed go here . . . 
}

Notes

A single entry can contain multiple patterns.
Additional fields can be added to the record as needed by downstream processes.
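
One detail worth noting about the record format is the doubled backslash in the second pattern: because the record is JSON, a regex backslash has to be escaped. The following Python sketch parses a record shaped like the one in the Entity JSON Format example (without the placeholder line for additional fields) and shows the pattern that results after decoding. Using Python's `json` and `re` modules here is only for illustration; it is not how the stage itself loads its resources.

```python
import json
import re

record_json = r'''
{
    "_id": "ca84",
    "tags": ["number"],
    "patterns": ["[0-9]+", "[0-9]+\\.[0-9]+"],
    "confidence": 0.95
}
'''

record = json.loads(record_json)

# JSON decoding turns the escaped "\\." into the regex "\." (a literal dot).
decimal_pattern = record["patterns"][1]
print(decimal_pattern)  # [0-9]+\.[0-9]+

print(bool(re.fullmatch(decimal_pattern, "63.3")))  # True
print(bool(re.fullmatch(decimal_pattern, "63x3")))  # False
```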

Fields

id (required)
- Identifies the entity by unique ID. This identifier must be unique across all entries (across all dictionaries).
- Typically, this identifier has meaning to the larger application that is using the Language Processing Toolkit.
tags (string array, required)
- The list of semantic tags to add to the interpretation graph whenever any of the patterns are matched.
- These will all be added to the interpretation graph with the SEMANTIC_TAG flag.
patterns (string array, required)
- A list of patterns to match in the content.
splitMatch (boolean, default false)
- Indicates whether a partial match creates a regex tag even when a full match is not found.
confidence (double)
- Specifies the confidence level of the entity, independent of any patterns matched.
- This is the confidence of the entry in comparison to all of the other entries: essentially, the likelihood that this entry will be randomly encountered.
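
As a rough illustration of the field rules above, the following Python sketch checks an entry for its required fields and fills in the documented default for splitMatch. The helper name `normalize_entry` is hypothetical and not part of the toolkit. The sketch also assumes the ID key is spelled "_id", as in the Entity JSON Format example, even though the field list names it "id".

```python
REQUIRED_FIELDS = ("_id", "tags", "patterns")  # "_id" spelling is an assumption

def normalize_entry(entry):
    """Check required fields and fill in documented defaults."""
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"entry is missing required fields: {missing}")
    for name in ("tags", "patterns"):
        values = entry[name]
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"'{name}' must be an array of strings")
    normalized = dict(entry)  # keep any additional fields untouched
    normalized.setdefault("splitMatch", False)  # documented default
    return normalized

entry = normalize_entry({
    "_id": "ca84",
    "tags": ["number"],
    "patterns": ["[0-9]+"],
    "confidence": 0.95,
})
print(entry["splitMatch"])  # False
```

Running a check like this before loading a dictionary can surface malformed entries early, though how the stage itself reacts to a missing field is not described here.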

Other, Optional Fields

display
- What to show the user when browsing the entity.
context
- A context vector that can help disambiguate the entity from others with the same pattern.
- Format TBD, but probably a list of weighted words, phrases and tags.
