Dictionaries are resources used to define entities (usually based on pattern recognition).  The Dictionary Tagger Stage and the Simple Regex Stage are examples of stages that use dictionaries as resources.  These resources are loaded into Elasticsearch as part of the Saga session.

Structure

A comprehensive definition of the structure of dictionary resources can be seen here.  Dictionary resources for different recognizers may differ from one another; please check the documentation of each recognizer for the correct specification.


Import a new dictionary into Elasticsearch

This section shows an example of how to import a dictionary directly into Elasticsearch for the Entity Recognizer (Dictionary Tagger Stage), using Python and an Elasticsearch client.

Code Block
languagejs
themeEclipse
titleTag with Dictionary Tagger Stage
linenumberstrue
{
	"name": "tagname",
	"assigned": {
		"DictionaryTaggerStage": {
			"stage": "DictionaryTaggerStage",
			"display": "Entity",
			"config": {
				"dictionary": test_entities,
				"skipFlags": [],
				"boundaryFlags": [
					"TEXT_BLOCK_SPLIT"
				],
				"requiredFlags": [
					"TOKEN"
				],
				"atLeastOneFlag": [
					"ALL_LOWER_CASE"
				],
				"debug": False
			},
			"enable": True,
			"baseline-pipeline": "baseline-pipeline"
		}

	},
	"updatedAt": <time>,
	"createdAt": <time>
}
  • On line 2, notice the "name" key.  It defines the "tag" name and is the main identifier for the entity.
  • The "assigned" key (line 3) specifies the stages linked to the tag.
  • In this example a Dictionary Tagger Stage is used (line 5).
  • The stage configuration defines the dictionary resource used, in this case "test_entities" (line 8).  This is the dictionary we will be creating.
    • The name "test_entities" follows the pattern "<workspace|session>_entities", where <workspace|session> refers to the prefix used on Elasticsearch to identify the working set of tags.  This can be changed in the configuration file by editing the "indexName" field.
  • The other fields in the JSON structure are stage-specific configuration (a sketch of how this tag document can be indexed follows this list).
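
To make this concrete, the tag document above can be indexed directly with the Elasticsearch Python client.  This is a minimal sketch, not part of the original pipeline: it assumes Elasticsearch is reachable on localhost and a workspace named "test", and it abbreviates the tag document shown above.

Code Block
languagepy
themeFadeToGrey
titlePython - Index the tag document (sketch)

from elasticsearch import Elasticsearch

# Sketch only: assumes Elasticsearch on localhost and workspace "test".
es = Elasticsearch(['localhost'])

# "tag_doc" is the JSON structure shown above (abbreviated here).
tag_doc = {
    "name": "tagname",
    # ... "assigned", "updatedAt", "createdAt" as in the block above ...
}

# Index into the "<workspace>_tags" index so the session can pick it up.
es.index(index='test_tags', doc_type='tag', id='tagname', body=tag_doc)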


This is the basic structure for the dictionary entries:

Code Block
languagejs
themeEclipse
titleJSON document
linenumberstrue
{
	"id": <document_id>,
	"display": <display_label>,
	"fields": {},
	"confAdjust": <confidence_adjustment_value>,
	"updatedAt": <time>,
	"createdAt": <time>,
	"tag": <tag_name>id>,
	"patterns" : ["<pattern1>", "<pattern2>", "<patternN>"]
}	
  • "id" is a unique identifier for the dictionary entry.
  • "display" is the default name of the entry and can be used to normalize the value of the entry if there are multiple patterns assigned to it.
  • "confAdjust" field is the confidence adjustment value assigned to this entry.
  • "createdAt" and "updatedAt" are auditing/control values for the entry.
  • "tag" links the entry to a "tag".
  • "patterns" are the possible values that identify this entry.

This is a basic example of reading from a CSV file and indexing the values into Elasticsearch.  The file is semicolon (";") separated, and the first line contains column names (not needed, so the code skips it).

Code Block
languagetext
themeEclipse
titleSample CSV
linenumberstrue
ID;Display;Confidence;Patterns
C0001;engine;1;engine,motor
C0002;wing;1;wing
C0003;landing gear;1;landing gear,tires
...
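
Each data row maps directly onto the entry fields described above.  Here is a quick sketch of that mapping for the first row (illustrative only):

Code Block
languagepy
themeFadeToGrey
titlePython - Row mapping (sketch)

# Illustrative only: how one data row of the sample CSV maps to entry fields.
row = "C0001;engine;1;engine,motor".split(';')

entry_id    = row[0]              # "C0001"              -> "id"
display     = row[1]              # "engine"             -> "display"
conf_adjust = float(row[2])       # 1.0                  -> "confAdjust"
patterns    = row[3].split(',')   # ["engine", "motor"]  -> "patterns"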


This is the example Python script, which requires the Elasticsearch module (found here) to run.

Code Block
languagepy
themeFadeToGrey
titlePython - Dictionary Import
linenumberstrue
# pip install elasticsearch

import json
from elasticsearch import Elasticsearch
from datetime import datetime

DEFAULT_ES_READ_TIMEOUT = 120
HOSTS = ['localhost']
BATCH_SIZE = 10000
CSV_PATH = 'inputFile.csv'
TAG = 'testy'
WORKSPACE = "test"

EPOCH = datetime.utcfromtimestamp(0)

# Defaults for new database
DEFAULT_PIPELINE = 'baseline-pipeline'
DEFAULT_PROVIDER_NAME = 'saga-provider'


# util method for timestamp
def unix_time_millis(dt):
    return (dt - EPOCH).total_seconds() * 1000.0


# util class for document indexing
class ElasticSearchClient(object):
    def __init__(self):
        self.es_client = Elasticsearch(HOSTS, timeout=DEFAULT_ES_READ_TIMEOUT)
        self.batch_size = BATCH_SIZE

    def publish(self, index, doc, doc_type, id=None):
        self.es_client.index(index=index, doc_type=doc_type, body=doc, id=id)


def main():

    es_client = ElasticSearchClient()
    # json document
    tag_doc = {
        'name': TAG,
        'assigned': {
            'DictionaryTaggerStage': {
                'stage': 'DictionaryTaggerStage',
                'display': 'Entity',
                'config': {
                    'dictionary': DEFAULT_PROVIDER_NAME + ':' + WORKSPACE + '_entities',
                    'skipFlags': [],
                    'boundaryFlags': [
                        'TEXT_BLOCK_SPLIT'
                    ],
                    'requiredFlags': [
                        'TOKEN'
                    ],
                    'atLeastOneFlag': [
                        'ALL_LOWER_CASE'
                    ],
                    'debug': False
                },
                'enable': True,
                'baseline-pipeline': DEFAULT_PIPELINE
            }

        },
        # use UTC so the value matches the UTC-based EPOCH above
        'updatedAt': unix_time_millis(datetime.utcnow()),
        'createdAt': unix_time_millis(datetime.utcnow()),
    }

    es_client.publish(WORKSPACE + '_tags', tag_doc, 'tag', TAG)

    with open(CSV_PATH, encoding='utf8') as fp:
        fp.readline()  # skip the header line (column names)
        for line in fp:
            row = line.split(';')

            try:
                if len(row) >= 4:  # ID;Display;Confidence;Patterns
                    print(row)
                    entry_doc = {
                        'id': row[0].strip(),
                        'display': row[1].strip(),
                        'fields': {},
                        'confAdjust': float(row[2].strip()),
                        'updatedAt': unix_time_millis(datetime.utcnow()),
                        'createdAt': unix_time_millis(datetime.utcnow()),
                        'tag': TAG,
                        'patterns': row[3].strip().split(',')

                    }
                    es_client.publish(WORKSPACE + '_entities', entry_doc, 'entity')
                else:
                    print("Missing tabs " + line)
            except MemoryError:
                print("Error on: " + line)
    print('Done')


if __name__ == '__main__':
    main()
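
Once the script has run, the import can be sanity-checked by searching the entities index.  This is a minimal sketch assuming the same localhost setup and workspace "test" as above:

Code Block
languagepy
themeFadeToGrey
titlePython - Verify the import (sketch)

# Minimal sanity check: assumes the script above has run and Elasticsearch
# is reachable on localhost.  Searches the entities index for one pattern.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost'])
result = es.search(index='test_entities', body={
    'query': {'match': {'patterns': 'engine'}}
})
print(result['hits']['total'])  # number of matching entries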