The Elasticsearch Cache Lookup is a workflow component for Aspire.

Elastic Cache Lookup
Factory Name	com.accenture.aspire:aspireapp-elasticocr-cache-lookupprocessor
subType	job-input
Inputs	The field from which you want to get the value and a field to be created in the Aspire document.
Outputs	Aspire object that contains a subjob with metadata and checked-out content extracted from a specific index in Elasticsearch.

Configuration

This section lists all configuration parameters available for the Elastic Cache Lookup component.

	Element	Type	Default	Description
Elasticsearch	Server url	text	-	Complete URL where the feeds will be sent, e.g. http://localhost:9200/bulk_
	Authentication	alternatives	None	User with the permissions to read from the specified Elastic index.
	Index	text		
Connection Settings	Idle connection timeout	number	3600000	Maximum time (in milliseconds) to keep an idle connection open.
	Max connections	number	100	Maximum number of connections to be opened.
	Connections per target	number	10	Maximum number of connections opened for the same target.
	Connection timeout	number	15000	Maximum time (in milliseconds) to wait for the connection.
	Socket timeout	number	15000	Maximum time (in milliseconds) to wait for a socket response.
	Connection throttling	boolean	false	Select to enable connection throttling.
	Throttling period	number	5000	Time period (in milliseconds) to throttle the connection.
	Max connections per period	number	500	Maximum number of connections used during the throttling period.
	Maximum retries	number	3	Maximum number of retries for a failed document.
	Retry delay	number	5000	Time (in milliseconds) to wait before a retry.
Cache	Use cache	alternatives	true	Whether results should be cached in memory.
	Cache Eviction Policy	alternatives	size	How items are selected for deletion from the in-memory cache.
	Max number of entries	number	1000	Maximum total number of entries to keep in the cache.
	Max Total Weight (MB)	number	500	Maximum total weight of the entries the cache may contain.
	Time (min)	number	5	Remove records that have been idle for this amount of time, in minutes.
Lookup Fields	Index lookup field	text	-	Elastic index field name to use for the lookup.
	Source lookup field	text	-	Field name from the incoming AspireObject to use for the lookup. The field is searched first in 'doc' and then in the 'doc.connectorSpecific' section.
	Uppercase the source lookup field value	boolean	true	Convert the value of the source field to UPPERCASE.
	Lookup output field	text	-	Output fields from the lookup are placed under this configured object.
	Debug	boolean	false	Select to enable debug messages.
	Hit size	number	1000	Maximum number of hits returned by the cache lookup. If -1, all hits are returned.
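The source lookup behaviour described for the Lookup Fields settings (search 'doc' first, then 'doc.connectorSpecific', optionally uppercasing the value) can be illustrated with a minimal sketch. It models the AspireObject as a plain Python dictionary, which is an assumption made for illustration only. The function name `resolve_lookup_value` is hypothetical and is not part of the Aspire API.

```python
from typing import Any, Optional


def resolve_lookup_value(doc: dict, source_field: str,
                         to_upper: bool = True) -> Optional[str]:
    """Return the lookup value for source_field, or None when it is absent.

    Hypothetical helper: 'doc' stands in for the AspireObject 'doc' section.
    """
    # Look at the top level of the document first.
    value: Any = doc.get(source_field)
    if value is None:
        # Fall back to the 'connectorSpecific' section.
        connector_specific = doc.get("connectorSpecific") or {}
        value = connector_specific.get(source_field)
    if value is None:
        return None
    text = str(value)
    # Mirrors the "Uppercase the source lookup field value" option.
    return text.upper() if to_upper else text


if __name__ == "__main__":
    print(resolve_lookup_value({"connectorSpecific": {"myid": "ab-1"}}, "myid"))
```

In this sketch a value found directly under 'doc' takes precedence over one found under 'connectorSpecific', matching the search order given in the table.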

Example Configuration

Code Block

{
	"Elasticsearch Settings": [
		{
			"url": "http://localhost:9200",
			"authType": "none",
			"index": "index_name"
		}
	],
	"Connection Settings": [
		{
			"idleConnectionTimeout": 3600000,
			"maxConnections": 100,
			"maxConnectionsPerRoute": 10,
			"connectionTimeout": 15000,
			"socketTimeout": 15000,
			"useThrottling": false,
			"maxRetries": 3,
			"retryWaitTime": 5000
		}
	],
	"Cache": [
		{
			"cache": true,
			"eviction": "size",
			"evictionMaxSize": 1000
		}
	],
	"Lookup Fields": [
		{
			"esIndexLookupField": "indexName",
			"sourceLookupField": "myid",
			"sourceLookupFieldToUpperCase": true,
			"lookupOutputField": "myidOutput",
			"debug": false,
			"size": 1000
		}
	]
}
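To show how the Lookup Fields settings could translate into an Elasticsearch request, the following sketch builds a search body from an index field name, a lookup value, and a hit size. It assumes the lookup is an exact-match term query, which this page does not confirm. The function `build_lookup_query` is a hypothetical helper, and the "all hits" case (a hit size of -1) is deliberately left out.

```python
import json


def build_lookup_query(index_field: str, value: str, size: int = 1000) -> dict:
    """Build an Elasticsearch search body for an exact-match lookup.

    Hypothetical helper: assumes a term query on the index lookup field.
    """
    if size < 1:
        # A hit size of -1 ("all hits") is not handled in this sketch.
        raise ValueError("this sketch only handles a positive hit size")
    return {
        "size": size,
        "query": {"term": {index_field: {"value": value}}},
    }


if __name__ == "__main__":
    # Values mirror the style of the example configuration.
    body = build_lookup_query("indexName", "ABC-123", size=1000)
    print(json.dumps(body, indent=2))
```

The resulting body would be sent to the search endpoint of the configured index. Whether the component uses a term query, a match query, or another mechanism is not stated here.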
