The Elasticsearch Cache Lookup is a workflow component for Aspire.


Elastic Cache Lookup
Factory Name: com.accenture.aspire:aspire-elastic-cache-lookup
subType: job-input
Inputs: The field from which you want to get the value and a field to be created in the Aspire document.
Outputs: Aspire object that contains a subjob with metadata and checked out content extracted from a specific index in Elasticsearch.


Configuration


This section lists all configuration parameters available for the Elastic Cache Lookup component.


| Element | Type | Default | Description |
|---|---|---|---|
| Elasticsearch | | | |
| Server url | text | - | Complete URL where the feeds will be sent, e.g. http://localhost:9200/bulk_ |
| Authentication | alternatives | None | User with the permissions to read from the Elastic index specified. |
| Index | text | | |
| Connection Settings | | | |
| Idle connection timeout | number | 3600000 | Maximum time (in milliseconds) to keep an idle connection open. |
| Max connections | number | 100 | Maximum number of connections to be opened. |
| Connections per target | number | 10 | Maximum number of connections opened for the same target. |
| Connection timeout | number | 15000 | Maximum time (in milliseconds) to wait for the connection. |
| Socket timeout | number | 15000 | Maximum time (in milliseconds) to wait for a socket response. |
| Connection throttling | boolean | false | Check to enable connection throttling. |
| Throttling period | number | 5000 | Time period (in milliseconds) to throttle the connection. |
| Max connections per period | number | 500 | Maximum number of connections used during the throttling period. |
| Maximum retries | number | 3 | Maximum number of retries for a failed document. |
| Retry delay | number | 5000 | Time (in milliseconds) to wait before a retry. |
| Cache | | | |
| Use cache | alternatives | true | Whether results should be cached in memory. |
| Cache Eviction Policy | alternatives | size | How items are selected for deletion from the in-memory cache. |
| Max number of entries | number | 1000 | Maximum total number of entries to keep in the cache. |
| Max Total Weight (MB) | number | 500 | Maximum total weight of entries the cache may contain. |
| Time (min) | number | 5 | Remove records that have been idle for this many minutes. |
| Lookup Fields | | | |
| Index lookup field | text | - | Elastic index field name used for the lookup. |
| Source lookup field | text | - | Field name from the incoming AspireObject used for the lookup. The field is searched first in 'doc' and then in the 'doc.connectorSpecific' section. |
| Uppercase the source lookup field value | boolean | true | Convert the value of the source field to UPPERCASE. |
| Lookup output field | text | - | Output fields from the lookup will be placed under this configured object. |
| Debug | boolean | false | Check to enable debug messages. |
| Hit size | number | 1000 | Maximum number of hits returned by the cache lookup. If -1, all hits will be returned. |
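
The resolution order described for the source lookup field (first 'doc', then 'doc.connectorSpecific', with optional uppercasing) can be illustrated with a small Python sketch. This is not the component's implementation: the document is modeled as a plain dictionary, and the function name `resolve_source_value` and the sample field values are assumptions made only for illustration.

```python
from typing import Any, Optional


def resolve_source_value(doc: dict, field: str, to_upper: bool = True) -> Optional[str]:
    """Return the lookup value for `field`, or None if the field is absent.

    Hypothetical helper: `doc` stands in for the incoming Aspire document.
    """
    # Look at the top level of the document first ...
    value: Any = doc.get(field)
    # ... then fall back to the connectorSpecific section.
    if value is None:
        connector_specific = doc.get("connectorSpecific")
        if isinstance(connector_specific, dict):
            value = connector_specific.get(field)
    if value is None:
        return None
    text = str(value)
    # Mirrors the "Uppercase the source lookup field value" option.
    return text.upper() if to_upper else text


# Sample document where the field only exists under connectorSpecific.
sample_doc = {"title": "report", "connectorSpecific": {"myid": "abc-123"}}
print(resolve_source_value(sample_doc, "myid"))          # ABC-123
print(resolve_source_value(sample_doc, "myid", False))   # abc-123
print(resolve_source_value(sample_doc, "missing"))       # None
```

Uppercasing the source value only helps if the values stored in the index lookup field are uppercase as well, so the option is worth checking against the indexed data.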

Example Configuration

Code Block
"Elasticsearch Settings": [
    {
        "url": "http://localhost:9200",
        "authType": "none",
        "index": "index_name"
    }
],
"Connection Settings": [
    {
        "idleConnectionTimeout": 3600000,
        "maxConnections": 100,
        "maxConnectionsPerRoute": 10,
        "connectionTimeout": 15000,
        "socketTimeout": 15000,
        "useThrottling": false,
        "maxRetries": 3,
        "retryWaitTime": 5000
    }
],
"Cache": [
    {
        "cache": true,
        "eviction": "size",
        "evictionMaxSize": 1000
    }
],
"Lookup Fields": [
    {
        "esIndexLookupField": "indexName",
        "sourceLookupField": "myid",
        "sourceLookupFieldToUpperCase": true,
        "lookupOutputField": "myidOutput",
        "debug": false,
        "size": 1000
    }
]
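
To give a feel for what the "size" eviction setting with "evictionMaxSize" means in practice, here is a minimal Python sketch of a size-bounded in-memory cache placed in front of a lookup function. The page does not state which entry the component evicts first, so the least-recently-used order used here is an assumption, and the class name `BoundedLookupCache` and the `fake_fetch` function are hypothetical stand-ins for the real Elasticsearch query.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedLookupCache:
    """Size-bounded cache; evicts the least recently used entry (assumed policy)."""

    def __init__(self, fetch: Callable[[Hashable], Any], max_entries: int = 1000):
        self._fetch = fetch              # called on a cache miss
        self._max_entries = max_entries  # plays the role of evictionMaxSize
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: fetch the value and store it.
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Drop the entry that was used longest ago.
            self._entries.popitem(last=False)
        return value


calls = []


def fake_fetch(key):
    """Hypothetical stand-in for a query against the Elastic index."""
    calls.append(key)
    return {"id": key}


cache = BoundedLookupCache(fake_fetch, max_entries=2)
for key in ["A", "B", "A", "C", "B"]:
    cache.get(key)

# "A" is served from the cache once; "B" is evicted by "C" and fetched again.
print(calls)  # ['A', 'B', 'C', 'B']
```

With a small limit, frequently repeated lookup values stay cached while rarely used ones are fetched again, which is the trade-off the "Max number of entries" setting controls.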