Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
description | Yes | - | No | Name of the component application. | "Elastic Cache Lookup" |
properties | Yes | - | No | Configuration object. | |
Server url | Yes | - | No | Complete URL where the feeds will be sent. | http://localhost:9200/bulk_ |
Authentication | No | None | Yes | User with the permissions to read from the Elastic index specified. | none, basic, aws |
Index | Yes | - | No | The Elastic index to crawl. Index name limitations: 1) Lowercase only. 2) Cannot include \\, \/, ?, \", <, >, \|, (space character), ,, # 3) Cannot start with -, _, + 4) | [{"index":"index1"}] |
Idle connection timeout | Yes | 3600000 | No | Maximum time (in milliseconds) to keep an idle connection open. | 3600000 |
Max connections | Yes | 100 | No | Maximum number of connections to be opened. | 100 |
Connections per target | Yes | 10 | No | Maximum number of connections opened for the same target. | 10 |
Connection timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for the connection. | 15000 |
Socket timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for a socket response. | 15000 |
Throttling period | Yes | 5000 | No | Time period (in milliseconds) to throttle the connection. | 5000 |
Max connections per period | Yes | 500 | No | Maximum number of connections used during the throttling period. | 500 |
Maximum retries | Yes | 3 | No | Maximum number of retries for a failed document. | 3 |
Retry delay | Yes | 5000 | No | Time (in milliseconds) to wait before a retry. | 5000 |
Max number of entries | No | 1000 | No | Max total number of entries to keep in the cache. | 1000 |
Max Total Weight (MB) | No | 500 | No | Maximum total weight (in MB) of the entries the cache may hold. | 500 |
Time (min) | No | 5 | No | Remove records that have been idle for this number of minutes. | 5 |
Index lookup field | Yes | - | No | Elastic index field name for the lookup. | [{"index":"index1"}] |
Source lookup field | Yes | - | No | Specify field name from the incoming AspireObject for the lookup. Field availability will be searched first in 'doc' and then in 'doc.connectorSpecific' section. | myid |
Uppercase the source lookup field value | No | true | No | Convert the value of the source lookup field to uppercase. | FALSE |
Lookup output field | Yes | - | No | Output fields from the lookup will be placed under this configured object. | myidOutput |
Debug | No | false | No | Enables debug messages. | FALSE |
Hit size | No | 1000 | No | Maximum number of hits returned by the cache lookup. If -1, all hits are returned. | 1000 |
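
The "Source lookup field" and "Uppercase the source lookup field value" rows describe a small resolution rule: look in 'doc' first, then in 'doc.connectorSpecific', and optionally uppercase the value. The Python sketch below illustrates that rule only. It models the AspireObject as a plain dictionary, and the function name `resolve_lookup_value` is hypothetical; neither is part of the component.

```python
from typing import Optional


def resolve_lookup_value(aspire_object: dict, field: str, to_upper: bool = True) -> Optional[str]:
    """Return the lookup key for `field`, or None when the field is absent.

    Assumption: the AspireObject is modelled as nested dictionaries.
    """
    doc = aspire_object.get("doc", {})
    # Search 'doc' first, then fall back to 'doc.connectorSpecific'.
    for section in (doc, doc.get("connectorSpecific", {})):
        if field in section:
            value = str(section[field])
            return value.upper() if to_upper else value
    return None


if __name__ == "__main__":
    job = {"doc": {"connectorSpecific": {"myid": "abc-1"}}}
    print(resolve_lookup_value(job, "myid"))
```

With the default `to_upper=True`, which mirrors the documented default of the uppercase option, the value found under 'doc.connectorSpecific' is returned in uppercase.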
NOTE: The following structure is not ordered by the sections of the component configuration, as found on the Elastic Cache Lookup App Bundle page
Code Block | ||||
---|---|---|---|---|
| ||||
{ "description": "Elastic Cache Lookup", "properties": { "url": "http://localhost:9200", "authType": "none", "index": "index_name", "idleConnectionTimeout": 3600000, "maxConnections": 100, "maxConnectionsPerRoute": 10, "connectionTimeout": 15000, "socketTimeout": 15000, "useThrottling": false, "maxRetries": 3, "retryWaitTime": 5000, "cache": true, "eviction": "size", "evictionMaxSize": 1000, "esIndexLookupField": "indexNaame", "sourceLookupField": "myid", "sourceLookupFieldToUpperCase": false, "lookupOutputField": "myidOutput", "debug": false, "size": 1000 } } |
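
A configuration like the one above can be sanity-checked before it is deployed. The following Python sketch is an illustration, not part of the component. It assumes that the JSON keys `url`, `index`, `esIndexLookupField`, `sourceLookupField` and `lookupOutputField` correspond to the required table fields without a default, and it checks only index name limitations 1 to 3, because limitation 4 is not spelled out in the table. The helper names are hypothetical.

```python
import json

# Characters an index name cannot include (limitation 2 in the table).
FORBIDDEN_CHARS = set('\\/?"<>| ,#')
# Prefixes an index name cannot start with (limitation 3 in the table).
FORBIDDEN_PREFIXES = ("-", "_", "+")
# Assumed mapping of the required table fields to JSON property keys.
REQUIRED_PROPERTIES = ("url", "index", "esIndexLookupField",
                       "sourceLookupField", "lookupOutputField")


def index_name_problems(name: str) -> list:
    """Return the documented limitations that `name` violates."""
    problems = []
    if name != name.lower():
        problems.append("must be lowercase")
    if any(ch in FORBIDDEN_CHARS for ch in name):
        problems.append("contains a forbidden character")
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("starts with -, _ or +")
    return problems


def check_config(raw: str) -> list:
    """Parse a JSON configuration and list missing or invalid settings."""
    props = json.loads(raw).get("properties", {})
    problems = [f"missing required property: {key}"
                for key in REQUIRED_PROPERTIES if key not in props]
    if "index" in props:
        problems += [f"index {p}" for p in index_name_problems(props["index"])]
    return problems
```

An empty list from `check_config` means that none of the checked rules were violated; it does not guarantee that the component accepts the configuration.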
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
description | Yes | - | No | Name of the component application. | "Elastic Cache Lookup" |
properties | Yes | - | No | Configuration object. | |
Server url | Yes | - | No | Complete URL where the feeds will be sent. | http://localhost:9200/bulk_ |
Authentication | No | None | Yes | User with the permissions to read from the Elastic index specified. | none, basic, aws |
Index | Yes | - | No | The Elastic index to crawl. Index name limitations: 1) Lowercase only. 2) Cannot include \\, \/, ?, \", <, >, \|, (space character), ,, # 3) Cannot start with -, _, + 4) | [{"index":"index1"}] |
Idle connection timeout | Yes | 3600000 | No | Maximum time (in milliseconds) to keep an idle connection open. | 3600000 |
Max connections | Yes | 100 | No | Maximum number of connections to be opened. | 100 |
Connections per target | Yes | 10 | No | Maximum number of connections opened for the same target. | 10 |
Connection timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for the connection. | 15000 |
Socket timeout | Yes | 15000 | No | Maximum time (in milliseconds) to wait for a socket response. | 15000 |
Throttling period | Yes | 5000 | No | Time period (in milliseconds) to throttle the connection. | 5000 |
Max connections per period | Yes | 500 | No | Maximum number of connections used during the throttling period. | 500 |
Maximum retries | Yes | 3 | No | Maximum number of retries for a failed document. | 3 |
Retry delay | Yes | 5000 | No | Time (in milliseconds) to wait before a retry. | 5000 |
Max number of entries | No | 1000 | No | Max total number of entries to keep in the cache. | 1000 |
Max Total Weight (MB) | No | 500 | No | Maximum total weight (in MB) of the entries the cache may hold. | 500 |
Time (min) | No | 5 | No | Remove records that have been idle for this number of minutes. | 5 |
Index lookup field | Yes | - | No | Elastic index field name for the lookup. | [{"index":"index1"}] |
Source lookup field | Yes | - | No | Specify field name from the incoming AspireObject for the lookup. Field availability will be searched first in 'doc' and then in 'doc.connectorSpecific' section. | myid |
Uppercase the source lookup field value | No | true | No | Convert the value of the source lookup field to uppercase. | TRUE |
Lookup output field | Yes | - | No | Output fields from the lookup will be placed under this configured object. | myidOutput |
Debug | No | false | No | Enables debug messages. | TRUE |
Hit size | No | 1000 | No | Maximum number of hits returned by the cache lookup. If -1, all hits are returned. | 1000 |
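
The "Max number of entries" and "Time (min)" settings bound the cache by entry count and by idle time. The table does not say which eviction policy the component uses, so the Python sketch below is only a toy model: it assumes least-recently-used eviction when the entry limit is exceeded, and the class name `LookupCache` and its methods are hypothetical.

```python
import time
from collections import OrderedDict


class LookupCache:
    """Toy cache bounded by entry count and idle time (illustrative only)."""

    def __init__(self, max_entries=1000, idle_minutes=5, clock=time.monotonic):
        self.max_entries = max_entries
        self.idle_seconds = idle_minutes * 60
        self.clock = clock
        self._entries = OrderedDict()  # key -> (value, last access time)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, last_access = item
        if self.clock() - last_access > self.idle_seconds:
            del self._entries[key]  # idle for too long: remove the record
            return None
        self._entries[key] = (value, self.clock())
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (value, self.clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # assumed policy: drop least recently used
```

The defaults of the constructor mirror the table defaults of 1000 entries and 5 minutes. The "Max Total Weight (MB)" setting is not modelled here.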
Code Block | ||
---|---|---|
| ||
{ "description": "Elastic Cache Lookup", "properties": { "url": "http://localhost:9200", "authType": "none", "index": "index_name", "idleConnectionTimeout": 3600000, "maxConnections": 100, "maxConnectionsPerRoute": 10, "connectionTimeout": 15000, "socketTimeout": 15000, "useThrottling": true, "maxRetries": 3, "retryWaitTime": 5000, "cache": true, "eviction": "size", "evictionMaxSize": 1000, "esIndexLookupField": "indexNaame", "sourceLookupField": "myid", "sourceLookupFieldToUpperCase": true, "lookupOutputField": "myidOutput", "debug": true, "size": 1000 } } |