Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
properties | | | | Configuration object | |
tesseractPath | Yes | - | No | Complete path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
maxSize | Yes | 10mb | No | Apply image correction only to images that fall under this size (e.g. 250kb, 5mb, 1gb) | 10mb |
confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
outputFormat | Yes | - | No | Image format of the output | png |
imageType | Yes | - | No | Image color scale of the output | bilevel |
dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type | /doc/normalizedMimeType |
pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page | 0 |
endPage | Yes | 20 | No | Last page to process with OCR | 20 |
processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the number of process threads | 30 |
backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
debug | No | false | No | Enables debug messages. | false |
NOTE: The following structure is not ordered by the sections of the component configuration, as found on the Elastic Cache Lookup App Bundle page
    {
      "type": "application",
      "_type": "application",
      "appName": "Tesseract Ocr",
      "appType": "tesseract-ocr",
      "config": "com.accenture.aspire:app-ocr-processor",
      "description": "tesseract-ocr",
      "properties": {
        "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
        "processTimeout": 600000,
        "imageDirectory": "C:\\dev\\tempDir",
        "maxSize": "10mb",
        "confidenceThreshold": 80,
        "outputFormat": "png",
        "imageType": "bilevel",
        "dpi": 300,
        "mimeTypeXPath": "/doc/normalizedMimeType",
        "pdfMimeTypes": "aspire/pdf",
        "imageMimeTypes": "aspire/drawing",
        "startPage": 0,
        "endPage": 20,
        "processThreads": 8,
        "processQueue": 30,
        "backoffTime": 1000,
        "debug": true
      }
    }
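Before deploying a configuration like the one above, it can help to check it against the constraints stated in the field table. The following is a minimal Python sketch, not part of the component itself. The function name `check_ocr_properties` is hypothetical, and two of its checks are assumptions: that `endPage` should not be smaller than `startPage`, and that `confidenceThreshold` is a percentage between 0 and 100. Only the queue-size rule comes directly from the table.

```python
def check_ocr_properties(props):
    """Return a list of problems found in a "properties" object (hypothetical helper)."""
    problems = []

    # From the field table: processQueue should be at least 3 times processThreads.
    if props["processQueue"] < 3 * props["processThreads"]:
        problems.append("processQueue should be at least 3 times processThreads")

    # Assumption: the page range must not be inverted.
    if props["endPage"] < props["startPage"]:
        problems.append("endPage is smaller than startPage")

    # Assumption: confidence is expressed on a 0-100 scale.
    if not 0 <= props["confidenceThreshold"] <= 100:
        problems.append("confidenceThreshold is outside 0-100")

    return problems


# Values taken from the sample configuration.
sample = {
    "startPage": 0,
    "endPage": 20,
    "processThreads": 8,
    "processQueue": 30,
    "confidenceThreshold": 80,
}

print(check_ocr_properties(sample))
```

With the sample values, the queue size of 30 satisfies the 3 x 8 = 24 minimum, so no problems are reported.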
Field | Required | Default | Multiple | Notes | Example |
---|---|---|---|---|---|
description | Yes | - | No | Name of the component application. | "tesseract-ocr" |
properties | | | | Configuration object | |
tesseractPath | Yes | - | No | Complete path where the Tesseract application is installed | C:\Tesseract-OCR\tesseract |
processTimeout | Yes | 600000 | No | Maximum time (in milliseconds) to wait for the process | 600000 |
imageDirectory | Yes | - | No | Directory used to store the temporary files generated during OCR | C:\tempDir |
maxSize | Yes | 10mb | No | Apply image correction only to images that fall under this size (e.g. 250kb, 5mb, 1gb) | 10mb |
confidenceThreshold | Yes | 80.0 | No | Minimum confidence value to accept the OCR output | 80.0 |
outputFormat | Yes | - | No | Image format of the output | png |
imageType | Yes | - | No | Image color scale of the output | bilevel |
dpi | Yes | 300 | No | Image dots per inch of the output | 300 |
mimeTypeXPath | Yes | /doc/mimeType | No | XPath expression to get the document MIME type | /doc/normalizedMimeType |
pdfMimeTypes | Yes | - | Yes | MIME type for PDF documents | aspire/pdf |
imageMimeTypes | Yes | - | No | MIME type for image documents | aspire/drawing |
startPage | Yes | 0 | No | Page to start processing with OCR. If the value is 0, processing starts from the first page | 0 |
endPage | Yes | 20 | No | Last page to process with OCR | 20 |
processThreads | Yes | 8 | No | Maximum number of threads used by the application | 8 |
processQueue | Yes | 30 | No | Size of the application process queue; should be at least 3 times the number of process threads | 30 |
backoffTime | Yes | 1000 | No | Time (in milliseconds) to wait before trying to add a job to the queue when it is full | 1000 |
debug | No | false | No | Enables debug messages. | false |
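The `maxSize` row accepts values such as 250kb, 5mb, or 1gb. As an illustration of how such a string might be interpreted, here is a small Python sketch. It is not the component's actual parser: the helper names `parse_size` and `needs_correction` are hypothetical, the 1024-based multipliers are an assumption, and so is treating an image exactly at the limit as eligible for correction.

```python
import re

# Assumption: units are binary multiples (1kb = 1024 bytes).
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '250kb' or '10mb' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def needs_correction(image_bytes, max_size="10mb"):
    """Return True when an image is small enough for image correction."""
    # Assumption: the limit itself is inclusive.
    return image_bytes <= parse_size(max_size)


print(parse_size("10mb"))                    # 10485760 under the 1024-based assumption
print(needs_correction(5 * 1024 * 1024))     # True: 5 MiB is below the 10mb default
```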
    {
      "description": "tesseract-ocr",
      "properties": {
        "tesseractPath": "C:\\Tesseract-OCR\\tesseract",
        "processTimeout": 600000,
        "imageDirectory": "C:\\dev\\tempDir",
        "maxSize": "10mb",
        "confidenceThreshold": 80,
        "outputFormat": "png",
        "imageType": "bilevel",
        "dpi": 300,
        "mimeTypeXPath": "/doc/normalizedMimeType",
        "pdfMimeTypes": "aspire/pdf",
        "imageMimeTypes": "aspire/drawing",
        "startPage": 0,
        "endPage": 20,
        "processThreads": 8,
        "processQueue": 30,
        "backoffTime": 1000,
        "debug": true
      }
    }
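The `processQueue` and `backoffTime` settings describe a bounded queue with a wait-and-retry step when the queue is full. The Python sketch below illustrates that general pattern only; it is not the component's implementation. The function `submit_with_backoff` and its `max_attempts` parameter are hypothetical, since the source does not say how many times a submission is retried.

```python
import queue
import time


def submit_with_backoff(job_queue, job, backoff_ms=1000, max_attempts=5):
    """Try to enqueue a job, sleeping backoff_ms between attempts while the queue is full.

    max_attempts is a hypothetical limit added for this sketch.
    """
    for _ in range(max_attempts):
        try:
            job_queue.put_nowait(job)
            return True
        except queue.Full:
            # Queue is at capacity: wait for the backoff period, then retry.
            time.sleep(backoff_ms / 1000.0)
    return False


# Queue capacity mirrors processQueue from the sample configuration.
jobs = queue.Queue(maxsize=30)
accepted = submit_with_backoff(jobs, "page-1", backoff_ms=1000)
print(accepted)
```

Sizing the queue at several times the thread count, as the table recommends, gives worker threads a buffer of pending jobs so that producers rarely hit the backoff path.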