The "Text Extraction" section of the Connector contains the configuration elements specific to text extraction. They are described below:

  • Enable Text Extraction: check this option if document text extraction is required.
    • Override default settings: check this option to override default text extraction settings.
      • Maximum Size: maximum number of characters for extracted text. If "unlimited" is specified as the value, no limit will be applied to the number of extracted text characters. This value will be ignored if the "HTML Output" option is enabled.
      • Timeout: maximum time, in milliseconds, to wait for text extraction to complete.
      • Nesting Max Depth: maximum depth of a file's inner structure. This limit helps guard against corrupted files and block Denial of Service attacks.
      • HTML Output: formats the output as HTML instead of plain text.
      • Apache Tika Configuration Path: path to the Apache Tika configuration file.
      • Override PDFBox properties: check this option to override the PDFBox settings.
        • Enable "Autospace": if enabled, the parser will estimate where spaces should be inserted between words. This is necessary for many PDFs as they do not include explicit whitespace characters.
        • Enable "SupressDuplicateOverlappingText": if enabled, the parser will try to remove duplicated text that overlaps the same region. This is needed for some PDFs that achieve bolding by rewriting the same text in the same area. For some versions of PDFBox (PDFBOX-956/PDFBOX-1155), this option has been reported to slow down extraction substantially and to sometimes remove characters that were not in fact duplicated.
        • Enable "ExtractAnnotationText": if enabled, text annotations will be extracted.
        • Enable "SortByPosition": if enabled, text tokens will be sorted by their x/y position before text is extracted. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example, if the page has two columns, their text will be interleaved).
        • Enable "ExtractAcroFormContent": if enabled, content from AcroForms is extracted at the end of the document.
        • Enable "ExtractInlineImages": if enabled, inline embedded OBXImages will be extracted. Be advised that some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling more than 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Be cautious when enabling this option.
        • Enable "ExtractUniqueInlineImagesOnly": multiple pages within a PDF file might refer to the same underlying image. If this option is disabled, the parser will call the EmbeddedExtractor each time the image appears on a page, which might be desired for some use cases. To avoid duplication of extracted images, enable this option.
      • Non-Text Document Filtering: check this option to enable non-text document filtering.
        • Open data stream for non-text documents: enable if the workflow needs a data stream for non-text documents.
        • Identify By: select how non-text documents are identified.
          • Extension List: select to specify a comma-separated list of non-text document extensions.
          • Regex List File: select to specify the path to a file that contains a list of regular expressions that match the non-text documents. The file must contain one regular expression per line.
      • Metadata Mapping: maps extracted fields to a corresponding destination field. See this page for more details.
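
The two "Identify By" methods can be illustrated with a minimal Python sketch. It is not the connector's implementation: the function names are hypothetical, and it assumes details this page does not specify, namely that extensions are compared case-insensitively with or without a leading dot, that blank lines in the regex file are skipped, and that a document counts as non-text when any pattern matches anywhere in its name.

```python
import re
from pathlib import PurePosixPath


def parse_extension_list(value):
    """Parse a comma-separated extension list, e.g. "jpg, .PNG,zip"."""
    extensions = set()
    for item in value.split(","):
        # Assumption: a leading dot and letter case are not significant.
        ext = item.strip().lstrip(".").lower()
        if ext:
            extensions.add(ext)
    return extensions


def parse_regex_lines(lines):
    """Compile one regular expression per non-blank line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_non_text_by_extension(name, extensions):
    # Only the final extension is compared ("a.tar.gz" -> "gz").
    suffix = PurePosixPath(name).suffix.lstrip(".").lower()
    return suffix in extensions


def is_non_text_by_regex(name, patterns):
    # Assumption: a match anywhere in the name is enough.
    return any(pattern.search(name) for pattern in patterns)


extensions = parse_extension_list("jpg, .PNG,zip")
patterns = parse_regex_lines([r"\.tar\.gz$", "", r"^/media/"])

for name in ["/docs/photo.PNG", "/docs/report.pdf", "/media/readme.txt"]:
    by_ext = is_non_text_by_extension(name, extensions)
    by_regex = is_non_text_by_regex(name, patterns)
    print(name, by_ext, by_regex)
```

For a real regex list file, the lines could be read with `open(path, encoding="utf-8")` and passed to `parse_regex_lines` directly, since iterating over a file object yields one line at a time.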