The extract-text component takes the input stream or the input array of bytes and uses Apache Tika to extract the text and metadata from the stream.

Extract Text
Factory Name | com.accenture.aspire:aspire-extract-text
subType | default
Inputs | object['contentStream'] or object['contentBytes'] (the content to be parsed)
Outputs | <content> holds the text content extracted from the document. Other metadata output is available and is mapped with the metadata mapper (see below).

Determining the Parser

The method for determining which Apache Tika text extractor to use is as follows:

  1. If a <mimeType> element exists within the AspireObject, use it to look up the parser type.
  2. Otherwise, allow Apache Tika to auto-detect the correct text extractor.
    • <fetchUrl> (if it exists) or <url> is set as the Apache Tika "resourceName" to help it automatically determine the correct parser to use.
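
As an illustration of the first rule, an incoming AspireObject might look like the sketch below. The text/html value is an assumption chosen for illustration, and the URL is borrowed from the example output on this page. With <mimeType> present, the parser would be looked up from it; if <mimeType> were absent, <fetchUrl> would instead be passed to Apache Tika as the "resourceName" hint for auto-detection.

```xml
<doc>
  <!-- Used as the Tika "resourceName" only when no mimeType is present -->
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl>
  <!-- Illustrative value: when present, it drives the parser lookup -->
  <mimeType>text/html</mimeType>
</doc>
```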

Extraction Timeouts

If the extraction takes too long, the thread doing the extraction is forcibly stopped using Thread.interrupt(). If that does not work after three retries, Thread.stop() is used with a forced NullPointerException.

This safeguard exists because Apache Tika contains bugs that cause infinite loops for some types of HTML documents.

The situation should be carefully monitored, because if too many of these exceptions occur, Aspire could become unstable.

Configuration

Element | Type | Default | Description
extractTimeout | int | 180000 (3 minutes) | Maximum time to wait (in ms) for the text extraction. The maximum value is 180000000 (3000 minutes).
maxCharacters | int/String | 1,000,000 | Maximum number of characters to extract from the document, including white spaces. If the limit is exceeded, the extracted text will be truncated. Use a numeric value or "unlimited."
metadataMap | | see below | Standard Metadata Mapper configuration. See below.
wordPerTag | boolean | true | (2.2.1 Release) If words are to be split per XML/HTML tag.
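
A minimal sketch of how these elements could be combined in a component definition is shown here. The values are illustrative assumptions rather than recommendations: a one-minute timeout, no character limit, and per-tag word splitting turned off.

```xml
<component name="ExtractText" subType="default" factoryName="aspire-extract-text">
  <!-- Illustrative: stop waiting for extraction after 60 seconds (value in ms) -->
  <extractTimeout>60000</extractTimeout>
  <!-- Illustrative: a numeric limit or "unlimited" is accepted -->
  <maxCharacters>unlimited</maxCharacters>
  <!-- Illustrative: override the default of true -->
  <wordPerTag>false</wordPerTag>
</component>
```

Any element left out keeps the default listed in the table.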


Metadata Mapper Configuration

The Extract Text stage contains a large number of additional metadata fields which can be mapped to fields in the AspireObject XML. See Apache Tika for a description of all of the metadata fields extracted. The fields listed here are mapped by default. The mappings are specified in order: when two mappings are both possible, the one listed higher is preferred over the one listed lower.

For more information on metadata formats used below, see:


Apache Tika Field | Default Output Field | Description
DC.title | title | The Dublin Core title of the document.
DC.date | modificationDate | Dublin Core last modified date, converted to ISO 8601 format.
DC.description | description | Dublin Core description.
DC.contributor | contributor | Dublin Core contributor name.
title | title | Any other title (such as PDF title or HTML title) that Apache Tika is able to extract from the document.
created | creationDateTime | The creation date-time, typically from PDF properties. Formatted as an ISO 8601 date-time.
Last-Modified | modificationDateTime | Last modified date-time. Formatted as an ISO 8601 date-time.
Author | author | Author name. Typically from PDF document properties.
Content-Type | contentType | The HTTP formatted content type of the document.
description | description | Document description from either HTML meta fields or PDF document properties.
language | language | Auto-detected language code from Apache Tika.
Keywords | keywords | Keywords field from either HTML meta fields or PDF document properties.

Example Configurations

Simple

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text" />

Complex

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text">
    <extractTimeout>60000</extractTimeout>
    <tikaConfig>config/my-tika-config.xml</tikaConfig>
    <!-- note that all of the default mappings are included automatically -->
    <metadataMap>
      <map from="Keywords" to="newKeywordsField"/>
      <map from="description" to="newDescriptionField"/>
    </metadataMap>
  </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  .
  .
  .
  <title source="ExtractTextStage/title">Search Technologies: The independent enterprise search experts</title> 
  <description source="ExtractTextStage/description">We advise companies on enterprise search product selection, and we provide efficient, cost effective implementation and integration services. Search Technologies are the expert in the search space</description> 
  <language source="ExtractTextStage/language">en</language> 
  <extension source="ExtractTextStage">
    <field name="Content-Language">en</field> 
    <field name="Content-Encoding">ISO-8859-1</field> 
    <field name="resourceName">http://www.searchtechnologies.com</field> 
  </extension>
  <content source="ExtractTextStage">
  <![CDATA[ 
	Home
	About Us	Executive Team
	Careers
	Solutions	Enterprise Search Consulting
	Microsoft/Fast ESP Services
	Google Search Appliance
	Open Source Enterprise Search
	SharePoint Search
	RetrievalWare Support & Migration
	Image Management
  .
  .
  .
  ]]>
  </content>
</doc>