Extract Text

The extract-text component reads the document content, supplied either as an input stream or as an array of bytes, and uses Apache Tika to extract its text and metadata.

Extract Text
Factory Name	com.searchtechnologies.aspire:aspire-extract-text
subType	default
Inputs	object['contentStream'] or object['contentBytes'] (the content to be parsed)
Outputs	<content> holds the text content extracted from the document. Other metadata output is available and is mapped with the metadata mapper (see below).

Determining the Parser

The Apache Tika text extractor is chosen as follows:

1. If a <mimeType> element exists within the AspireObject, use it to look up the parser type.
2. Otherwise, allow Apache Tika to auto-detect the correct text extractor.
   - <fetchUrl> (if it exists) or <url> is set as the Apache Tika "resourceName" to help it automatically determine the correct parser to use.
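
The selection order above can be sketched in plain Java. This is a minimal illustration and not Aspire's actual code: the class ParserSelectionSketch, its choose method, and the Map standing in for the AspireObject are all hypothetical, and the sketch assumes the resource name is only needed on the auto-detect path.

```java
import java.util.Map;

// Hypothetical sketch of the parser selection order; not the Aspire implementation.
final class ParserSelectionSketch {
    final String mimeType;     // non-null when the parser is looked up by MIME type
    final String resourceName; // non-null when Tika auto-detects, used as a hint

    ParserSelectionSketch(String mimeType, String resourceName) {
        this.mimeType = mimeType;
        this.resourceName = resourceName;
    }

    // 'doc' stands in for the AspireObject: element name -> text value.
    static ParserSelectionSketch choose(Map<String, String> doc) {
        String mime = doc.get("mimeType");
        if (mime != null && !mime.isEmpty()) {
            // Step 1: an explicit <mimeType> decides the parser.
            return new ParserSelectionSketch(mime, null);
        }
        // Step 2: auto-detect, preferring <fetchUrl> over <url> as the hint.
        String fetchUrl = doc.get("fetchUrl");
        String hint = (fetchUrl != null) ? fetchUrl : doc.get("url");
        return new ParserSelectionSketch(null, hint);
    }

    public static void main(String[] args) {
        ParserSelectionSketch byMime = choose(Map.of("mimeType", "application/pdf"));
        ParserSelectionSketch byName = choose(Map.of("url", "http://example.com/a.html"));
        System.out.println(byMime.mimeType);
        System.out.println(byName.resourceName);
    }
}
```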

Extraction Timeouts

If the extraction takes too long, the thread doing the extraction is forcibly stopped using Thread.interrupt(). If that does not work after three retries, the thread is stopped using Thread.stop() with a forced NullPointerException.

This was done because Apache Tika contains bugs which cause infinite loops for some types of HTML documents.

The situation should be carefully monitored, because if too many of these exceptions occur, Aspire could become unstable.
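
The interrupt-and-retry part of this behavior can be sketched with standard Java threads. This is an illustration under assumptions, not the component's code: ExtractionTimeoutSketch, run, and the graceMs parameter are hypothetical names, and the sketch deliberately stops after the three interrupts instead of escalating to Thread.stop(), which is deprecated.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a timed extraction; not the Aspire implementation.
final class ExtractionTimeoutSketch {

    // Runs 'task' on its own thread. Returns its result, or null if the task
    // failed or had to be interrupted. timeoutMs and graceMs must be > 0.
    static String run(Callable<String> task, long timeoutMs, long graceMs) {
        AtomicReference<String> result = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                result.set(task.call());
            } catch (Exception e) {
                // Interrupted or failed: the result stays null.
            }
        }, "extract-text-worker");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMs); // wait up to the extraction timeout
            // Up to three interrupt attempts, as described above.
            for (int attempt = 0; attempt < 3 && worker.isAlive(); attempt++) {
                worker.interrupt();
                worker.join(graceMs);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's status
            worker.interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(run(() -> "extracted text", 1000, 100));
        System.out.println(run(() -> { Thread.sleep(60_000); return "never"; }, 100, 100));
    }
}
```

A task that ignores interrupts (for example, a tight loop that never checks its interrupt status) would still be running when run returns, which is the case the component handles with Thread.stop().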

Configuration

Element	Type	Default	Description
extractTimeout	int	180000 (3 minutes)	Maximum time to wait, in milliseconds, for the text extraction. The maximum value is 180000000 (3000 minutes).
maxCharacters	int/String	1,000,000	Maximum number of characters to extract from the document. If the limit is exceeded, the extracted text is truncated. Use a numeric value or "unlimited".
metadataMap		see below	Standard Metadata Mapper configuration. See below.
wordPerTag	boolean	true	(2.2.1 Release) Whether words are split at each XML/HTML tag.
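
As an illustration of how a maxCharacters value might be interpreted, the sketch below accepts either a number or "unlimited" and truncates text that exceeds the limit. The class MaxCharactersSketch, its methods, and the use of -1 to represent "unlimited" are assumptions made for this example and are not taken from the component.

```java
// Hypothetical sketch of maxCharacters handling; not the Aspire implementation.
final class MaxCharactersSketch {
    static final int UNLIMITED = -1; // assumed internal marker for "unlimited"

    // Accepts a numeric string such as "1000000" or the word "unlimited".
    static int parse(String value) {
        String v = value.trim();
        if (v.equalsIgnoreCase("unlimited")) {
            return UNLIMITED;
        }
        return Integer.parseInt(v);
    }

    // Returns the text unchanged when within the limit, otherwise its prefix.
    static String truncate(String text, int limit) {
        if (limit == UNLIMITED || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }

    public static void main(String[] args) {
        System.out.println(truncate("abcdef", parse("3")));
        System.out.println(truncate("abcdef", parse("unlimited")));
    }
}
```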

Metadata Mapper Configuration

The Extract Text stage extracts a large number of additional metadata fields, which can be mapped to fields in the AspireObject XML. See Apache Tika for a description of all of the metadata fields extracted. The following fields are mapped by default. Mappings are applied in the order listed: if two mappings could fill the same output field, the one higher in the list is preferred.

For more information on metadata formats used below, see:

Dublin Core
ISO 8601 date time format

Apache Tika Field	Default Output Field	Description
DC.title	title	The Dublin Core title of the document.
DC.date	modificationDate	Dublin Core last modified date, converted to ISO 8601 format.
DC.description	description	Dublin Core description.
DC.contributor	contributor	Dublin Core contributor name.
title	title	Any other title (such as PDF title or HTML title) that Apache Tika is able to extract from the document.
created	creationDateTime	The creation date-time, typically from PDF properties. Formatted as an ISO 8601 date-time.
Last-Modified	modificationDateTime	Last modified date-time. Formatted as an ISO 8601 date-time.
Author	author	Author name. Typically from PDF document properties.
Content-Type	contentType	The HTTP-formatted content type of the document.
description	description	Document description from either HTML meta fields or PDF document properties.
language	language	Auto-detected language code from Apache Tika.
Keywords	keywords	Keywords field from either HTML meta fields or PDF document properties.
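
The ordering rule can be illustrated with a small sketch in which the first mapping that finds a value for an output field wins. This is one plausible reading of the rule, not the component's code: MetadataMapSketch, apply, and the plain Map used for the Apache Tika metadata are hypothetical, and only four of the default mappings are included.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ordered metadata mapping; not the Aspire implementation.
final class MetadataMapSketch {
    // A subset of the default mappings, in table order: {Tika field, output field}.
    static final List<String[]> MAPPINGS = List.of(
        new String[] {"DC.title", "title"},
        new String[] {"DC.description", "description"},
        new String[] {"title", "title"},
        new String[] {"description", "description"});

    // Applies the mappings in order; an earlier mapping is never overwritten.
    static Map<String, String> apply(Map<String, String> tikaMetadata) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String[] mapping : MAPPINGS) {
            String value = tikaMetadata.get(mapping[0]);
            if (value != null) {
                output.putIfAbsent(mapping[1], value);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> tika = Map.of("DC.title", "Dublin Core title", "title", "HTML title");
        System.out.println(apply(tika).get("title"));
    }
}
```

With both DC.title and title present, the sketch keeps the Dublin Core value because its mapping appears first.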

Example Configurations

Simple

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text" />

Complex

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text">
    <extractTimeout>60000</extractTimeout>
    <tikaConfig>config/my-tika-config.xml</tikaConfig>
    <!-- note that all of the default mappings are included automatically -->
    <metadataMap>
      <map from="Keywords" to="newKeywordsField"/>
      <map from="description" to="newDescriptionField"/>
    </metadataMap>
  </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  .
  .
  .
  <title source="ExtractTextStage/title">Search Technologies: The independent enterprise search experts</title> 
  <description source="ExtractTextStage/description">We advise companies on enterprise search product selection, and we provide efficient, cost effective implementation and integration services. Search Technologies are the expert in the search space</description> 
  <language source="ExtractTextStage/language">en</language> 
  <extension source="ExtractTextStage">
    <field name="Content-Language">en</field> 
    <field name="Content-Encoding">ISO-8859-1</field> 
    <field name="resourceName">http://www.searchtechnologies.com</field> 
  </extension>
  <content source="ExtractTextStage">
  <![CDATA[ 
	Home
	About Us	Executive Team
	Careers
	Solutions	Enterprise Search Consulting
	Microsoft/Fast ESP Services
	Google Search Appliance
	Open Source Enterprise Search
	SharePoint Search
	RetrievalWare Support & Migration
	Image Management
  .
  .
  .
  ]]>
  </content>
</doc>
