The Extract Text component reads the document content from an input stream or a byte array and uses Apache Tika to extract the text and metadata from it.
Extract Text | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-extract-text |
subType | default |
Inputs | object['contentStream'] or object['contentBytes'] (the content to be parsed) |
Outputs | <content> holds the text content extracted from the document. Other metadata output is available and is mapped with the metadata mapper (see below). |
The method for determining which Apache Tika text extractor to use is as follows:
If the extraction takes too long, the thread doing the extraction is forcibly stopped using Thread.interrupt(). If that does not work after three retries, Thread.stop() is used with a forced NullPointerException.
This was done because Apache Tika contains bugs which cause infinite loops for some types of HTML documents.
The situation should be carefully monitored, because if too many of these exceptions occur, Aspire could become unstable.
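The timeout behavior described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Aspire's actual implementation: the class `TimedExtraction`, the method `runWithTimeout`, and the `graceMs` wait between interrupts are hypothetical names and parameters chosen for the example. The final Thread.stop() step is left out on purpose, because Thread.stop() is deprecated and unsupported on recent JDKs.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Aspire): runs a task with a time limit. */
class TimedExtraction {

    /**
     * Runs the task on a worker thread and waits up to timeoutMs for it.
     * On timeout, interrupts the worker up to maxInterrupts times, waiting
     * graceMs after each attempt. Returns null if no result was produced.
     * timeoutMs and graceMs must be greater than 0 (join(0) waits forever).
     */
    static String runWithTimeout(Supplier<String> task, long timeoutMs,
                                 int maxInterrupts, long graceMs) {
        AtomicReference<String> result = new AtomicReference<>();
        Thread worker = new Thread(() -> result.set(task.get()), "extract-text-worker");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMs);
            // Cooperative cancellation: retry the interrupt a few times.
            for (int attempt = 0; attempt < maxInterrupts && worker.isAlive(); attempt++) {
                worker.interrupt();
                worker.join(graceMs);
            }
        } catch (InterruptedException e) {
            // The caller itself was interrupted: cancel the worker and restore the flag.
            worker.interrupt();
            Thread.currentThread().interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        // A fast task finishes within the limit.
        System.out.println(runWithTimeout(() -> "extracted text", 1000, 3, 100));

        // A slow task that honors interruption is cancelled and yields null.
        Supplier<String> slow = () -> {
            try {
                Thread.sleep(60_000);
                return "too late";
            } catch (InterruptedException e) {
                return null;
            }
        };
        System.out.println(runWithTimeout(slow, 50, 3, 100));
    }
}
```

A task stuck in a tight loop that never checks its interrupt flag would survive all three interrupts in this sketch, which is the case the forced stop in the description above is meant to handle.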
Element | Type | Default | Description |
---|---|---|---|
extractTimeout | int | 180000 (3 minutes) | Maximum time to wait (in ms) for the text extraction (Maximum value 180000000 equals 3000 minutes). |
maxCharacters | int/String | 1,000,000 | Maximum number of characters to extract from the document. If the limit is exceeded, the extracted text will be truncated. Use a numeric value or "unlimited". |
metadataMap | see below | | Standard Metadata Mapper configuration. See below. |
wordPerTag | boolean | true | (2.2.1 Release) Whether words are split per XML/HTML tag. |
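The maxCharacters setting accepts either a number or the word "unlimited". The following is a minimal sketch of how such a value could be interpreted and applied. The class `MaxCharacters`, its methods, the `UNLIMITED` sentinel, and the case-insensitive matching are assumptions made for illustration and are not taken from Aspire's code.

```java
/** Hypothetical helper (not Aspire's own code) for the maxCharacters setting. */
class MaxCharacters {

    /** Sentinel meaning "do not truncate"; an assumption of this sketch. */
    static final int UNLIMITED = -1;

    /** Parses a setting that is either a non-negative number or "unlimited". */
    static int parse(String setting) {
        String value = setting.trim();
        if (value.equalsIgnoreCase("unlimited")) {
            return UNLIMITED;
        }
        int limit = Integer.parseInt(value); // throws NumberFormatException on bad input
        if (limit < 0) {
            throw new IllegalArgumentException("maxCharacters must not be negative: " + limit);
        }
        return limit;
    }

    /**
     * Truncates the text to the limit. Lengths are counted in Java chars
     * (UTF-16 code units), so a cut may fall inside a surrogate pair.
     */
    static String truncate(String text, int limit) {
        if (limit == UNLIMITED || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }

    public static void main(String[] args) {
        System.out.println(truncate("Hello, world", parse("5")));         // Hello
        System.out.println(truncate("Hello, world", parse("unlimited"))); // Hello, world
    }
}
```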
The Extract Text stage extracts a large number of additional metadata fields which can be mapped to fields in the AspireObject XML. See Apache Tika for a description of all of the metadata fields extracted. The following fields are mapped by default. Note that the mappings are applied in the order listed: when two mappings could fill the same output field, the one higher in the list is preferred.
For more information on metadata formats used below, see:
Apache Tika Field | Default Output Field | Description |
---|---|---|
DC.title | title | The Dublin Core title of the document. |
DC.date | modificationDate | Dublin Core last modified date, converted to ISO 8601 format. |
DC.description | description | Dublin Core description. |
DC.contributor | contributor | Dublin Core contributor name. |
title | title | Any other title (such as PDF title or HTML title) that Apache Tika is able to extract from the document. |
created | creationDateTime | The creation date-time, typically from PDF properties. Formatted as an ISO 8601 date-time. |
Last-Modified | modificationDateTime | Last modified date-time. Formatted as an ISO 8601 date-time. |
Author | author | Author name. Typically from PDF document properties. |
Content-Type | contentType | The HTTP formatted content type of the document. |
description | description | Document description from either HTML meta fields or PDF document properties. |
language | language | Auto-detected language code from Apache Tika. |
Keywords | keywords | Keywords field from either HTML meta fields or PDF document properties. |
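The ordering rule for the default mappings (for example, DC.title is listed before title, and both write to the title field) can be sketched as a "first match wins" loop. This is an illustration only: the class `MetadataMappingSketch`, the method `apply`, and the sample values are hypothetical, and the real metadata mapper may work differently.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of ordered metadata mapping; not Aspire's implementation. */
class MetadataMappingSketch {

    /**
     * Applies rules (Tika field -> output field) in their listed order.
     * An output field is filled by the first rule that finds a value, so
     * a rule higher in the list is preferred over a lower one.
     */
    static Map<String, String> apply(Map<String, String> orderedRules,
                                     Map<String, String> tikaMetadata) {
        Map<String, String> output = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : orderedRules.entrySet()) {
            String value = tikaMetadata.get(rule.getKey());
            if (value != null) {
                output.putIfAbsent(rule.getValue(), value); // keep the earlier match
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which the rules are listed.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("DC.title", "title");
        rules.put("title", "title");
        rules.put("description", "description");

        // Sample values invented for the example.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("DC.title", "Annual Report");
        metadata.put("title", "report_final.pdf");
        metadata.put("description", "Summary");

        // Prints {title=Annual Report, description=Summary}
        System.out.println(apply(rules, metadata));
    }
}
```

If the document carried no DC.title value, the same rules would fall back to the plain title field.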
<component name="ExtractText" subType="default" factoryName="aspire-extract-text" />
<component name="ExtractText" subType="default" factoryName="aspire-extract-text">
  <extractTimeout>60000</extractTimeout>
  <tikaConfig>config/my-tika-config.xml</tikaConfig>
  <!-- note that all of the default mappings are included automatically -->
  <metadataMap>
    <map from="Keywords" to="newKeywordsField"/>
    <map from="description" to="newDescriptionField"/>
  </metadataMap>
</component>
<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl>
  . . .
  <title source="ExtractTextStage/title">Search Technologies: The independent enterprise search experts</title>
  <description source="ExtractTextStage/description">We advise companies on enterprise search product selection, and we provide efficient, cost effective implementation and integration services. Search Technologies are the expert in the search space</description>
  <language source="ExtractTextStage/language">en</language>
  <extension source="ExtractTextStage">
    <field name="Content-Language">en</field>
    <field name="Content-Encoding">ISO-8859-1</field>
    <field name="resourceName">http://www.searchtechnologies.com</field>
  </extension>
  <content source="ExtractTextStage">
    <![CDATA[ Home About Us Executive Team Careers Solutions Enterprise Search Consulting Microsoft/Fast ESP Services Google Search Appliance Open Source Enterprise Search SharePoint Search RetrievalWare Support & Migration Image Management . . . ]]>
  </content>
</doc>