The Fetch URL stage opens an InputStream to the given URL, which downstream pipeline stages can then read.

  • The Fetch URL stage prefixes the URL with "http://" if the URL does not specify a protocol.
  • The Fetch URL stage can also open streams on local files using the file:// protocol, which is a common use.
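
As an illustration of the first point, a minimal input document might look like the sketch below. The <doc> and <fetchUrl> elements follow the Example Output on this page; the file path mentioned in the comment is hypothetical.

```xml
<doc>
  <!-- No protocol is given, so the stage would fetch http://www.searchtechnologies.com -->
  <fetchUrl>www.searchtechnologies.com</fetchUrl>
  <!-- For a local file, the value could instead be a file:// URL,
       for example file:///data/docs/report.pdf (hypothetical path) -->
</doc>
```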

Other Outputs

The <httpResponse> element contains the HTTP response information when the URL uses the "http://" protocol. For example:

<httpResponse code="200" source="FetchURLStage">OK</httpResponse>
Fetch URL
Factory Name | com.searchtechnologies.aspire:aspire-fetch-url
subType | default
Inputs | <fetchUrl> (if it exists) or <url>
Outputs | Sets the job variable 'contentStream' to an InputStream object ready to be read. HTTP headers are mapped to output elements using the metadata mapper (see below), and an <httpResponse> element is also created for HTTP URLs.

Configuration

Element | Type | Default | Description
connectionTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) when establishing a connection to the remote server.
readTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) for reading the entire content.
enableRedirects | boolean | true | Sets whether HTTP redirects (responses with a 3xx status code) are followed automatically by the Fetch URL stage.
maxBytes | int | 10485760 (10 MB) | Specifies the maximum number of bytes to read from the URL.
method | String | GET | The HTTP method used to send CGI parameters to the remote server, either POST or GET. This element is ignored for non-HTTP connections. With POST, all query parameters are detached from the URL and submitted as the request body.
requestProperties | | see below | Configurable HTTP request properties, such as "user-agent".
fetchUrlPath | String | doc/fetchUrl | The path to the element in the AspireObject that contains the URL to fetch.
metadataMap | | see below | Standard Metadata Mapper configuration. See below.
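
The elements in the table above could be combined as in the following sketch. The values are illustrative only, and the path doc/sourceUrl is a hypothetical element name chosen to mirror the format of the default, doc/fetchUrl.

```xml
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
  <!-- Give up if no connection is established within 30 seconds -->
  <connectionTimeout>30000</connectionTimeout>
  <!-- Give up if the content cannot be read within 2 minutes -->
  <readTimeout>120000</readTimeout>
  <!-- Do not follow 3xx redirects automatically -->
  <enableRedirects>false</enableRedirects>
  <!-- Read at most 1 MB (1048576 bytes) from the URL -->
  <maxBytes>1048576</maxBytes>
  <!-- Send the query parameters as the request body instead of in the URL -->
  <method>POST</method>
  <!-- Hypothetical path: read the URL from doc/sourceUrl rather than doc/fetchUrl -->
  <fetchUrlPath>doc/sourceUrl</fetchUrlPath>
</component>
```

With this configuration, a URL such as http://host/search?q=aspire would be requested as http://host/search, with q=aspire sent in the POST body, as the description of the method element states.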

Metadata Mapper Configuration

The fetch URL stage contains a large number of additional metadata fields which can be mapped to fields in the AspireObject XML.

Field | Default Output Field | Description
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com").
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com").
mimeType | mimeType | The MIME type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding | encoding | The character encoding returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8".
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the request was automatically redirected to another URL, this element provides the new URL.
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK".
all other HTTP headers | - | Any HTTP header is available to be mapped by the metadata mapper. All headers that are not mapped are automatically put into the <extension> area.
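
As a sketch of how unmapped fields could be given their own output elements, the fragment below maps the status field and one HTTP header. The <map> syntax is taken from the Complex example on this page; the target names httpStatus and contentLength are hypothetical.

```xml
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
  <metadataMap>
    <!-- Write the status line to a hypothetical httpStatus element -->
    <map from="status" to="httpStatus"/>
    <!-- Write the Content-Length header to a hypothetical contentLength element -->
    <map from="Content-Length" to="contentLength"/>
  </metadataMap>
</component>
```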

Request Properties Configuration

Some URLs are accessible only when certain request properties (such as "user-agent") are set.

Field/Attribute | Description
requestProperty{@name} | Name of the request property.
requestProperty | Value of the request property.
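
A minimal sketch of the two fields in use follows. The user-agent value is the one used in the Complex example on this page; the accept-language property is a hypothetical addition that assumes the stage accepts any HTTP request header name.

```xml
<requestProperties>
  <!-- The name attribute is the property name; the element text is its value -->
  <requestProperty name="user-agent">aspire/fetchUrl 1.2</requestProperty>
  <!-- Hypothetical second property, assuming arbitrary header names are accepted -->
  <requestProperty name="accept-language">en-US</requestProperty>
</requestProperties>
```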

Example Configurations

Simple

 <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />

Complex

 <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
   <connectionTimeout>1000</connectionTimeout>
   <maxBytes>1000000</maxBytes>
   <!-- note that all of the default mappings are included automatically -->
   <metadataMap>
     <map from="Cache-Control" to="cacheControl"/>
     <map from="Server" to="server"/>
     <map from="Set-Cookie" to="cookieValue"/>
   </metadataMap>
   <requestProperties>
     <requestProperty name="user-agent">aspire/fetchUrl 1.2</requestProperty>
   </requestProperties>
 </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse> 
  <protocol source="FetchURLStage/protocol">http</protocol> 
  <host source="FetchURLStage/host">www.searchtechnologies.com</host> 
  <mimeType source="FetchURLStage/mimeType">text/html</mimeType> 
  <encoding source="FetchURLStage/encoding">utf-8</encoding> 
  <extension source="FetchURLStage">
    <field name="status">HTTP/1.1 200 OK</field> 
    <field name="Date">Wed, 02 Dec 2009 15:05:24 GMT</field> 
    <field name="Server">Microsoft-IIS/6.0</field> 
    <field name="X-Powered-By">ASP.NET</field> 
    <field name="X-AspNet-Version">2.0.50727</field> 
    <field name="Set-Cookie">ASP.NET_SessionId=vkprqxru0k2gjy455o1j31u3; path=/; HttpOnly</field> 
    <field name="Cache-Control">private</field> 
    <field name="Content-Type">text/html; charset=utf-8</field> 
    <field name="Content-Length">9584</field> 
  </extension>
  .
  .
  .
</doc>

Note: The actual document content is sent down the pipeline as a Java InputStream, which can be accessed from the job object via the "contentStream" variable.
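
For comparison, the sketch below shows how the same response might look with the Complex configuration, where three extra headers are mapped. This output is an assumption inferred from the mapping rules on this page, not captured from a real run; in particular, the form of the source attribute on the mapped elements is guessed from the default fields above.

```xml
<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl>
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse>
  <!-- Assumed result of the three extra mappings in the Complex example -->
  <cacheControl source="FetchURLStage/Cache-Control">private</cacheControl>
  <server source="FetchURLStage/Server">Microsoft-IIS/6.0</server>
  <cookieValue source="FetchURLStage/Set-Cookie">ASP.NET_SessionId=vkprqxru0k2gjy455o1j31u3; path=/; HttpOnly</cookieValue>
  <extension source="FetchURLStage">
    <!-- Headers without a mapping would remain in the extension area -->
    <field name="status">HTTP/1.1 200 OK</field>
    <field name="Content-Type">text/html; charset=utf-8</field>
  </extension>
</doc>
```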

Fetching via https://

If you're fetching files via https://, you may encounter errors if the server's certificate is not signed by a certificate authority that the JVM trusts (for example, a self-signed certificate).

Typically you'll see an exception such as:

  AspireException(aspire.FetchURLStage.other-connect-error): com.searchtechnologies.aspire.services.AspireException: Unable to open connection to URL "https://server:8443/path/file". (component='/fastProxyServer/queryPipeManager/queryFast', componentFactory='aspire-fetch-url')
        at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:284)
        at com.searchtechnologies.aspire.application.JobHandler.runNested(JobHandler.java:114)
        at com.searchtechnologies.aspire.application.JobHandler.run(JobHandler.java:52)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
  Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
        at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
        at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
        at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:185)
        ... 5 more
  Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(Unknown Source)
        at sun.security.validator.PKIXValidator.engineValidate(Unknown Source)
        at sun.security.validator.Validator.validate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.validate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
        ... 17 more
  Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(Unknown Source)
        at java.security.cert.CertPathBuilder.build(Unknown Source)
        ... 23 more

To fetch these pages, import the certificate from the offending server into a keystore and then configure Aspire to use that keystore.

Using a web browser

  • Export the certificate (IE):
    • Connect to https://my.domain.com
    • Go to Tools > Internet Options > Content > Certificates > Intermediate Certification Authorities [or "Trusted Root Certification Authorities"]
    • Choose whichever certificate is needed
    • Click “Export…”, then “Next>”
    • Select “DER encoded binary X.509 (.CER)”
    • Name the file myDomain.cer [change the name as applicable]
    • Select “Finish”
  • Install the certificate:
    keytool -import -alias myDomain -file myDomain.cer -trustcacerts -keystore C:\path\myKeystore
  • Configure Felix to use the keystore by adding the following to the java command line:
    -Djavax.net.ssl.trustStore=C:\path\myKeystore

For example:

    java -Djavax.net.ssl.trustStore=C:\path\myKeystore -Xmx250m -Xms250m %FELIX_CONFIG_PROP% "%ASPIRE_HOME_PROP%" -jar bin\felix.jar