The Fetch URL stage opens an InputStream to the given URL, which can then be read by downstream pipeline stages.
The <httpResponse> element contains the HTTP response information when the URL uses the HTTP protocol ("http://"). For example:
<httpResponse code="200" source="FetchURLStage">OK</httpResponse>
Fetch URL | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-fetch-url |
subType | default |
Inputs | <fetchUrl> (if it exists) or <url> |
Outputs | Sets job variable 'contentStream', to an InputStream object ready to be read. HTTP headers are mapped to output elements using the metadata mapper (see below), and an element, <httpResponse> is also created for HTTP URLs. |
Element | Type | Default | Description |
---|---|---|---|
connectionTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) for establishing a connection to the remote server. |
readTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) for reading the entire content. |
enableRedirects | boolean | true | Sets whether HTTP redirects (responses with a 3xx status code) are followed automatically by the Fetch URL stage. See here for details. |
maxBytes | int | 10485760 (10 MB) | Specifies the maximum number of bytes to read from the URL. |
method | String | GET | The method for posting CGI parameters to the remote server. Either POST or GET. This configuration element is ignored for non-HTTP connections. In the POST case, all query parameters will be detached from the URL and submitted as the request body. |
requestProperties | see below | | Configurable HTTP request properties, such as "user-agent". |
fetchUrlPath | String | doc/fetchUrl | The path to the element in the AspireObject that contains the URL to fetch. |
metadataMap | see below | | Standard Metadata Mapper configuration. See below. |
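The POST behaviour described for the method element can be pictured with a small sketch. The class and method names (PostSplitSketch, splitForPost, preparePost) are hypothetical and not part of Aspire, and the form content type is an assumption, since the headers the stage actually sends are not documented here. The timeout and redirect values are the defaults from the table above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class PostSplitSketch {

    /**
     * Splits a URL into { urlWithoutQuery, queryString }.
     * The query string is empty when the URL has no '?' part.
     */
    public static String[] splitForPost(String url) {
        // Drop any fragment first; fragments are never sent to the server.
        int hash = url.indexOf('#');
        String noFragment = hash >= 0 ? url.substring(0, hash) : url;
        int q = noFragment.indexOf('?');
        if (q < 0) {
            return new String[] { noFragment, "" };
        }
        return new String[] { noFragment.substring(0, q), noFragment.substring(q + 1) };
    }

    /**
     * Prepares (but does not open) a POST connection to the URL with its
     * query parameters detached. Nothing is sent until the caller connects.
     */
    public static HttpURLConnection preparePost(String url) {
        String[] parts = splitForPost(url);
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(parts[0]).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);                 // a request body will be written
            conn.setConnectTimeout(600000);         // connectionTimeout default (ms)
            conn.setReadTimeout(600000);            // readTimeout default (ms)
            conn.setInstanceFollowRedirects(true);  // enableRedirects default
            // Assumed content type for the detached query parameters.
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would write the second element returned by splitForPost to the connection's output stream as the request body.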
The Fetch URL stage provides a number of additional metadata fields that can be mapped to fields in the AspireObject XML.
Field | Default Output Field | Description |
---|---|---|
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com"). |
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com"). |
mimeType | mimeType | The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html". |
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8". |
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" http header, if it exists. Formatted as an ISO 8601 date-time. |
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" http header, if it exists. Formatted as an ISO 8601 date-time. |
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL. |
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK". |
all other HTTP headers | - | Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area. |
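As an illustration of how the mimeType, encoding, and date fields relate to the raw HTTP headers, the sketch below derives them from header values. It is not the stage's implementation: the class HeaderFieldsSketch and its methods are hypothetical, and the exact ISO 8601 variant produced by Aspire is an assumption.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class HeaderFieldsSketch {

    /** Returns the media type part of a Content-Type header, e.g. "text/html". */
    public static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        return (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
    }

    /** Returns the charset parameter of a Content-Type header, or null if absent. */
    public static String encoding(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip optional quotes around the charset value.
                return p.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return null;
    }

    /** Converts an HTTP date (as used by "expires" and "last-modified") to ISO 8601. */
    public static String toIso8601(String httpDate) {
        return ZonedDateTime.parse(httpDate, DateTimeFormatter.RFC_1123_DATE_TIME)
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

For the header value "text/html; charset=utf-8", mimeType returns "text/html" and encoding returns "utf-8".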
Some URLs are accessible only when certain request properties (for example, "user-agent") are set.
Field/Attribute | Description |
---|---|
requestProperty{@name} | Name of the request property. |
requestProperty | Value of the request property. |
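In plain Java terms, each requestProperty corresponds to a name/value pair set on the connection before it is opened. The following is a minimal sketch under that assumption; RequestPropertiesSketch and prepare are hypothetical names, and the stage's own code may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Map;

public class RequestPropertiesSketch {

    /**
     * Creates a connection object for the URL and applies each request
     * property to it. No network traffic occurs until the caller connects.
     */
    public static URLConnection prepare(String url, Map<String, String> requestProperties) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            for (Map.Entry<String, String> property : requestProperties.entrySet()) {
                // name attribute -> header name, element text -> header value
                conn.setRequestProperty(property.getKey(), property.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling prepare with a map containing "user-agent" and "aspire/fetchUrl 1.2" yields a connection that would send that value as its User-Agent header.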
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
  <connectionTimeout>1000</connectionTimeout>
  <maxBytes>1000000</maxBytes>
  <!-- note that all of the default mappings are included automatically -->
  <metadataMap>
    <map from="Cache-Control" to="cacheControl"/>
    <map from="Server" to="server"/>
    <map from="Set-Cookie" to="cookieValue"/>
  </metadataMap>
  <requestProperties>
    <requestProperty name="user-agent">aspire/fetchUrl 1.2</requestProperty>
  </requestProperties>
</component>
<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl>
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse>
  <protocol source="FetchURLStage/protocol">http</protocol>
  <host source="FetchURLStage/host">www.searchtechnologies.com</host>
  <mimeType source="FetchURLStage/mimeType">text/html</mimeType>
  <encoding source="FetchURLStage/encoding">utf-8</encoding>
  <extension source="FetchURLStage">
    <field name="status">HTTP/1.1 200 OK</field>
    <field name="Date">Wed, 02 Dec 2009 15:05:24 GMT</field>
    <field name="Server">Microsoft-IIS/6.0</field>
    <field name="X-Powered-By">ASP.NET</field>
    <field name="X-AspNet-Version">2.0.50727</field>
    <field name="Set-Cookie">ASP.NET_SessionId=vkprqxru0k2gjy455o1j31u3; path=/; HttpOnly</field>
    <field name="Cache-Control">private</field>
    <field name="Content-Type">text/html; charset=utf-8</field>
    <field name="Content-Length">9584</field>
  </extension>
  . . .
</doc>
Note: The actual document content is sent down the pipeline as a Java InputStream, which can be accessed from the job object via the "contentStream" variable.
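A downstream consumer typically reads this stream once and should bound how much it buffers. The sketch below shows one way to do that with a byte limit comparable to maxBytes. ContentStreamSketch and readUpTo are hypothetical names, the step that retrieves the stream from the job is deliberately left out because it depends on the Aspire API, and stopping silently at the limit is an assumption rather than a description of how the stage enforces maxBytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ContentStreamSketch {

    /**
     * Reads at most maxBytes from the stream and returns them.
     * Reading stops at end of stream or once the limit is reached.
     */
    public static byte[] readUpTo(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // end of stream
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The caller remains responsible for closing the stream once reading is finished.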
If you're fetching files via https://, you may encounter issues if the certificate the server is using is not properly signed.
Typically you'll see an exception such as:
AspireException(aspire.FetchURLStage.other-connect-error): com.searchtechnologies.aspire.services.AspireException: Unable to open connection to URL "https://server:8443/path/file". (component='/fastProxyServer/queryPipeManager/queryFast', componentFactory='aspire-fetch-url')
    at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:284)
    at com.searchtechnologies.aspire.application.JobHandler.runNested(JobHandler.java:114)
    at com.searchtechnologies.aspire.application.JobHandler.run(JobHandler.java:52)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(Unknown Source)
    at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
    at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
    at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(Unknown Source)
    at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(Unknown Source)
    at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Unknown Source)
    at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
    at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:185)
    ... 5 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.validator.PKIXValidator.doBuild(Unknown Source)
    at sun.security.validator.PKIXValidator.engineValidate(Unknown Source)
    at sun.security.validator.Validator.validate(Unknown Source)
    at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.validate(Unknown Source)
    at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
    at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
    ... 17 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(Unknown Source)
    at java.security.cert.CertPathBuilder.build(Unknown Source)
    ... 23 more
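To fetch these pages, import the certificate from the offending server into a keystore with keytool (the first command below), and then configure Aspire to use that keystore by adding the javax.net.ssl.trustStore system property to the Java command line (the second line below).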
keytool -import -alias myDomain -file myDomain.cer -trustcacerts -keystore \path\myKeystore
-Djavax.net.ssl.trustStore=C:\path\myKeystore
For example:
java -Djavax.net.ssl.trustStore=C:\path\myKeystore -Xmx250m -Xms250m %FELIX_CONFIG_PROP% "%ASPIRE_HOME_PROP%" -jar bin\felix.jar