Configuration
Element | Type | Default | Description |
---|
postUrl | string | http://localhost:8983/solr/update | A semicolon-separated list of the URLs to which the resulting, transformed XML file will be posted. The exact URL to post to is selected either by round robin or by a deterministic algorithm, depending on the configuration. |
deterministic | boolean | false | If true, the server URL selected will be deterministic based on the document id, and the same document will always be sent to the same host. If false, the server URL will be selected by round robin and the document will be sent to the first available host. |
idPath | String | /doc/fetchUrl | When deterministic is true, obtain the document id from the given XPath of the document. |
broadcast | boolean | false | If true, the document will be sent to all configured servers (one after the other). |
postString | string | | Post this string instead of the transform of the incoming Aspire Object document. When specified, the document XML is neither transformed nor posted; only the postString is sent to the remote server. Can be useful for doing things like a SOLR commit or some other notification. |
postXsl | string | | The XSL transform file to be used to transform the incoming Aspire Object document into the XML which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that this file path is resolved relative to ASPIRE_HOME. |
postJsonTransform | string | | The JSON transform file to be used to transform the incoming Aspire Object document into the JSON which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that this file path is resolved relative to ASPIRE_HOME. See Post JSON. |
debugOutFile | string | | For debugging purposes, the transformed document will also be appended to this output file in addition to being sent to the remote server. One debug-out file is created for every open thread, and old debug-out files are not overwritten (files are written with a "-###" suffix after the main file name but before the ".xxx" extension). You may therefore want to store debug-out files in the data or logs directory to avoid cluttering up your filesystem. A semicolon-separated list of debug-out files is allowed, for example <debugOutFile>/mnt/out1/debug-out.txt;/mnt/out2/debug-out.txt</debugOutFile>. New debug-out files are then automatically round-robined across the different locations. If the locations are on separate hard drives, I/O performance can be vastly improved. |
okayResponse | string | <int name="status">0</int> | The response from the remote server will be scanned for this string. If it is found, the posting is assumed to be successful. If it is not found, an error will be returned for the document. |
readTimeout | int | 300000ms (5 minutes) | Specifies the read timeout for the HTTP connection: how long to wait for the server to respond. |
connectionTimeout | int | 60000ms (1 minute) | The connection timeout, how long to wait for the servers to respond to a connection request. |
maxTries | int | 3 | The number of times to try submitting the document. If a submit fails, a different URL will be selected from the list of URLs (if the number of URLs is greater than one) and the document will be resubmitted. |
retryWait | int | 1000ms (1 second) | The time to sleep between submissions of failed documents. |
multipartForm | parentTag | | Posts the transformed output as a multipart form, with name/value pairs written to the POST stream (as HTTP headers) before the content itself. Name/value pairs are specified with <multipartForm><param> elements. |
multipartForm/@contentParam | String | data | Specifies the parameter name to hold the content of the transformed output of the job's document - i.e. the content of the XML or JSON itself. |
multipartForm/param and param/@name | String | | Holds parameter name/value pairs of form data to send to the HTTP server. Note that values are specified as the content of the <param> tag, and can be encoded using substitutions from the Simple Templates method. |
saxonProcessor | boolean | false | Set to true if you want to use SAXON processors (which support XSLT 2.0). |
authentication | String | "none" | Indicates what type of authentication must be used: "none" for no authentication, or "basic" for Basic authentication with Base64 encoding. |
username | String | null | Sets the username, in case authentication is needed. |
password | String | null | Sets the password, in case authentication is needed. |
contentType | String | null | Sets the content-type header to be sent to the server. Ignored when sending multi-part forms. Example: "text/xml". |
requestProperties | | see below | Configurable HTTP request properties, such as "user-agent". |
maxResults | Integer | 2^(31)-1 (Maximum integer allowed) | (Index dump) How many documents can be fetched by the search engine for the same query |
pageSize | Integer | 10000 | (Index dump) How many documents to fetch per page |
urlField | String | displayUrl | (Index dump) Field used to store the url in the search engine |
idField | String | id | (Index dump) Field used to store the id in the search engine. |
timestampField | String | submitTS | (Index dump) The name of the timestamp field holding the index timestamp of every document. |
Request Properties Configuration
Especially useful for setting custom or specialized security tokens before a post operation.
Field/Attribute | Description |
---|
requestProperty{@name} | Name of the request property. |
requestProperty | Value of the request property. |
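A sketch of how static request properties might look in the component configuration. It assumes the configuration uses the same requestProperties/requestProperty nesting that is read from the AspireObject (see Dynamic Request Properties); the property names and values are hypothetical placeholders:

```xml
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <!-- Assumed nesting; the property names and values below are placeholders -->
  <requestProperties>
    <requestProperty name="user-agent">aspire-post-http</requestProperty>
    <requestProperty name="X-Auth-Token">MY_SECURITY_TOKEN</requestProperty>
  </requestProperties>
</component>
```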
XSLT Transform
Example Configuration
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:8983/solr/update</postUrl>
<postXsl>config/xsl/aspireToSolr.xsl</postXsl>
<okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
</component>
Example Configuration with Basic Authentication
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:8983/solr/update</postUrl>
<postXsl>config/xsl/aspireToSolr.xsl</postXsl>
<okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
<authentication>basic</authentication>
<username>admin</username>
<password>pass</password>
</component>
Multi Server Configuration Example
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
<postUrl>http://server1:8983/solr/update;
http://server2:8983/solr/update;
http://server3:8983/solr/update;
http://server4:8983/solr/update
</postUrl>
<postXsl>config/xsl/aspireToSolr.xsl</postXsl>
<okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
<maxTries>10</maxTries>
<retryWait>10000</retryWait>
</component>
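To keep each document on the same host across updates, the multi-server list can be combined with the deterministic and idPath elements from the configuration table. A minimal sketch, assuming the document id lives at the default /doc/fetchUrl path; the server names are the same placeholders used above:

```xml
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Select the host from the document id instead of round robin -->
  <deterministic>true</deterministic>
  <!-- XPath of the document id (this is the documented default) -->
  <idPath>/doc/fetchUrl</idPath>
</component>
```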
Commit Example
Example using the <postString> method to automatically post a commit command to SOLR. This is typically used after the "WaitForSubJobs" component in the parent pipeline.
<component name="SolrCommit" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:8983/solr/update</postUrl>
<contentType>text/xml</contentType>
<postString>
<![CDATA[
<commit/>
]]>
</postString>
</component>
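When several Solr servers are listed in postUrl, a commit may need to reach each of them. A minimal sketch, assuming the broadcast option from the configuration table can be combined with postString (the source does not confirm this combination); server1 and server2 are placeholder host names:

```xml
<!-- Sketch only: assumes broadcast and postString can be used together -->
<component name="SolrCommitAll" subType="default" factoryName="aspire-post-http">
  <!-- server1 and server2 are placeholder hosts -->
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
  <contentType>text/xml</contentType>
  <!-- Send the commit to every configured server, one after the other -->
  <broadcast>true</broadcast>
  <postString><![CDATA[<commit/>]]></postString>
</component>
```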
Multi-Part Form Example
Useful for writing to the Google Search Appliance (GSA).
<component name="MultipartPost" subType="default" factoryName="aspire-post-http">
<postUrl>${gsaFeedUrl}</postUrl>
<fixedLengthOutput>true</fixedLengthOutput>
<postXsl>config/xsl/aspireToGSA.xsl</postXsl>
<multipartForm contentParam="data">
<param name="datasource">This is the datasource value</param>
<param name="feedtype">{XML:feedValue}</param>
</multipartForm>
</component>
Batching XML
All you need to do is set up the Branch Handler to use batching. All jobs that reach the stage (for example, jobs that come from a sub job extractor) will be ready to be batched when they get to PostHTTP.
Once you have set up the branch handler, set these two additional parameters on PostHTTP:
Element | Type | Default | Description |
---|
postHeader | String | empty string | String that is written to the stream before the first document is received. This consists of the required feed headers for the target search engine or application. |
postFooter | String | empty string | String that is written to the stream after closing the batch. This consists of the required feed footer for the target search engine or application. |
Example
Sample application XML configuration:
<?xml version="1.0" encoding="UTF-8"?>
<application name="FeedOneExample">
<components>
<component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
<components>
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="WaitForSubJobs" subType="waitForSubJobs" factoryName="aspire-tools"/>
<component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
<branches>
<branch event="onSubJob" pipelineManager="."
pipeline="subJobs-process"
batching="true"
batchSize="1000"
batchTimeout="1000"
simultaneousBatches="2" />
</branches>
</component>
<component name="PostToGSA" subType="default" factoryName="aspire-post-http">
<postUrl>${gsaFeedUrl}</postUrl>
<postXsl>config/xsl/aspireToGSA.xsl</postXsl>
<okayResponse>Success</okayResponse>
<debugOutFile>data/debug/gsa.txt</debugOutFile>
<postHeader><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd"><gsafeed><header><datasource>Macomb_poc_feed</datasource><feedtype>incremental</feedtype></header><group action="add">]]></postHeader>
<postFooter><![CDATA[</group></gsafeed>]]></postFooter>
<multipartForm contentParam="data">
<param name="datasource">Macomb_poc_feed</param>
<param name="feedtype">incremental</param>
</multipartForm>
</component>
</components>
<pipelines>
<pipeline name="doc-process" default="true">
<stages>
<stage component="XMLSubJobExtract" />
</stages>
</pipeline>
<pipeline name="subJobs-process">
<stages>
<stage component="PostToGSA" />
</stages>
</pipeline>
</pipelines>
</component>
</components>
</application>
Dynamic Request Properties
Besides setting static request properties at initialization through the component's configuration (see above), request properties can be set dynamically through the AspireObject of the incoming job.
Request properties in the AspireObject are read from the structure:
<doc>
<requestProperties>
<requestProperty name="PROP_NAME">PROP_VALUE</requestProperty>
<requestProperty name="PROP_NAME2">PROP_VALUE2</requestProperty>
...
</requestProperties>
</doc>
When working with Aspire Batches, the values from the first job of the batch are the ones used to open the connection with the server.
Feed to the GSA example (configuration and XSL)
This section provides an example of a PostXml configuration and an XSL template that may be useful for feeding documents to the GSA.
Configure aspire-post-http to use the multipart form option. This will prevent the GSA from rejecting the feed because of wrong encodings. Example:
<component name="PostAddOrUpdateToGSA" subType="default" factoryName="aspire-post-http">
<config>
<postUrl>${gsaFeedUrl}</postUrl>
<postXsl>config/xsl/aspireToGSA.xsl</postXsl>
<okayResponse>Success</okayResponse>
<debugOutFile>data/debug/gsa.out</debugOutFile>
<multipartForm contentParam="data">
<param name="datasource">ppp_feed</param>
<param name="feedtype">incremental</param>
</multipartForm>
</config>
</component>
Notes:
- Check the GSA documentation for more details on the feed XML format.
- okayResponse is configured to match the response from the GSA.
- debugOutFile is optional; in that file you can see the transformed documents (as they are sent to the GSA).
- multipartForm->datasource: Your feed will show with this name under the “Crawl and Index->Feeds” section on GSA administration.
- multipartForm->feedtype: The GSA will keep versions of the same document.
- Elasticsearch uses a separate URL for batch indexing (see Elasticsearch Bulk Indexing below). See the Elasticsearch documentation for further information about how to use bulk indexing.
- Elasticsearch returns HTTP 201 for simple indexing (not bulk).
Common issues (and how they are normally fixed)
- There is a feed for each document. Is this normal? Yes, this is normal. This is the simplest feed scenario: one document per feed XML sent to the GSA. If you want more than one document per feed, check out the section above to see how to enable batching in the branch handler. There is a noticeable performance improvement when batches of documents are sent to the GSA.
- PostHTTP returns error 401 for any feed. Check that the Aspire machine is on the List of Trusted IP Addresses in “Crawl and Index->Feeds” on GSA administration, or that Trust feeds from all IP addresses is selected.
- GSA rejects the feed without even opening it. Check that the fed URLs match at least one expression of Start Crawling from the Following URLs in “Crawl and Index->Crawl URLs” on GSA administration.
- GSA feed shows “Missing or invalid content” or “Content attribute not properly specified” error messages. This is likely a problem with the XSLT. Check that there are no <meta name="someField" content=""> entries in the generated feed XML (in newer versions of the GSA you can download the feed from GSA administration). This commonly happens because the XSL is extracting a field that is empty or does not exist on the AspireDocument (AspireObject).
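One way to avoid empty meta entries is to guard each one with an xsl:if test. The following is a minimal XSLT 1.0 sketch, not the aspireToGSA.xsl shipped with the component; it assumes the AspireObject root is doc, and the title field and the record attributes are hypothetical placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Assumes the AspireObject root element is doc -->
  <xsl:template match="/doc">
    <record url="{fetchUrl}" mimetype="text/html">
      <metadata>
        <!-- title is a hypothetical field; emit the meta only when it has text -->
        <xsl:if test="normalize-space(title) != ''">
          <meta name="title" content="{normalize-space(title)}"/>
        </xsl:if>
      </metadata>
    </record>
  </xsl:template>
</xsl:stylesheet>
```

With this guard, a document that lacks the field (or has only whitespace in it) produces no meta element at all instead of one with an empty content attribute.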
JSON Transform
JSON transformers are Groovy scripts that use JSON Builders to create JSON objects from AspireObjects as input. For further information about JSON transformer syntax, see Post JSON.
Elasticsearch Indexing Example
Single document indexing (without batching).
<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:9200/testindex/testtype/</postUrl>
<postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
<okayResponse><![CDATA[{"ok":true]]></okayResponse>
</component>
Elasticsearch Indexing with Basic Authentication Example
Single document indexing (without batching).
<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:9200/testindex/testtype/</postUrl>
<postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
<okayResponse><![CDATA[{"ok":true]]></okayResponse>
<authentication>basic</authentication>
<username>admin</username>
<password>pass</password>
</component>
Elasticsearch Bulk Indexing
<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
<postUrl>http://localhost:9200/_bulk</postUrl>
<postJsonTransform>config/json/aspireToElasticsearchBulk.groovy</postJsonTransform>
<okayResponse><![CDATA[{"took":]]></okayResponse>
</component>