The Post HTTP stage applies an XSLT or JSON transform to the input AspireObject and then posts the resulting XML or JSON to a remote RESTful interface via HTTP. When several servers are configured, the target server is selected from the list using either round robin or a deterministic selection, so that a single document is only sent to one server.
Post HTTP | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-post-http |
subType | default |
Inputs | An Aspire Object with the metadata of each document to be posted. |
Outputs | Transformed XML, JSON, or plain text, which is then posted to a remote server via a RESTful interface. |
Element | Type | Default | Description | |
---|---|---|---|---|
postUrl | string | http://localhost:8983/solr/update | A semicolon-separated list of the URLs to which the transformed output will be posted. The URL for each post is selected by round robin or by a deterministic algorithm, depending on the deterministic setting. | |
deterministic | boolean | false | If true, the server URL is selected deterministically based on the document id, so the same document is always sent to the same host. If false, the server URL is selected by round robin and the document is sent to the first available host. | |
idPath | String | /doc/fetchUrl | When deterministic selection is used, the document id is obtained from this XPath in the document. | |
broadcast | boolean | false | If true, the document will be sent to all configured servers (one after the other). | |
postString | string | | Post this string instead of the transform of the incoming Aspire Object document. When specified, the document XML is neither transformed nor posted; only the postString is sent to the remote server. This can be useful for things like a SOLR commit or some other notification. | |
postXsl | string | | The XSL transform file used to transform the incoming Aspire Object document into the XML that is posted to the remote server. There is no default. It must be specified unless postString is specified. The file path is resolved relative to ASPIRE_HOME. | |
postJsonTransform | string | | The JSON transform file used to transform the incoming Aspire Object document into the JSON that is posted to the remote server. There is no default. It must be specified unless postString is specified. The file path is resolved relative to ASPIRE_HOME. See Post JSON. | |
debugOutFile | string | | For debugging purposes, the transformed document is also appended to this output file, in addition to being posted to the remote server. One debug-out file is created for every open thread, and old debug-out files are not overwritten: each file is written with a "-###" suffix inserted after the main file name but before the ".xxx" extension. You may therefore want to store debug-out files in the data or logs directory to avoid cluttering up your filesystem. A semicolon-separated list of debug-out files is also allowed, for example <debugOutFile>/mnt/out1/debug-out.txt;/mnt/out2/debug-out.txt</debugOutFile>. New debug-out files are then distributed round robin across the different locations. If the locations are on separate hard drives, output I/O performance can improve substantially. | |
okayResponse | string | <int name="status">0</int> | The response from the remote server is scanned for this string. If it is found, the post is assumed to have succeeded; otherwise, an error is returned for the document. | |
readTimeout | int | 300000ms (5 minutes) | The read timeout for the HTTP connection: how long to wait for the server to respond. | |
connectionTimeout | int | 60000ms (1 minute) | The connection timeout: how long to wait for a server to respond to a connection request. | |
maxTries | int | 3 | The number of times to try submitting the document. If a submit fails, a different URL is selected from the list of URLs (if the number of URLs is greater than one) and the document is resubmitted. | |
retryWait | int | 1000ms (1 second) | The time to sleep between submissions of failed documents. | |
multipartForm | parentTag | | Posts the transformed output as a multipart form, with name/value pairs written to the POST stream (as HTTP headers) before the content itself. Name/value pairs are specified with <multipartForm><param> elements. | |
multipartForm/@contentParam | String | data | Specifies the parameter name to hold the content of the transformed output of the job's document - i.e. the content of the XML or JSON itself. | |
multipartForm/param and param/@name | String | | Holds parameter name/value pairs of form data to send to the HTTP server. Values are specified as the content of the <param> tag and can be encoded using substitutions from the Simple Templates method. | |
saxonProcessor | boolean | false | Set to true to use the Saxon processor, which supports XSLT 2.0. | |
authentication | String | "none" | The type of authentication to use: "none" for no authentication, or "basic" for Basic authentication (Base64-encoded credentials). | |
username | String | null | The username to use when authentication is required. | |
password | String | null | The password to use when authentication is required. | |
contentType | String | null | Sets the content-type header to be sent to server. Ignored when sending multi-part forms. Example: "text/xml". | |
requestProperties | see below | | Configurable HTTP request properties, such as "user-agent". | |
maxResults | Integer | 2^(31)-1 (Maximum integer allowed) | (Index dump) The maximum number of documents that can be fetched from the search engine for the same query. | |
pageSize | Integer | 10000 | (Index dump) How many documents to fetch per page. | |
urlField | String | displayUrl | (Index dump) Field used to store the URL in the search engine. | |
idField | String | id | (Index dump) Field used to store the id in the search engine. | |
timestampField | String | submitTS | (Index dump) The name of the timestamp field holding the index timestamp of every document. |
Request properties are especially useful for setting custom or specialized security tokens before a post operation.
Field/Attribute | Description |
---|---|
requestProperty/@name | Name of the request property. |
requestProperty | Value of the request property. |
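
As a minimal sketch of static request properties in the component configuration, the fragment below assumes that a `<requestProperties>` element wraps one `<requestProperty>` per property, as the table above describes. The property names and values (`user-agent`, `X-Auth-Token`, and the token itself) are placeholders chosen for illustration, not values taken from Aspire.

```xml
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Assumed layout: one requestProperty element per HTTP request property -->
  <requestProperties>
    <requestProperty name="user-agent">aspire-post-http</requestProperty>
    <!-- Hypothetical security token header; use whatever your server expects -->
    <requestProperty name="X-Auth-Token">REPLACE_WITH_TOKEN</requestProperty>
  </requestProperties>
</component>
```
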
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
</component>

<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <authentication>basic</authentication>
  <username>admin</username>
  <password>pass</password>
</component>

<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://server1:8983/solr/update;
    http://server2:8983/solr/update;
    http://server3:8983/solr/update;
    http://server4:8983/solr/update
  </postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <maxTries>10</maxTries>
  <retryWait>10000</retryWait>
</component>
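
The multi-server example above relies on the default round robin selection. A deterministic variant is sketched here, under the assumption that `deterministic` and `idPath` are set directly in the component configuration like the other elements in the parameter table; with it, a given document would always be routed to the same server. The server names are placeholders, and `/doc/fetchUrl` is simply the documented default for `idPath` written out explicitly.

```xml
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Select the server from the document id instead of round robin -->
  <deterministic>true</deterministic>
  <!-- XPath from which the document id is read (documented default) -->
  <idPath>/doc/fetchUrl</idPath>
</component>
```

This kind of setup may be preferable when updates to the same document must reach the same host, for example when each server holds a separate index.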
Example using the <postString> method to automatically post a commit command to SOLR. This is typically used after the "WaitForSubJobs" component in the parent pipeline.
<component name="SolrCommit" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <contentType>text/xml</contentType>
  <postString>
    <![CDATA[ <commit/> ]]>
  </postString>
</component>
Multipart form posts are useful for writing to the Google Search Appliance (GSA).
<component name="MultipartPost" subType="default" factoryName="aspire-post-http">
  <postUrl>${gsaFeedUrl}</postUrl>
  <fixedLengthOutput>true</fixedLengthOutput>
  <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
  <multipartForm contentParam="data">
    <param name="datasource">This is the datasource value</param>
    <param name="feedtype">{XML:feedValue}</param>
  </multipartForm>
</component>
To post documents in batches, all you need to do is set up the Branch Handler to use batching. All jobs that reach the stage (for example, jobs that come from a sub job extractor) will be ready to be batched when they get to PostHTTP.
Once the branch handler is set up, set these two additional parameters on PostHTTP:
Element | Type | Default | Description |
---|---|---|---|
postHeader | String | empty string | String that is written to the stream before the first document is received. This consists of the required feed headers for the target search engine or application. |
postFooter | String | empty string | String that is written to the stream after the batch is closed. This consists of the required feed footer for the target search engine or application. |
Sample application XML configuration:
<?xml version="1.0" encoding="UTF-8"?>
<application name="FeedOneExample">
  <components>
    <component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
      <components>
        <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
        <component name="WaitForSubJobs" subType="waitForSubJobs" factoryName="aspire-tools"/>
        <component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
          <branches>
            <branch event="onSubJob" pipelineManager="." pipeline="subJobs-process" batching="true" batchSize="1000" batchTimeout="1000" simultaneousBatches="2" />
          </branches>
        </component>
        <component name="PostToGSA" subType="default" factoryName="aspire-post-http">
          <postUrl>${gsaFeedUrl}</postUrl>
          <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
          <okayResponse>Success</okayResponse>
          <debugOutFile>data/debug/gsa.txt</debugOutFile>
          <postHeader><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd"><gsafeed><header><datasource>Macomb_poc_feed</datasource><feedtype>incremental</feedtype></header><group action="add">]]></postHeader>
          <postFooter><![CDATA[</group></gsafeed>]]></postFooter>
          <multipartForm contentParam="data">
            <param name="datasource">Macomb_poc_feed</param>
            <param name="feedtype">incremental</param>
          </multipartForm>
        </component>
      </components>
      <pipelines>
        <pipeline name="doc-process" default="true">
          <stages>
            <stage component="XMLSubJobExtract" />
          </stages>
        </pipeline>
        <pipeline name="subJobs-process">
          <stages>
            <stage component="PostToGSA" />
          </stages>
        </pipeline>
      </pipelines>
    </component>
  </components>
</application>
Besides being set statically at initialization through the component's configuration (see above), request properties can be set dynamically through the AspireObject of the incoming job.
Request properties in the AspireObject are read from the structure:
<doc>
  <requestProperties>
    <requestProperty name="PROP_NAME">PROP_VALUE</requestProperty>
    <requestProperty name="PROP_NAME2">PROP_VALUE2</requestProperty>
    ...
  </requestProperties>
</doc>
When working with Aspire batches, the request properties of the first job in the batch are the ones used to open the connection to the server.
This section provides an example of a PostXml configuration and an XSL template that may be useful for feeding documents to the GSA.
Configure aspire-post-http to use the multipart form option. This prevents the GSA from rejecting the feed because of wrong encodings. Example:
<component name="PostAddOrUpdateToGSA" subType="default" factoryName="aspire-post-http">
  <config>
    <postUrl>${gsaFeedUrl}</postUrl>
    <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
    <okayResponse>Success</okayResponse>
    <debugOutFile>data/debug/gsa.out</debugOutFile>
    <multipartForm contentParam="data">
      <param name="datasource">ppp_feed</param>
      <param name="feedtype">incremental</param>
    </multipartForm>
  </config>
</component>
JSON transformers are Groovy scripts that use JSON builders to create JSON objects from input AspireObjects. See Post JSON for further information about JSON transformer syntax.
Single document indexing (without batching).
<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
  <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
  <okayResponse><![CDATA[{"ok":true]]></okayResponse>
</component>
Single document indexing (without batching), using basic authentication.
<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
  <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
  <okayResponse><![CDATA[{"ok":true]]></okayResponse>
  <authentication>basic</authentication>
  <username>admin</username>
  <password>pass</password>
</component>

<component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:9200/_bulk</postUrl>
  <postJsonTransform>config/json/aspireToElasticsearchBulk.groovy</postJsonTransform>
  <okayResponse><![CDATA[{"took":]]></okayResponse>
</component>