
The Post HTTP stage applies an XSLT or a JSON transform to the input AspireObject and then posts the resulting transformed XML or JSON to a remote RESTful interface via HTTP. The server is selected from the list using either round robin or deterministic selection, so that a single document is only sent to one server.

  • If the remote server returns something other than HTTP 200 or 201 (in the HTTP headers), the stage retries the post, sleeping one second between tries (by default) and failing the document after a set number of tries. If the component is configured to use round robin, retries for failures will attempt to pick a different server.
  • If the remote server returns HTTP 200 or 201, but the okay response string cannot be found in the response data, the document will be flagged as an error and should be quarantined.
  • The component also has an option for posting a fixed literal string to the remote server for doing other types of notifications, and supports job batching; see Branch Handler.
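
As a minimal sketch of the deterministic selection described above, the configuration below lists three servers and turns on deterministic mode. The component name and the host names (server1 to server3) are placeholders; deterministic and idPath are the elements described in the Configuration table, and /doc/fetchUrl is the documented default for idPath.

```xml
<component name="PostHTTPDeterministic" subType="default" factoryName="aspire-post-http">
  <!-- Placeholder hosts: replace with your own servers -->
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update;http://server3:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Always send a given document id to the same host -->
  <deterministic>true</deterministic>
  <!-- XPath in the AspireObject that holds the document id -->
  <idPath>/doc/fetchUrl</idPath>
</component>
```

Leaving deterministic at its default of false keeps the round robin behavior, in which a failed post is retried against a different server where possible.
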
Post HTTP
Factory Name | com.searchtechnologies.aspire:aspire-post-http
subType | default
Inputs | An Aspire Object with the metadata of each document to be posted.
Outputs | A transformed XML, JSON or just plain text, which is then posted to a remote server via a RESTful interface.

Configuration


Element | Type | Default | Description
postUrl | string | http://localhost:8983/solr/update | A semicolon-separated list of the URLs to which the transformed document will be posted. The URL for each post is selected by round robin or deterministically, depending on the configuration.
deterministic | boolean | false | If true, the server URL is selected deterministically from the document id, so the same document is always sent to the same host. If false, the server URL is selected by round robin and the document is sent to the first available host.
idPath | String | /doc/fetchUrl | When using deterministic selection, the XPath in the document from which the document id is obtained.
broadcast | boolean | false | If true, the document is sent to all configured servers (one after the other).
postString | string | | Post this string instead of the transform of the incoming Aspire Object document. When specified, the document XML is neither transformed nor posted; only the postString is sent to the remote server. Can be useful for doing things like a SOLR commit or some other notification.
postXsl | string | | The XSL transform file used to transform the incoming Aspire Object document into the XML which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that this file path is resolved relative to ASPIRE_HOME.
postJsonTransform | string | | The JSON transform file used to transform the incoming Aspire Object document into the JSON which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that this file path is resolved relative to ASPIRE_HOME. See Post JSON.
debugOutFile | string | | For debugging purposes, the transformed document is appended to this output file as well as being sent to the remote server. Multiple debug-out files are created, one for every open thread. Old debug-out files are not overwritten (files are written with a "-###" suffix after the main file name but before the ".xxx" extension), so you may want to store debug-out files in the data or logs directory to avoid cluttering up your filesystem. A semicolon-separated list of debug-out files is allowed, for example <debugOutFile>/mnt/out1/debug-out.txt;/mnt/out2/debug-out.txt</debugOutFile>. New debug-out files are then automatically distributed round robin across the different locations. If the locations are on separate hard drives, IO output performance can be vastly improved.
okayResponse | string | <int name="status">0</int> | The response from the remote server is scanned for this string. If it is found, the post is assumed to be successful. If it is not found, an error is returned for the document.
readTimeout | int | 300000ms (5 minutes) | The read timeout for the HTTP connection: how long to wait for the server to respond.
connectionTimeout | int | 60000ms (1 minute) | The connection timeout: how long to wait for the servers to respond to a connection request.
maxTries | int | 3 | The number of times to try submitting the document. If a submit fails, a different URL is selected from the list of URLs (if the number of URLs is greater than one) and the document is resubmitted.
retryWait | int | 1000ms (1 second) | The time to sleep between submissions of failed documents.
multipartForm | parentTag | | Posts the transformed output as a multipart form, with name/value pairs written to the POST stream (as HTTP headers) before the content itself. Name/value pairs are specified with <multipartForm><param> elements.
multipartForm/@contentParam | String | data | The name of the parameter that holds the transformed output of the job's document, i.e. the content of the XML or JSON itself.
multipartForm/param and param/@name | String | | Holds parameter name/value pairs of form data to send to the HTTP server. Note that values are specified as the content of the <param> tag, and can be encoded using substitutions from the Simple Templates method.
saxonProcessor | boolean | false | Set to true if you want to use SAXON processors (which support XSLT 2.0).
authentication | String | "none" | The type of authentication to use: "none" for no authentication, "basic" for Basic authentication with Base64 encoding.
username | String | null | The username, if authentication is needed.
password | String | null | The password, if authentication is needed.
contentType | String | null | The content-type header to be sent to the server. Ignored when sending multipart forms. Example: "text/xml".
requestProperties | | see below | Configurable HTTP request properties, such as "user-agent".
maxResults | Integer | 2^31-1 (maximum integer value) | (Index dump) The maximum number of documents that can be fetched from the search engine for the same query.
pageSize | Integer | 10000 | (Index dump) The number of documents to fetch per page.
urlField | String | displayUrl | (Index dump) The field used to store the URL in the search engine.
idField | String | id | (Index dump) The field used to store the id in the search engine.
timestampField | String | submitTS | (Index dump) The name of the timestamp field holding the index timestamp of every document.
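
As an illustration of how several of these elements combine, the sketch below broadcasts every document to two servers and tightens the timeouts and retry settings. The component name, host names and numeric values are placeholders chosen for the example, not recommended settings.

```xml
<component name="PostHTTPBroadcast" subType="default" factoryName="aspire-post-http">
  <!-- Placeholder hosts: with broadcast enabled, each document goes to both -->
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <broadcast>true</broadcast>
  <!-- Timeouts and wait times are given in milliseconds -->
  <connectionTimeout>10000</connectionTimeout>
  <readTimeout>60000</readTimeout>
  <maxTries>5</maxTries>
  <retryWait>2000</retryWait>
</component>
```
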

Request Properties Configuration

Especially useful for setting custom or specialized security tokens before a post operation.

Field/Attribute | Description
requestProperty/@name | Name of the request property.
requestProperty | Value of the request property.
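
A minimal sketch of a static request property configuration follows. It assumes that the requestProperty elements are nested inside a requestProperties element, mirroring the AspireObject structure shown under Dynamic Request Properties; the property names and values (X-Auth-Token, MY_TOKEN_VALUE, MyAspireClient) are hypothetical.

```xml
<component name="PostHTTPWithToken" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Assumed nesting: one requestProperty per HTTP request property -->
  <requestProperties>
    <!-- Hypothetical security token header -->
    <requestProperty name="X-Auth-Token">MY_TOKEN_VALUE</requestProperty>
    <requestProperty name="user-agent">MyAspireClient</requestProperty>
  </requestProperties>
</component>
```
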

XSLT Transform


Example Configuration

 <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
 </component>


Example Configuration with Basic Authentication

 <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <authentication>basic</authentication>
    <username>admin</username>
    <password>pass</password>
 </component> 


Multi Server Configuration Example

 <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://server1:8983/solr/update;
             http://server2:8983/solr/update;
             http://server3:8983/solr/update;
             http://server4:8983/solr/update
    </postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <maxTries>10</maxTries>
    <retryWait>10000</retryWait>
 </component> 


Commit Example

Example using the <postString> method to automatically post a commit command to SOLR. This is typically used after the "WaitForSubJobs" component in the parent pipeline.

  <component name="SolrCommit" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <contentType>text/xml</contentType>
    <postString>
      <![CDATA[
        <commit/>
      ]]>
    </postString>
  </component>

 

Multi-Part Form Example

Useful for writing to the Google Search Appliance (GSA).

  <component name="MultipartPost" subType="default" factoryName="aspire-post-http">
    <postUrl>${gsaFeedUrl}</postUrl>
    <fixedLengthOutput>true</fixedLengthOutput>
    <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
    <multipartForm contentParam="data">
      <param name="datasource">This is the datasource value</param>
      <param name="feedtype">{XML:feedValue}</param>
    </multipartForm>
  </component>

 

Batching XML


To batch documents, set up the Branch Handler to use batching. All jobs that reach the stage (for example, jobs coming from a sub job extractor) will then be ready to be batched when they get to PostHTTP.

Once the branch handler is set up, set these two additional parameters on PostHTTP:

Element | Type | Default | Description
postHeader | String | empty string | String that is written to the stream before the first document is received. This consists of the required feed headers for the target search engine or application.
postFooter | String | empty string | String that is written to the stream after closing the batch. This consists of the required feed footer for the target search engine or application.
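
As a small sketch of the idea, a batch posted to Solr could wrap the per-document output in a single <add> element. This assumes the branch handler has batching enabled and that the XSL emits only one <doc> element per job; the component name and the file name aspireToSolrDoc.xsl are hypothetical.

```xml
<component name="PostBatchToSolr" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <!-- Hypothetical XSL that writes a bare <doc> element for each job -->
  <postXsl>config/xsl/aspireToSolrDoc.xsl</postXsl>
  <contentType>text/xml</contentType>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Written once before the first document of the batch -->
  <postHeader><![CDATA[<add>]]></postHeader>
  <!-- Written once when the batch is closed -->
  <postFooter><![CDATA[</add>]]></postFooter>
</component>
```
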

Example

Sample application XML configuration:

<?xml version="1.0" encoding="UTF-8"?>
<application name="FeedOneExample">
  
  <components>
    <component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
      <components>
        <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />

        <component name="WaitForSubJobs" subType="waitForSubJobs" factoryName="aspire-tools"/>

        <component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
          <branches>
          <branch event="onSubJob" pipelineManager="." 
                  pipeline="subJobs-process" 
                  batching="true"
                  batchSize="1000"
                  batchTimeout="1000"
                  simultaneousBatches="2"  />
          </branches>
        </component>

        <component name="PostToGSA" subType="default" factoryName="aspire-post-http">
          <postUrl>${gsaFeedUrl}</postUrl>
          <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
          <okayResponse>Success</okayResponse>
          <debugOutFile>data/debug/gsa.txt</debugOutFile>
          <postHeader><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd"><gsafeed><header><datasource>Macomb_poc_feed</datasource><feedtype>incremental</feedtype></header><group action="add">]]></postHeader>
          <postFooter><![CDATA[</group></gsafeed>]]></postFooter>
          <multipartForm contentParam="data">
            <param name="datasource">Macomb_poc_feed</param>
            <param name="feedtype">incremental</param>
          </multipartForm>
        </component>
          
      </components>
      <pipelines>

        <pipeline name="doc-process" default="true">
          <stages>
            <stage component="XMLSubJobExtract" />
          </stages>
        </pipeline>
		  
        <pipeline name="subJobs-process">
          <stages>
            <stage component="PostToGSA" />			  
          </stages>
        </pipeline>
      </pipelines>
    </component>
  </components>
</application>

 

Dynamic Request Properties


Besides setting static request properties at initialization through the component's configuration (see above), request properties can be set dynamically through the AspireObject of the incoming job.

Request properties in the AspireObject are read from the structure:

  <doc>
    <requestProperties>
      <requestProperty name="PROP_NAME">PROP_VALUE</requestProperty>
      <requestProperty name="PROP_NAME2">PROP_VALUE2</requestProperty>
      ...
    </requestProperties>
  </doc>

When working with Aspire batches, the values from the first job of the batch are the ones used to open the connection with the server.

Feed to the GSA example (configuration and XSL)

This section provides an example of a Post HTTP configuration and an XSL template that may be useful for feeding documents to the GSA.

Configure aspire-post-http to use the multipart form option. This prevents the GSA from rejecting the feed because of wrong encodings. Example:

<component name="PostAddOrUpdateToGSA" subType="default" factoryName="aspire-post-http">
  <config>
    <postUrl>${gsaFeedUrl}</postUrl>
    <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
    <okayResponse>Success</okayResponse>
    <debugOutFile>data/debug/gsa.out</debugOutFile>
    <multipartForm contentParam="data">
      <param name="datasource">ppp_feed</param>
      <param name="feedtype">incremental</param>
    </multipartForm>
  </config>
</component>
  • The value of ${gsaFeedUrl} is http://10.10.40.46:19900/xmlfeed, where 10.10.40.46 is the GSA IP address.
  • The default XSL transformation file can be found in AspireToGSA.xsl. Check GSA Feeds Guide and GSA Connector Developer's Guide for more details on the feed XML format.
  • okayResponse is configured to match the response from GSA.
  • debugOutFile is optional. In that file you can see the transformed documents (as they are sent to the GSA).
  • multipartForm->datasource: Your feed will show with this name under the “Crawl and Index->Feeds” section of GSA administration.
  • multipartForm->feedtype: The GSA will keep versions of the same document.
  • Elasticsearch uses http://localhost:9200/_bulk as the URL for batch indexing. See http://www.elasticsearch.org/guide/reference/api/bulk.html for further information about how to use bulk indexing.
  • Elasticsearch returns HTTP 201 for simple indexing (not bulk).

 

Common issues (and how they are normally fixed)

  • There is a feed for each document. Is this normal? Yes, this is normal. This is the simplest feed scenario: one document per feed XML sent to the GSA. If you want more than one document per feed, check out the section above to see how to enable batching in the branch handler. There is a noticeable performance improvement when batches of documents are sent to the GSA.
  • PostHTTP returns error 401 for any feed. Check that the Aspire machine is in the List of Trusted IP Addresses in “Crawl and Index->Feeds” on GSA administration, or that Trust feeds from all IP addresses is selected.
  • GSA rejects the feed without even opening it. Check that the fed URLs match at least one expression of Start Crawling from the Following URLs in “Crawl and Index->Crawl URLs” on GSA administration.
  • GSA feed shows “Missing or invalid content” or “Content attribute not properly specified” error messages. This is likely a problem with the XSLT. Check that there are no <meta name="someField" content=""> entries in the generated feed XML (in newer versions of the GSA you can download the feed from GSA administration). This commonly happens because the XSL is extracting a field that is empty or does not exist in the AspireDocument (AspireObject).

JSON Transform


JSON transformers are Groovy scripts that use JSON Builders to create JSON objects from input AspireObjects. See Post JSON for further information about JSON transformer syntax.

Elasticsearch Indexing Example

Single document indexing (without batching).

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"ok":true]]></okayResponse>
  </component>

Elasticsearch Indexing with Basic Authentication Example

Single document indexing (without batching).

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"ok":true]]></okayResponse>
    <authentication>basic</authentication>
    <username>admin</username>
    <password>pass</password>
  </component>

Elasticsearch Bulk Indexing


Elasticsearch bulk indexing (batching)

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/_bulk</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearchBulk.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"took":]]></okayResponse>
  </component>
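
Newer Elasticsearch releases are strict about the Content-Type header, so it may be necessary to set contentType explicitly when posting to the bulk endpoint. The variation below is a sketch under that assumption; the component name and the application/x-ndjson value are not taken from this page and should be checked against the Elasticsearch version in use.

```xml
<component name="PostElasticsearchBulk" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:9200/_bulk</postUrl>
  <postJsonTransform>config/json/aspireToElasticsearchBulk.groovy</postJsonTransform>
  <!-- Assumed value: newline-delimited JSON, as expected by the bulk endpoint -->
  <contentType>application/x-ndjson</contentType>
  <okayResponse><![CDATA[{"took":]]></okayResponse>
</component>
```
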