This stage is primarily intended to split an XML file containing a list of records and then process each record individually, as a sub-job on its own pipeline. XML files of this kind are commonly produced by relational databases.

It takes a job that contains a data stream and assumes that the stream represents an XML document. It then parses the document to extract sub-job documents.

The XML sub-job handler does not load the entire XML into an in-memory DOM object. Instead, it uses a SAX handler to read from the input stream and publishes each XML record to the sub-job pipeline as soon as it is found. Because the document is streamed rather than held in memory, the stage is fast and its memory requirements stay low.

XML Sub Job Extractor
| Property | Value |
|---|---|
| Factory Name | com.searchtechnologies.aspire:aspire-xml-files |
| subType | xmlSubJobExtractor |
| Inputs | object['contentStream'] containing a data stream. object['contentBytes'] a stream to the XML to process. NOTE: A previous job (typically FetchURL) must have opened the input stream. |
| Outputs | For each sub-job, an AspireObject containing the XML of the individual XML record, published to the configured sub-job pipeline manager. |

Configuration


| Element | Type | Default | Description |
|---|---|---|---|
| branches | | None | The configuration of the pipeline to publish to. See below. |
| maxSubJobs | integer | 0 (= all) | The maximum number of sub-jobs to generate. If the input XML file contains more records than this, the remainder are ignored. |
| characterEncoding | String | UTF-8 | The character encoding of the XML file to be read, if not UTF-8. |
| rootNode | String | None | The node that contains the sub-jobs to publish. If not specified, the root node of the entire XML tree is used. The value should be in path format, for example /results/hits, which publishes as sub-jobs all of the child elements that occur within the <results>/<hits> tag. Note: This is not an XPath, just a path that names a node within the XML hierarchy. It should start with a /, which is added if missing. |
| cleanse | boolean | true | Set to true to clean non-readable characters (e.g. ASCII code 15) from the XML content. |
| honorDTD | boolean | false | Set to true to fetch the XML document's DTD. |
| batchJobs (2.1 Release) | boolean | false | Set to true if you want the extractor to create a batch for the input job and add child jobs to that batch. |
| maxBatchBytes (2.1 Release) | long | unlimited | When using batching, limits the data in a batch based on the XML representation of the AspireObject published with the job. Once the amount of data added to the batch exceeds the given value, the batch is closed and a new batch is created. The limit may be specified in the form 1, 1b, 1k, 1kb, 1m, 1mb, 1g, 1gb. |
| maxBatchBytesJSON (2.1 Release) | boolean | false | When limiting the batch size, use the JSON representation of the AspireObject published with the job to calculate the batch size. |
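To illustrate how the elements listed above might be combined, the following sketch sets rootNode, maxSubJobs, and characterEncoding on the component. It assumes that each option is written as a child element of <component>, as the "Element" column suggests. The specific values (100 and ISO-8859-1) are invented for illustration.

```xml
<!-- Sketch only: option values are illustrative assumptions -->
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <!-- Publish each child element of <results>/<hits> as a sub-job -->
  <rootNode>/results/hits</rootNode>
  <!-- Stop after 100 sub-jobs; the default of 0 means all -->
  <maxSubJobs>100</maxSubJobs>
  <!-- Only needed when the input file is not UTF-8 -->
  <characterEncoding>ISO-8859-1</characterEncoding>
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>
```

With this rootNode, an input file whose records sit inside <results><hits> (for example, as hypothetical <hit> elements) would yield one sub-job per <hit>, up to the configured maximum.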

Branch Configuration

The XML Sub Job Extractor publishes documents using the branch manager. It publishes using the "onSubJob" event, so you must include a <branches> element for this event in the configuration in order to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

| Element | Type | Description |
|---|---|---|
| branches/branch/@event | String | The event to configure. Should always be "onSubJob". |
| branches/branch/@pipelineManager | String | The URL of the pipeline manager to publish to. Can be relative. |
| branches/branch/@pipeline | String | The name of the pipeline to publish to. |

Example Configuration


<!-- Use FetchUrl to open a stream on the object, which is then used by XMLSubJobExtractor -->
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>
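A variant of the configuration above with batching enabled might look like the following. This is a sketch that assumes the batching options are written as child elements of <component>. The 10mb limit is an arbitrary illustrative value in one of the documented forms.

```xml
<!-- Sketch only: the 10mb limit is an arbitrary illustrative value -->
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <!-- Create a batch for the input job and add child jobs to it (2.1 Release) -->
  <batchJobs>true</batchJobs>
  <!-- Close the batch and start a new one once the data added exceeds this limit -->
  <maxBatchBytes>10mb</maxBatchBytes>
  <!-- Measure batch size using the JSON representation instead of the XML one -->
  <maxBatchBytesJSON>true</maxBatchBytesJSON>
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>
```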

Input Example


Typical input XML documents look like this:

<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<record id="32">
  <first_name>george</first_name>
  <last_name>washington</last_name>
  <description>Founding father #1</description>
</record>
<record id="33">
  <first_name>thomas</first_name>
  <last_name>jefferson</last_name>
  <description>Founding father #2</description>
</record>
</records>


Note: Every child of the root element is processed as a separate sub-job document (the rootNode configuration parameter determines which element is treated as the root). Therefore, the above XML will produce the following sub-job XML documents:

Sub Job #1

<doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
     <!-- NOTE: a copy of the parent metadata is stored here -->
  </parent>
  <first_name>george</first_name>
  <last_name>washington</last_name>
  <description>Founding father #1</description>
</doc>

Sub Job #2

<doc id="33" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
     <!-- NOTE: a copy of the parent metadata is stored here -->
  </parent>
  <first_name>thomas</first_name>
  <last_name>jefferson</last_name>
  <description>Founding father #2</description>
</doc>

 

The top-level <doc> element of each sub-job contains all of the attributes of the parent XML element from the original file (i.e. the attributes on the <records> element above) as well as all of the attributes of the sub-record itself (the attributes of each <record> element in turn). This should ensure that transforms on XML files that rely on nested namespaces work properly.
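The namespace case can be illustrated with a small hypothetical input in which a prefix is declared only on the root element but used inside each record. The element names, the dc prefix, and the record content are invented for illustration. The second snippet shows the sub-job shape expected if the attribute-copying rule described above is applied.

```xml
<!-- Hypothetical input: the dc prefix is declared only on the root element -->
<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="7">
    <dc:title>Common Sense</dc:title>
  </record>
</records>

<!-- Expected shape of the resulting sub-job under the rule above -->
<doc id="7" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlTag="record">
  <parent>
    <!-- a copy of the parent metadata is stored here -->
  </parent>
  <dc:title>Common Sense</dc:title>
</doc>
```

Because the xmlns:dc declaration is copied onto <doc>, the dc:title element remains resolvable when the sub-job is processed on its own.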
