...

Instead, it reads data from the input stream and outputs XML records to the sub-job pipeline as they are found using a SAX handler. This makes it very fast with very low memory requirements.

XML Sub Job Extractor
Factory Name	com.searchtechnologies.aspire:aspire-xml-files
subType	xmlSubJobExtractor
Inputs	object['contentStream'], containing a data stream; object['contentBytes'], a stream to the XML to process. NOTE: A previous job (typically FetchURL) must have opened the input stream.
Outputs	For each sub-job, an AspireObject containing the XML of the individual XML record, published to the configured sub-job pipeline manager.

Configuration

Element	Type	Default	Description
branches		None	The configuration of the pipeline to publish to. See below.
maxSubJobs	integer	0 (= all)	The maximum number of sub-jobs to generate. Any further records in the input XML file are ignored.
characterEncoding	String	UTF-8	The character encoding of the XML file to be read, if not UTF-8.
rootNode	String	None	The node which contains the sub-jobs to publish. If not specified, the root node of the entire XML tree is used. The value is written in path format, for example /results/hits, which publishes as sub-jobs all of the child elements that occur within the <results>/<hits> tag. Note: This is not an XPath, just a path that names a node within the XML hierarchy. It should start with a /; one is added if missing.
cleanse	boolean	true	Set to true to strip non-readable characters (e.g. ASCII code 15) from the XML content.
honorDTD	boolean	false	Set to true to fetch the DTD referenced by the XML.
batchJobs (2.1 Release)	boolean	false	Set to true if you want the extractor to create a batch for the input job and add child jobs to that batch.
maxBatchBytes (2.1 Release)	long	unlimited	When using batching, limits the data in a batch based on the XML representation of the AspireObject published with the job. Once the amount of data added to the batch exceeds the given value, the batch is closed and a new batch is created. The limit may be specified in the form 1, 1b, 1k, 1kb, 1m, 1mb, 1g, 1gb.
maxBatchBytesJSON (2.1 Release)	boolean	false	When limiting the batch size, use the JSON representation of the AspireObject published with the job to calculate the batch size.
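
As a rough sketch of how several of these options might be combined, the component below sets a root node, a sub-job cap, a non-default encoding, and batching. It assumes that each element in the table is written as a child element of <component> under the same name; the values are purely illustrative, and the branch target is copied from the example configuration on this page.

```xml
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <!-- Assumption: each option is a child element named as in the table above -->
  <!-- Publish only the children of <results>/<hits> (a plain path, not an XPath) -->
  <rootNode>/results/hits</rootNode>
  <!-- Stop after 1000 sub-jobs; 0 (the default) publishes all -->
  <maxSubJobs>1000</maxSubJobs>
  <!-- Only needed when the input is not UTF-8 -->
  <characterEncoding>ISO-8859-1</characterEncoding>
  <!-- Batching options (2.1 Release): close a batch once it exceeds 10mb -->
  <batchJobs>true</batchJobs>
  <maxBatchBytes>10mb</maxBatchBytes>
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>
```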

Branch Configuration

The XML Sub Job Extractor publishes documents using the branch manager. Each sub-job is published on the onSubJob event, so the configuration must include a <branches> section for that event in order to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element	Type	Description
branches/branch/@event	String	The event to configure. Should always be "onSubJob".
branches/branch/@pipelineManager	String	The URL of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline	String	The name of the pipeline to publish to.
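
The three attributes above can be sketched together as a single branch. The pipeline name "process-record" is hypothetical and used only for illustration; the pipeline manager path is the relative one used elsewhere on this page.

```xml
<branches>
  <!-- "process-record" is a hypothetical pipeline name, not one defined by Aspire -->
  <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" pipeline="process-record" />
</branches>
```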

Example Configuration

<!-- Use FetchUrl to open a stream on the object, which is then read by XMLSubJobExtractor -->
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <!-- Publish every extracted record as a sub-job to the ProcessSingleRecord pipeline manager -->
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>

...

Input example

Typical input XML documents look like this:

...

Every child of the root element is processed as a separate sub-job document. Which element counts as the "root" can be set with the rootNode configuration parameter. The above XML therefore produces the following sub-job XML documents:

Sub Job #1:

<doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
    <!-- NOTE: a copy of the parent metadata is stored here -->
  </parent>
  <first_name>george</first_name>
  <last_name>washington</last_name>
  <description>Founding father #1</description>
</doc>

Sub Job #2:

<doc id="33" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
    <!-- NOTE: a copy of the parent metadata is stored here -->
  </parent>
  <first_name>thomas</first_name>
  <last_name>jefferson</last_name>
  <description>Founding father #2</description>
</doc>
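
To illustrate how rootNode selects the records to publish, here is a hypothetical input. Its element names are invented for this sketch and are not taken from the input example above. With rootNode set to /results/hits, each child of <hits> would be published as one sub-job.

```xml
<!-- Hypothetical input, for illustration only -->
<results>
  <total>2</total>
  <hits>
    <!-- published as the first sub-job -->
    <record><first_name>george</first_name></record>
    <!-- published as the second sub-job -->
    <record><first_name>thomas</first_name></record>
  </hits>
</results>
```

If rootNode were left unset, the root of the whole tree (<results>) would be used, so its children <total> and <hits> would be the elements published as sub-jobs.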

...
