...
Instead, it reads data from the input stream and, using a SAX handler, outputs XML records to the sub-job pipeline as they are found. This makes it very fast, with very low memory requirements.
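The streaming approach described above can be illustrated with Python's built-in `xml.sax` module. This is a minimal sketch, not Aspire's actual (Java) implementation: the `SubJobHandler` class name and the `/results/hits` path are hypothetical, and attributes and namespaces are ignored for brevity. It shows the core idea of emitting each child of a chosen root node as a standalone record without loading the whole document into memory.

```python
import io
import xml.sax
from xml.sax.saxutils import escape

class SubJobHandler(xml.sax.ContentHandler):
    """Collect each direct child of root_path as a standalone XML record."""

    def __init__(self, root_path="/results/hits"):
        super().__init__()
        self.root_parts = [p for p in root_path.split("/") if p]
        self.stack = []     # path of currently open elements
        self.buffer = None  # fragments of the record being captured
        self.depth = 0      # nesting depth inside the current record
        self.records = []   # extracted sub-job XML strings

    def startElement(self, name, attrs):
        if self.buffer is not None:
            # Already inside a record: capture the nested element.
            self.buffer.append(f"<{name}>")
            self.depth += 1
        elif self.stack == self.root_parts:
            # A direct child of the root node starts a new record.
            self.buffer = [f"<{name}>"]
            self.depth = 1
        self.stack.append(name)

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(escape(content))

    def endElement(self, name):
        self.stack.pop()
        if self.buffer is not None:
            self.buffer.append(f"</{name}>")
            self.depth -= 1
            if self.depth == 0:
                # Record complete: emit it (here, just collect it).
                self.records.append("".join(self.buffer))
                self.buffer = None

handler = SubJobHandler("/results/hits")
xml.sax.parse(io.StringIO(
    "<results><hits><hit><id>1</id></hit><hit><id>2</id></hit></hits></results>"
), handler)
```

Because the handler reacts to SAX events as they arrive, each record can be published as soon as its closing tag is seen; memory use is bounded by the size of a single record, not the whole file.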
XML Sub Job Extractor | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-xml-files |
subType | xmlSubJobExtractor |
Inputs | object['contentStream'], a data stream of the XML to process, or object['contentBytes'], the bytes of the XML to process. NOTE: A previous job (typically FetchURL) must have opened the content stream. |
Outputs | One AspireObject per sub-job, containing the XML of the individual record, published to the configured sub-job pipeline manager. |
...
...
...
Element | Type | Default | Description |
---|---|---|---|
branches | None | None | The configuration of the pipeline to publish to. See below. |
maxSubJobs | integer | 0 (= all) | The maximum number of subjobs to generate. If there are more possible jobs in the input XML file, they will be ignored. |
characterEncoding | String | UTF-8 | The character encoding of the XML file to be read, if not UTF-8. |
rootNode | String | None | The root node containing the sub-jobs to publish. If not specified, the root node of the entire XML tree is used. This value should be in path format, for example /results/hits, which publishes as sub-jobs all of the child elements occurring within the <results>/<hits> tags. Note: This is not an XPath, just a path naming a node within the XML hierarchy. It should start with a /; one will be added if missing. |
cleanse | boolean | true | Set to true to strip non-printable characters (e.g. ASCII code 15) from the XML content. |
honorDTD | boolean | false | Set to true if you want the parser to fetch the XML's DTD. |
batchJobs (2.1 Release) | boolean | false | Set to true if you want the extractor to create a batch for the input job and add child jobs to that batch. |
maxBatchBytes (2.1 Release) | long | unlimited | When using batching, limits the data in a batch based on the XML representation of the AspireObject published with the job. Once the amount of data added to the batch exceeds this value, the batch is closed and a new batch created. The limit may be specified as 1, 1b, 1k, 1kb, 1m, 1mb, 1g or 1gb. |
maxBatchBytesJSON (2.1 Release) | boolean | false | When limiting the batch size, use the JSON representation of the AspireObject published with the job when calculating the batch size. |
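A component configuration using these elements might look like the following sketch. The component name, pipeline manager path, and branch event label are illustrative placeholders, and the short `factoryName` form is an assumption (the table above gives the full name `com.searchtechnologies.aspire:aspire-xml-files`); check your Aspire version's documentation for the exact branch syntax.

```xml
<component name="ExtractXml" subType="xmlSubJobExtractor"
           factoryName="aspire-xml-files">
  <maxSubJobs>1000</maxSubJobs>
  <rootNode>/results/hits</rootNode>
  <cleanse>true</cleanse>
  <branches>
    <branch event="onSubJob" pipelineManager="./SubJobPipelineManager"/>
  </branches>
</component>
```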
...