Configuration
Element | Type | Default | Description |
---|---|---|---|
branches | None | | The configuration of the pipeline to publish to. See below. |
fileNamePatterns/include/@pattern | String | null | A regular expression selecting the files to include, e.g. ".*.xml$". |
fileNamePatterns/exclude/@pattern | String | null | A regular expression selecting the files to exclude, e.g. ".*tmp[^/]$". |
pathToScan | String | null | The directory to scan, e.g. file:///C:/aspire-home/data. The location given in <pathToScan> is scanned when no fetchUrl is present. When a fetchUrl (AspireObject element) is specified, that location is scanned instead. In both cases only files allowed by the include/exclude patterns are fed. |
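To illustrate the fetchUrl case, the sketch below shows what an incoming AspireObject might look like when the directory is supplied with the job rather than through <pathToScan>. The <doc> wrapper and <fetchUrl> element follow the Example Output on this page, and the path is the illustrative one from the table. The exact shape of the incoming job is an assumption.

```xml
<!-- Sketch only: assumed shape of an incoming job that names the directory to scan -->
<doc>
  <fetchUrl>file:///C:/aspire-home/data</fetchUrl>
</doc>
```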
Branch Configuration
The Scan Directory component publishes files using the branch manager, on the onPublish event. You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.
Element | Type | Description |
---|---|---|
branches/branch/@event | String | The event to configure. This must be onPublish. |
branches/branch/@pipelineManager | String | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | String | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
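As a small sketch of the attributes above, the fragment below omits the pipeline attribute so that, per the table, documents go to the pipeline manager's default pipeline. The pipeline manager name "ProcessFile" is borrowed from the example configuration on this page. Your own manager name will differ.

```xml
<branches>
  <!-- event must be onPublish; with no pipeline attribute, the default
       pipeline of the "ProcessFile" pipeline manager receives the documents -->
  <branch event="onPublish" pipelineManager="ProcessFile" />
</branches>
```

To target a specific pipeline instead, add a pipeline attribute naming it.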
Metadata Mapper Configuration
The ScanDir stage contains a large number of additional metadata fields which can be mapped to fields in the AspireObject XML.
Field | Default Output Field | Description |
---|---|---|
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com"). |
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com"). |
mimeType | mimeType | The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html". |
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8". |
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" http header, if it exists. Formatted as an ISO 8601 date-time. |
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" http header, if it exists. Formatted as an ISO 8601 date-time. |
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL. |
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK". |
all other HTTP headers | - | Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area. |
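A minimal sketch of how fields from this table might be remapped, reusing the <metadataMap> syntax from the Complex configuration example on this page. The target names lastModified and httpStatus are hypothetical, chosen only for illustration.

```xml
<metadataMap>
  <!-- Send a field to a different output field (target name is hypothetical) -->
  <map from="modificationDate" to="lastModified"/>
  <!-- Give a field with no default output field ("-") an explicit target
       (target name is hypothetical) -->
  <map from="status" to="httpStatus"/>
</metadataMap>
```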
Scanning directory via HTTP Command
You can tell Scan Directory to scan the directory using an HTTP command directly through the Admin interface. The URL would be:
http://<server>:50505/aspire/<component-name>?cmd=feed&url=<directory to feed>
Example Configuration for directory scan
Always scan the same directory
<component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
  <pathToScan>file:///C:/aspire-home/st_files</pathToScan>
  <fileNamePatterns>
    <include pattern=".*.xml$" />
    <exclude pattern=".*tmp[^/]$" />
  </fileNamePatterns>
  <branches>
    <branch event="onPublish" pipelineManager="ProcessFile" pipeline="process-doc" />
  </branches>
</component>
In this example, the directory specified in <pathToScan> is scanned. For each file that passes the include/exclude patterns, a fetchUrl is generated and published to the "process-doc" pipeline of the "ProcessFile" pipeline manager. Multiple include and exclude patterns can be specified by adding multiple <include/> and <exclude/> entries. If the same pattern is specified as both an include and an exclude pattern, the exclude takes precedence.
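The multiple-pattern and precedence rules can be sketched as follows. This fragment is illustrative only: the ".*\.txt$" pattern is listed under both include and exclude, so by the rule above text files would be excluded and only XML files fed. The dot is escaped here so that it matches a literal period, on the assumption that the patterns are standard regular expressions. Unescaped, "." matches any character.

```xml
<fileNamePatterns>
  <!-- Two include entries: XML files and text files -->
  <include pattern=".*\.xml$" />
  <include pattern=".*\.txt$" />
  <!-- Two exclude entries; the second repeats an include pattern,
       so exclude takes precedence and text files are not fed -->
  <exclude pattern=".*tmp[^/]$" />
  <exclude pattern=".*\.txt$" />
</fileNamePatterns>
```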
Complex configuration
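This configuration adds metadata mapping. It has no <pathToScan> element, so the directory to scan is taken from the fetchUrl of the incoming AspireObject.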
<component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
  <fileNamePatterns>
    <include pattern=".*.txt$" />
    <exclude pattern=".*tmp[^/]$" />
  </fileNamePatterns>
  <metadataMap>
    <map from="content-length-bytes" to="file-length"/>
    <map from="file-name" to="file-name"/>
  </metadataMap>
  <branches>
    <branch event="onPublish" pipelineManager="." pipeline="ProcessFile" />
  </branches>
</component>
Example Output
<doc>
  <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
  <file-length source="ScanDir/content-length-bytes">19</file-length>
  <file-name source="ScanDir/file-name">printwriter.txt</file-name>
  <extension source="ScanDir">
    <field name="modified-date">2011-04-13T16:49:49Z</field>
    <field name="parent-dir">testdata\scanDirTest1</field>
    <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
  </extension>
  .
  .
  .
</doc>