The Scan Directory stage is a subtype of the storage-handler service. It scans a directory, including all of its subdirectories, and creates a sub-job for each file it finds.
Scan Directory | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-storage-handler |
subType | scanDir |
Inputs | Directory location specified in <fetchUrl> of the AspireObject. Alternatively, you can specify pathToScan to scan the same directory every time (in this case, feedOne merely launches the job). |
Outputs | Sub-jobs, each with an AspireObject containing a <fetchUrl> that holds the URL of a scanned file. |
Element | Type | Default | Description |
---|---|---|---|
branches | | None | The configuration of the pipeline to publish to. See below. |
fileNamePatterns/include/@pattern | String | null | A regular expression that selects the files to allow, e.g. ".*.xml$". |
fileNamePatterns/exclude/@pattern | String | null | A regular expression that selects the files to disallow, e.g. ".*tmp[^/]$". |
pathToScan | String | null | The directory to scan, e.g. file:///C:/aspire-home/data. It is used when the incoming AspireObject has no fetchUrl element; when fetchUrl is present, that location is scanned instead. |
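
The precedence between fetchUrl and pathToScan described in the table can be pictured with an incoming job document like the sketch below. The `<doc>` root follows the output example on this page. The directory path is a made-up value, and the exact shape of the job a given feeder submits is an assumption.

```xml
<!-- Hypothetical incoming AspireObject: because fetchUrl is present,
     this directory is scanned rather than the configured pathToScan. -->
<doc>
  <fetchUrl>file:///C:/aspire-home/data</fetchUrl>
</doc>
```
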
The feed one feeder publishes files using the branch manager. It publishes using the onPublish event. You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.
Element | Type | Description |
---|---|---|
branches/branch/@event | String | The event to configure. This must be onPublish. |
branches/branch/@pipelineManager | String | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | String | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
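
As a minimal sketch of the attributes in the table above, the fragment below omits the pipeline attribute, so files should be published to the default pipeline of the named pipeline manager. The manager name `ProcessFile` is only an illustrative value; substitute the name used in your own system.

```xml
<!-- Sketch: publish each scanned file to the default pipeline of the
     ProcessFile pipeline manager (no pipeline attribute is given). -->
<branches>
  <branch event="onPublish" pipelineManager="ProcessFile" />
</branches>
```
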
The ScanDir stage provides the additional metadata fields listed below, which can be mapped to fields in the AspireObject XML.
Field | Default Output Field | Description |
---|---|---|
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com"). |
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com"). |
mimeType | mimeType | The MIME type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html". |
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8". |
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" HTTP header, if it exists. Formatted as an ISO 8601 date-time. |
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" HTTP header, if it exists. Formatted as an ISO 8601 date-time. |
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL. |
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK". |
all other HTTP headers | - | Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area. |
You can tell Scan Directory to scan the directory using an HTTP command directly through the Admin interface. The URL would be:
http://<server>:50505/aspire/<component-name>?cmd=feed&url=<directory to feed>

    <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
      <pathToScan>file:///C:/aspire-home/st_files</pathToScan>
      <fileNamePatterns>
        <include pattern=".*.xml$" />
        <exclude pattern=".*tmp[^/]$" />
      </fileNamePatterns>
      <branches>
        <branch event="onPublish" pipelineManager="ProcessFile" pipeline="process-doc" />
      </branches>
    </component>

In this example, the directory specified in <pathToScan> is scanned and, for each file allowed by the include/exclude patterns, a fetchUrl is generated and published to the "process-doc" pipeline of the "ProcessFile" pipeline manager. Multiple include and exclude patterns can be specified by adding multiple <include/> and <exclude/> entries. If the same pattern is specified as both an include and an exclude pattern, the exclude takes precedence.
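
To illustrate the multiple-entry form just described, a `<fileNamePatterns>` element might look like the sketch below. The element and attribute names come from the example configuration. The individual patterns are illustrative assumptions, and they presume Java-style regular expression syntax in which `\.` matches a literal dot.

```xml
<!-- Sketch: two include entries and two exclude entries.
     The patterns are hypothetical examples, not shipped defaults. -->
<fileNamePatterns>
  <include pattern=".*\.xml$" />
  <include pattern=".*\.txt$" />
  <exclude pattern=".*\.bak$" />
  <exclude pattern=".*tmp[^/]$" />
</fileNamePatterns>
```
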
The following configuration specifies a metadata mapping.

    <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
      <fileNamePatterns>
        <include pattern=".*.txt$" />
        <exclude pattern=".*tmp[^/]$" />
      </fileNamePatterns>
      <metadataMap>
        <map from="content-length-bytes" to="file-length"/>
        <map from="file-name" to="file-name"/>
      </metadataMap>
      <branches>
        <branch event="onPublish" pipelineManager="." pipeline="ProcessFile" />
      </branches>
    </component>

With a metadata mapping from content-length-bytes to file-length and from file-name to file-name, a published sub-job document looks like the following:

    <doc>
      <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
      <file-length source="ScanDir/content-length-bytes">19</file-length>
      <file-name source="ScanDir/file-name">printwriter.txt</file-name>
      <extension source="ScanDir">
        <field name="modified-date">2011-04-13T16:49:49Z</field>
        <field name="parent-dir">testdata\scanDirTest1</field>
        <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
      </extension>
      . . .
    </doc>