The Scan Directory stage is a subtype of the storage-handler service. It scans a directory, including all of its subdirectories, and creates a sub-job for each nested file.

Scan Directory
Factory Name	com.searchtechnologies.aspire:aspire-storage-handler
subType	scanDir
Inputs	Directory location specified in the <fetchUrl> element of the AspireObject. Alternatively, specify pathToScan to scan the same directory every time (in this case, feedOne merely launches the job).
Outputs	Sub-jobs, each with an AspireObject containing a <fetchUrl> that holds the URL of the scanned file.

Configuration

Element	Type	Default	Description
branches		None	The configuration of the pipeline to publish to. See below.
fileNamePatterns/include/@pattern	String	null	A regular expression selecting the files to allow, e.g. ".*.xml$".
fileNamePatterns/exclude/@pattern	String	null	A regular expression selecting the files to disallow, e.g. ".*tmp[^/]$".
pathToScan	String	null	The directory to scan when the AspireObject has no fetchUrl, e.g. file:///C:/aspire-home/data. When fetchUrl (an AspireObject element) is specified, that location is scanned instead. In both cases, only the files allowed by the patterns are fed.

Branch Configuration

The feed one feeder publishes files using the branch manager. It publishes using the onPublish event. You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element	Type	Description
branches/branch/@event	String	The event to configure. This must be onPublish.
branches/branch/@pipelineManager	String	The name of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline	String	The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager.

Metadata Mapper Configuration

The ScanDir stage provides additional metadata fields that can be mapped to fields in the AspireObject XML.

Field	Default Output Field	Description
protocol	protocol	The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com").
host	host	The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com").
mimeType	mimeType	The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding	encoding	The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8".
expirationDate	expirationDate	The expiration date reported by the HTTP server in the "expires" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate	modificationDate	The modification date reported by the HTTP server in the "last-modified" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl	redirectUrl	If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL.
status	-	The HTTP response status message. For example, "HTTP/1.1 200 OK".
all other HTTP headers	-	Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area.

Scanning a directory via HTTP Command

You can trigger a directory scan by sending an HTTP command directly through the Admin interface. The URL has the form:

http://<server>:50505/aspire/<component-name>?cmd=feed&url=<directory to feed>
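As a minimal sketch of how the command URL above might be assembled from a script, the Python below fills in the placeholders and percent-encodes the directory so that its own ":" and "/" characters survive as a query value. The function name build_feed_url, the host localhost, and the component name ScanDir are illustrative assumptions; only the port, the /aspire/ path, and the cmd and url parameters come from the URL pattern above. Whether Aspire requires the directory to be percent-encoded is not stated here, so treat the encoding as standard HTTP practice rather than a documented requirement.

```python
from urllib.parse import quote, urlencode


def build_feed_url(server, component, directory, port=50505):
    """Build the feed command URL for a Scan Directory component.

    All argument values are supplied by the caller; nothing is sent over
    the network here, the function only returns the URL string.
    """
    # urlencode percent-encodes the directory so it is a safe query value.
    query = urlencode({"cmd": "feed", "url": directory})
    # The component name goes into the path; "/" is left as-is.
    return f"http://{server}:{port}/aspire/{quote(component, safe='/')}?{query}"


# Hypothetical server and component names, for illustration only.
print(build_feed_url("localhost", "ScanDir", "file:///C:/aspire-home/st_files"))
```

The resulting string can be pasted into a browser or passed to any HTTP client.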

Example Configuration for directory scan

Always scan the same directory

  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler" >
    <pathToScan>file:///C:/aspire-home/st_files</pathToScan> 
    <fileNamePatterns>
      <include pattern=".*.xml$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <branches>
      <branch event="onPublish" pipelineManager="ProcessFile" pipeline="process-doc" />
    </branches> 
  </component>

In this example, the directory specified in <pathToScan> is scanned and, for each file allowed by the include/exclude patterns, a fetchUrl is generated and published to the "process-doc" pipeline of the "ProcessFile" pipeline manager. Multiple include and exclude patterns can be specified by adding multiple <include/> and <exclude/> entries. If the same pattern appears as both an include and an exclude, the exclude takes precedence.
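The precedence rule described above can be illustrated with a small Python sketch. It is not Aspire's implementation (Aspire is configured through the XML shown above); it only models the stated behaviour that an exclude match wins over an include match. The function name is_allowed is hypothetical, and the sketch makes three assumptions: patterns are tested against the file name, a pattern must match the whole name, and an empty include list allows every file that is not excluded.

```python
import re


def is_allowed(name, include_patterns, exclude_patterns):
    """Return True if `name` passes the include/exclude filter.

    Models only the documented precedence: exclude beats include.
    """
    # An exclude match rejects the file, whatever the includes say.
    if any(re.fullmatch(p, name) for p in exclude_patterns):
        return False
    # Assumption: with no include patterns, non-excluded files are allowed.
    if not include_patterns:
        return True
    # Otherwise at least one include pattern must match.
    return any(re.fullmatch(p, name) for p in include_patterns)


print(is_allowed("report.xml", [".*.xml$"], []))           # True
print(is_allowed("notes.txt", [".*.xml$"], []))            # False
print(is_allowed("report.xml", [".*.xml$"], [".*.xml$"]))  # False: exclude wins
```

Running candidate patterns through a check like this before deploying a configuration is a cheap way to confirm they select the files you expect.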

Complex configuration

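This configuration adds metadata mapping. Because no <pathToScan> is given, the directory to scan comes from the fetchUrl of the incoming AspireObject.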

  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler" >
    <fileNamePatterns>
      <include pattern=".*.txt$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <metadataMap>
      <map from="content-length-bytes" to="file-length"/>
      <map from="file-name" to="file-name"/>
    </metadataMap>
    <branches>
      <branch event="onPublish" pipelineManager="." pipeline="ProcessFile" />
    </branches> 
  </component>

Example Output

<doc>
    <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
    <file-length source="ScanDir/content-length-bytes">19</file-length>
    <file-name source="ScanDir/file-name">printwriter.txt</file-name>
    <extension source="ScanDir">
        <field name="modified-date">2011-04-13T16:49:49Z</field>
        <field name="parent-dir">testdata\scanDirTest1</field>
        <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
    </extension>
  .
  .
  .
</doc>
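To make the relationship between the <metadataMap> entries and the example output above easier to follow, here is a rough Python sketch that reproduces the same shape: mapped fields become top-level elements carrying a source attribute, and unmapped fields are collected as <field> entries under <extension>. This is inferred from the example output only and is not Aspire's code; the function name apply_metadata_map and the dictionary-based inputs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def apply_metadata_map(doc, fields, mapping, component="ScanDir"):
    """Add `fields` to `doc`, renaming those listed in `mapping`.

    Unmapped fields are placed under a single <extension> element.
    """
    extension = None
    for name, value in fields.items():
        if name in mapping:
            # Mapped field: <to-name source="Component/from-name">
            elem = ET.SubElement(doc, mapping[name])
            elem.set("source", f"{component}/{name}")
        else:
            # Unmapped field: <extension source="Component"><field name=...>
            if extension is None:
                extension = ET.SubElement(doc, "extension")
                extension.set("source", component)
            elem = ET.SubElement(extension, "field")
            elem.set("name", name)
        elem.text = value
    return doc


doc = ET.Element("doc")
apply_metadata_map(
    doc,
    {
        "content-length-bytes": "19",
        "file-name": "printwriter.txt",
        "modified-date": "2011-04-13T16:49:49Z",
    },
    {"content-length-bytes": "file-length", "file-name": "file-name"},
)
print(ET.tostring(doc, encoding="unicode"))
```

The field values are taken from the example output; the real stage populates them from the scanned file.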
