The Scan Directory stage is a subtype of the storage-handler service. It scans a directory, including all of its subdirectories, and creates a sub-job for every nested file.

Scan Directory
Factory Name | com.searchtechnologies.aspire:aspire-storage-handler
subType | scanDir
Inputs | Directory location specified in <fetchUrl> of the AspireObject. Alternatively, you can specify pathToScan to scan the same directory every time (in this case, feedOne merely launches the job).
Outputs | Sub-jobs, each with an AspireObject that contains a <fetchUrl> holding the URL of the file that was scanned.
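
The behaviour summarized above (walk a directory tree and emit one fetchUrl per file) can be illustrated with a minimal Python sketch. This is not Aspire code: the function name `scan_directory` is hypothetical, and the real stage creates sub-jobs with AspireObjects rather than plain strings.

```python
import os
from pathlib import Path


def scan_directory(root):
    """Yield a file: URL for every file under root, including subdirectories.

    Hypothetical illustration only; each yielded URL stands in for the
    <fetchUrl> that a sub-job's AspireObject would carry.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # resolve() makes the path absolute, which as_uri() requires
            yield (Path(dirpath) / name).resolve().as_uri()
```

Calling `list(scan_directory("some/dir"))` returns one URL per nested file, regardless of how deep the file sits in the tree.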

Configuration


Element | Type | Default | Description
branches | | None | The configuration of the pipeline to publish to. See below.
fileNamePatterns/include/@pattern | String | null | A regular expression selecting the files to include, e.g. ".*.xml$".
fileNamePatterns/exclude/@pattern | String | null | A regular expression selecting the files to exclude, e.g. ".*tmp[^/]$".
pathToScan | String | null | The directory location, e.g. file:///C:/aspire-home/data. The location given in <pathToScan> is scanned when the AspireObject has no fetchUrl. When fetchUrl (an AspireObject element) is specified, that location is scanned instead. In both cases only the allowed files are fed.
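
The interaction between fetchUrl and pathToScan described in the table can be sketched as a small Python function. The name `resolve_scan_location` is hypothetical, and raising an error when neither value is present is an assumption; the source does not say what the stage does in that case.

```python
def resolve_scan_location(fetch_url=None, path_to_scan=None):
    """Return the directory URL to scan, preferring the job's fetchUrl.

    Hypothetical helper: fetch_url stands for the AspireObject's <fetchUrl>,
    path_to_scan for the component's <pathToScan> setting.
    """
    if fetch_url:
        return fetch_url
    if path_to_scan:
        return path_to_scan
    # Assumption: the source does not describe this case
    raise ValueError("neither fetchUrl nor pathToScan was provided")
```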

 

Branch Configuration

The Scan Directory stage publishes the files it finds using the branch manager, on the onPublish event. You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element | Type | Description
branches/branch/@event | String | The event to configure. This must be onPublish.
branches/branch/@pipelineManager | String | The name of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline | String | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager.

 

Metadata Mapper Configuration

The ScanDir stage provides a number of additional metadata fields that can be mapped to fields in the AspireObject XML.

 

Field | Default Output Field | Description
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com").
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com").
mimeType | mimeType | The MIME type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8".
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL.
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK".
all other HTTP headers | - | Any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area.

 

Scanning directory via HTTP Command


You can tell Scan Directory to scan a directory by sending an HTTP command directly through the Admin interface. The URL has the form:

http://<server>:50505/aspire/<component-name>?cmd=feed&url=<directory to feed>
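
As a rough illustration, the command URL can be assembled in Python as follows. The function name `build_feed_command` and the example values are hypothetical. Percent-encoding the `url` parameter is standard practice for query strings, but whether Aspire requires it is an assumption.

```python
from urllib.parse import urlencode


def build_feed_command(server, component_name, directory_url, port=50505):
    """Build the feed command URL for a Scan Directory component.

    Hypothetical helper; component_name is whatever name or path the
    component has in your Aspire installation.
    """
    # urlencode percent-encodes reserved characters such as ':' and '/'
    query = urlencode({"cmd": "feed", "url": directory_url})
    return f"http://{server}:{port}/aspire/{component_name}?{query}"


if __name__ == "__main__":
    print(build_feed_command("localhost", "ScanDir",
                             "file:///C:/aspire-home/st_files"))
```

The resulting string can be pasted into a browser or passed to any HTTP client.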

Example Configuration for directory scan

Always scan the same directory

  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler" >
    <pathToScan>file:///C:/aspire-home/st_files</pathToScan> 
    <fileNamePatterns>
      <include pattern=".*.xml$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <branches>
      <branch event="onPublish" pipelineManager="ProcessFile" pipeline="process-doc" />
    </branches> 
  </component>

In this example, the directory specified in <pathToScan> is scanned and, for each file allowed by the include/exclude patterns, a fetchUrl is generated and published to the "process-doc" pipeline of the "ProcessFile" pipeline manager. Multiple include and exclude patterns can be specified by adding multiple <include/> and <exclude/> tags. If the same pattern is specified as both an include and an exclude pattern, the exclude takes precedence.
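
The include/exclude rule can be sketched in Python as shown below. This is a sketch under assumptions, not the stage's implementation: the name `is_allowed` is hypothetical, Aspire is a Java application and its regex matching mode (find versus full match) is not stated in the source, and treating an empty include list as "allow everything" is also an assumption.

```python
import re


def is_allowed(file_url, includes, excludes):
    """Return True if file_url passes the include/exclude patterns.

    Hypothetical helper. Exclude patterns are checked first, so a file
    matching both an include and an exclude pattern is rejected.
    """
    if any(re.search(pattern, file_url) for pattern in excludes):
        return False
    if not includes:
        # Assumption: with no include patterns, every file is allowed
        return True
    return any(re.search(pattern, file_url) for pattern in includes)
```

With the patterns from the example, `is_allowed("file:///C:/data/a.xml", [".*.xml$"], [".*tmp[^/]$"])` is true, while a `.txt` file is rejected. Note that the unescaped dot in ".*.xml$" matches any character, so a stricter pattern for the .xml extension would be ".*\.xml$".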

Complex configuration

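This configuration also specifies metadata mapping. No <pathToScan> is given, so the directory to scan is taken from the <fetchUrl> of the incoming AspireObject.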

  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler" >
    <fileNamePatterns>
      <include pattern=".*.txt$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <metadataMap>
      <map from="content-length-bytes" to="file-length"/>
      <map from="file-name" to="file-name"/>
    </metadataMap>
    <branches>
      <branch event="onPublish" pipelineManager="." pipeline="ProcessFile" />
    </branches> 
  </component>

Example Output

<doc>
    <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
    <file-length source="ScanDir/content-length-bytes">19</file-length>
    <file-name source="ScanDir/file-name">printwriter.txt</file-name>
    <extension source="ScanDir">
        <field name="modified-date">2011-04-13T16:49:49Z</field>
        <field name="parent-dir">testdata\scanDirTest1</field>
        <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
    </extension>
  .
  .
  .
</doc>
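
The relationship between the <metadataMap> entries and the output above can be illustrated with a minimal Python sketch. It is not the metadata mapper's implementation: the function name `apply_metadata_map` is hypothetical, and the rule that unmapped fields go into <extension> is inferred from the example output and the note on unmapped headers.

```python
import xml.etree.ElementTree as ET


def apply_metadata_map(fetch_url, fields, mapping, component="ScanDir"):
    """Build a <doc> element from scanned-file metadata.

    Hypothetical helper: `fields` maps source field names to values and
    `mapping` maps source field names to output element names, mirroring
    the <map from="..." to="..."/> entries.
    """
    doc = ET.Element("doc")
    ET.SubElement(doc, "fetchUrl").text = fetch_url
    extension = ET.Element("extension", {"source": component})
    for name, value in fields.items():
        if name in mapping:
            # Mapped field: top-level element that records its origin
            element = ET.SubElement(
                doc, mapping[name], {"source": f"{component}/{name}"})
            element.text = value
        else:
            # Unmapped field: kept in the <extension> area
            field = ET.SubElement(extension, "field", {"name": name})
            field.text = value
    if len(extension):
        doc.append(extension)
    return doc
```

Passing the result to `ET.tostring(doc, encoding="unicode")` gives XML with the same overall shape as the example output.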