The Hot Folder Feeder component monitors one or more directories, polling them periodically for files (optionally restricted by a file name filter), and publishes each file it finds to an Aspire pipeline manager. When a poll finds a matching file in an input directory, the feeder moves that file to an in-process directory and publishes it from there to the configured pipeline. When processing of the job is complete, the file is moved from the in-process directory to the completed directory if the job succeeded, or to the quarantine directory if it did not. Once all of the files in an input directory have been processed, the feeder moves on to the next configured directory. When no more directories remain, the feeder sleeps for a period of time before polling the directories again.

  • As files are processed, they are moved to an "in-process" directory.
  • Following processing, files are moved to a "completed" directory if the processing was successful, or to a "quarantine" directory if it was unsuccessful.
  • This feeder is based on the Simple Feeder.
Hot Folder Feeder
Factory Name: com.accenture.aspire:aspire-filefeeder
subType: hotFolderFeeder
Inputs: The files in the monitored directories.
Outputs: An AspireObject containing the path to the discovered file in the monitored directory in the <url> and <fetchUrl> tags, published to the configured pipeline manager.
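
To make the output concrete, a published document might look roughly like the sketch below. Only the <url>, <fetchUrl>, and <feederLabel> tags are named on this page; the <doc> root element, the plain-path form of the values, and the example path itself are assumptions made for illustration, and the real document may carry additional fields.

```xml
<!-- Hypothetical published document: root element and path are assumed -->
<doc>
  <!-- Default feeder label, unless overridden in the configuration -->
  <feederLabel>HotFolderFeeder</feederLabel>
  <!-- Both tags carry the path to the discovered file -->
  <url>/data/crawl/simpleDomain/in-process/sample.arc.gz</url>
  <fetchUrl>/data/crawl/simpleDomain/in-process/sample.arc.gz</fetchUrl>
</doc>
```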

Configuration

This feeder takes all parameters from the Simple Feeder plus the following:

| Element | Type | Default | Description |
|---|---|---|---|
| feederLabel | string | HotFolderFeeder | The feeder label submitted in the <feederLabel> of the published document. |
| jobResultXsl | string | Use built-in XSL | The XSL used to transform the job result XML returned from the pipeline. Use it to control what content is stored in the retained errors for failed jobs. |
| writeResultFileForCompleted | boolean | true | By default, the Hot Folder Feeder creates a result.xml file for each completed document in the "completed" folder. Set this option to false to disable that behavior, for example when all completed files must be moved somewhere else without the result.xml files. Keeping the result.xml file is recommended where possible, since it provides useful tracking information. |
| hotFolders | | None | The configuration of the folders to monitor. See Folder Configuration below. |
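
As a rough sketch of how these options might sit together in a component definition, the fragment below overrides the label, points to a custom result stylesheet, and turns off result.xml files. The component name, the directory paths, the pipeline names, and the assumption that jobResultXsl takes a path to an XSL file are all hypothetical; the element names come from the table above and the placement mirrors <feederLabel> in the example configurations.

```xml
<component name="exampleHotFolderFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
  <!-- Label written to <feederLabel> in each published document -->
  <feederLabel>ExampleFeed</feederLabel>
  <!-- Assumed to be a path to a custom XSL file (hypothetical path) -->
  <jobResultXsl>config/job-result.xsl</jobResultXsl>
  <!-- Do not write result.xml files into the completed folder -->
  <writeResultFileForCompleted>false</writeResultFileForCompleted>
  <hotFolders>
    <hotFolder>
      <inputQueueFolder>/data/example/input-queue</inputQueueFolder>
      <inProcessFolder>/data/example/in-process</inProcessFolder>
      <completedFolder>/data/example/completed</completedFolder>
      <quarantineFolder>/data/example/quarantine</quarantineFolder>
    </hotFolder>
  </hotFolders>
  <branches>
    <!-- Hypothetical pipeline manager and pipeline names -->
    <branch event="onPublish" pipelineManager="example-pipe-manager" pipeline="example-pipeline" />
  </branches>
</component>
```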

Folder Configuration

The hot folder feeder monitors one or more directories, periodically polling them to look for the presence of files. The folder configuration is shown below.

| Element | Type | Description |
|---|---|---|
| hotFolders/hotFolder | parent tag | Holds all of the information for a single set of hot folder directories. Each <hotFolder> tag holds the information for one set of inputQueue/inProcess/completed/quarantine directories, plus all of the parameters (timeouts, wildcard patterns, etc.) necessary for processing the files. A single hot folder feeder can contain as many <hotFolder> tags as needed, so one feeder can handle multiple hot folders. |
| hotFolders/hotFolder/@match | string | A regular expression describing the names of the files in the input directory that will be processed. If a file name is not matched by this expression, the file is ignored. If this option is not specified, all files are processed. |
| hotFolders/hotFolder/inputQueueFolder | string | The input directory to monitor. Files found in this directory when the feeder polls are moved to the in-process directory and published. |
| hotFolders/hotFolder/inProcessFolder | string | Files found in the input directory are moved to this directory and published. Files remain here until they are completely processed, after which they are moved to "completed" or "quarantine" as appropriate. Should the system crash, the files left in this directory are the ones that never finished, so they should probably be resubmitted (although one of them may also be the cause of the crash). |
| hotFolders/hotFolder/completedFolder | string | The completed directory. Files that are processed successfully are moved to this directory. If a file is split into sub-jobs, the parent file is still considered "successful" (in the current design) even if one of its children/sub-jobs reported an exception. The parent file is only reported as unsuccessful if the pipeline that processed the main job itself reported an exception. |
| hotFolders/hotFolder/quarantineFolder | string | The quarantine directory. Files that are not processed successfully are moved to this directory. |
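
The sketch below illustrates two <hotFolder> entries in one feeder, one with a match filter and one without. All directory paths are hypothetical; the match pattern is the one used in the example configurations on this page.

```xml
<hotFolders>
  <!-- Only files whose names match the pattern (such as sample.arc.gz) are processed -->
  <hotFolder match=".*\.arc\.gz">
    <inputQueueFolder>/data/feeds/archives/input-queue</inputQueueFolder>
    <inProcessFolder>/data/feeds/archives/in-process</inProcessFolder>
    <completedFolder>/data/feeds/archives/completed</completedFolder>
    <quarantineFolder>/data/feeds/archives/quarantine</quarantineFolder>
  </hotFolder>
  <!-- No match attribute: every file in this input directory is processed -->
  <hotFolder>
    <inputQueueFolder>/data/feeds/other/input-queue</inputQueueFolder>
    <inProcessFolder>/data/feeds/other/in-process</inProcessFolder>
    <completedFolder>/data/feeds/other/completed</completedFolder>
    <quarantineFolder>/data/feeds/other/quarantine</quarantineFolder>
  </hotFolder>
</hotFolders>
```

Giving each <hotFolder> its own four directories keeps the two streams of files apart, which makes it easier to tell which in-process files to resubmit after a crash.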

Metadata Mapper Configuration

The hot folder feeder maps some metadata fields to fields in the AspireObject.

| Field | Default Output Field | Description |
|---|---|---|
| fileName | fileName | The filename of the published file. |
| path | fileName | The path to the file. |
| fullFileName | fileName | The full filename (including the path) to the file. |
| fullPath | fullPath | The full path to the file (excluding the file name). |
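
A mapping can be sketched with the <metadataMap> syntax that appears in the Complex example configuration. In the fragment below, the target field name sourceFile is a hypothetical name chosen for illustration; the from values come from the table above. Following the Complex example, the block would be placed directly inside the feeder's <component> element.

```xml
<metadataMap>
  <!-- Keep the file name under a field of the same name -->
  <map from="fileName" to="fileName"/>
  <!-- Send the full file name to a custom field (hypothetical field name) -->
  <map from="fullFileName" to="sourceFile"/>
</metadataMap>
```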

Example Configurations

Simple

 <component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
   <hotFolders>
     <hotFolder match=".*\.arc\.gz">
       <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
       <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
       <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
       <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
     </hotFolder>
   </hotFolders>
   <branches>
     <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
   </branches>
 </component>

Complex

  <component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
    <feederLabel>CrawlDomain</feederLabel>        
    <metadataMap>
      <map from="fileName" to="fileName"/>
      <map from="fullPath" to="fullPath"/>
    </metadataMap>
    <autoStart>${autoFeedArc}</autoStart>
    <loopWait>43200000</loopWait>
    <feedWait>30000</feedWait>
    <hotFolders>
      <hotFolder match=".*\.arc\.gz">
        <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
        <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
        <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
        <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
      </hotFolder>
    </hotFolders>
    <branches>
      <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
    </branches>
  </component>

