The Hot Folder Feeder component monitors one or more directories, periodically polling them for files (with an optional file name filter) and publishing those files to an Aspire pipeline manager. When a poll of an input directory finds a matching file, the file is moved to an in-process directory and then published from there to the configured pipeline. When processing of the job is complete, the file is moved from the in-process directory to the completed directory (if successful) or to the quarantine directory (if not). Once all of the files in the input directory have been processed, the feeder moves on to the next directory; when no directories remain, the feeder sleeps for a period of time before polling the directories again.
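The lifecycle above can be sketched in a few lines of Python. This is not the Aspire implementation, just an illustration of the documented control flow; the directory keys mirror the configuration element names described later, and `publish` is a hypothetical callable standing in for publishing to the pipeline manager.

```python
import os
import shutil
import time

def process_folder_once(folder, publish):
    """One polling pass over a single hot folder configuration.

    `folder` is a dict with the four directory paths; `publish` is a
    callable that raises on failure (standing in for the pipeline)."""
    for name in sorted(os.listdir(folder["inputQueueFolder"])):
        src = os.path.join(folder["inputQueueFolder"], name)
        in_process = os.path.join(folder["inProcessFolder"], name)
        shutil.move(src, in_process)              # 1. move to in-process
        try:
            publish(in_process)                   # 2. publish from in-process
            dest = folder["completedFolder"]      # 3a. success -> completed
        except Exception:
            dest = folder["quarantineFolder"]     # 3b. failure -> quarantine
        shutil.move(in_process, os.path.join(dest, name))

def run_feeder(folders, publish, loop_wait=30):
    """Process every configured folder, then sleep and poll again."""
    while True:
        for folder in folders:
            process_folder_once(folder, publish)
        time.sleep(loop_wait)
```

A single pass (`process_folder_once`) is separated from the polling loop (`run_feeder`) so the per-file move/publish/move sequence is easy to follow.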
Hot Folder Feeder | |
---|---|
Factory Name | com.searchtechnologies.aspire:aspire-filefeeder |
subType | hotFolderFeeder |
Inputs | The files in the monitored directories. |
Outputs | An AspireObject containing the path to the discovered file in the monitored directory in the <url> and <fetchUrl> tags, published to the configured pipeline manager |
This feeder takes all parameters from the Simple Feeder plus the following:
Element | Type | Default | Description |
---|---|---|---|
feederLabel | string | HotFolderFeeder | The feeder label submitted in the <feederLabel> of the published document. |
jobResultXsl | string | Use built in XSL | The XSL used to transform job result XML returned from the pipeline. Use to control what content is stored in the retained errors for failed jobs. |
writeResultFileForCompleted | boolean | true | By default, the Hot Folder Feeder creates a result.xml file for each completed document in the 'completed' folder. This may be undesirable in some cases (e.g. you need to move all of the completed files somewhere else, without the result.xml), so this option lets you disable the feature. Keeping the result.xml file is generally recommended, since it provides useful tracking information. |
hotFolders | parent tag | None | The configuration of the folders to monitor. See below. |
The hot folder feeder monitors one or more directories, periodically polling them to look for the presence of files. The folder configuration is shown below.
Element | Type | Description |
---|---|---|
hotFolders/hotFolder | parent tag | Holds all of the information for a single set of hot folder directories. Each <hotFolder> tag holds the information for a set of inputQueue/inProcess/completed/quarantine directories plus all of the parameters (timeouts, wildcard patterns, etc.) necessary for processing the files. A single hot folder feeder may contain as many <hotFolder> tags as you like, allowing one feeder to handle multiple hot folders. |
hotFolders/hotFolder/@match | String | A regular expression detailing the names of the files in the input directory that will be processed. If the file name is not matched by this expression, the file will be ignored. If this option is not specified, all files will be processed. |
hotFolders/hotFolder/inputQueueFolder | string | The input directory to monitor. Files found in this directory when the feeder polls will be moved to the in-process directory and published. |
hotFolders/hotFolder/inProcessFolder | string | Files found in the input directory will be moved to this directory and published. Files remain in this directory until they are completely processed, after which they are moved to "completed" or "quarantine" as appropriate. Should the system crash, the files in this directory are the ones that never finished, and so should probably be resubmitted (or, they may be the cause of the crash). |
hotFolders/hotFolder/completedFolder | string | The completed directory. Files that are processed successfully will be moved to this directory. If a file is split into sub-jobs, the parent file is still considered to be "successful" (in the current design) even if one of its children/sub-jobs reported an exception error. The parent file is only reported as unsuccessful if the pipeline which processed the main job itself reported an exception. |
hotFolders/hotFolder/quarantineFolder | string | The quarantine directory. Files whose processing fails will be moved to this directory. |
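The `match` attribute accepts a regular expression that must match the entire file name, as in the `.*\.arc\.gz` pattern used in the examples below. The following sketch demonstrates that filtering with Python's `re` module; full-match semantics are assumed here.

```python
import re

# Pattern taken from the example configurations in this document:
# accept only gzipped ARC files.
pattern = re.compile(r".*\.arc\.gz")

def is_accepted(file_name, regex=pattern):
    """Return True if the file name matches the hot folder filter."""
    return regex.fullmatch(file_name) is not None

print(is_accepted("crawl-001.arc.gz"))   # True  - file is processed
print(is_accepted("notes.txt"))          # False - file is ignored
```

Note that `.` must be escaped (`\.`) to match a literal dot; an unescaped `.` would also match names like `crawl-001XarcXgz`.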
The hot folder feeder maps some metadata fields to fields in the AspireObject.
Field | Default Output Field | Description |
---|---|---|
fileName | fileName | The filename of the published file. |
path | path | The path to the file. |
fullFileName | fullFileName | The full filename (including the path) to the file. |
fullPath | fullPath | The full path to the file (excluding the file name). |
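The relationship between these fields can be illustrated with standard path operations. This is a hedged sketch: the field names come from the table above, but the derivation below is illustrative, not taken from the Aspire source.

```python
import os

def metadata_for(discovered_path):
    """Illustrative derivation of the metadata fields for one file."""
    return {
        "fileName": os.path.basename(discovered_path),   # name only
        "fullFileName": discovered_path,                 # name including path
        "fullPath": os.path.dirname(discovered_path),    # path excluding name
    }

print(metadata_for("/data/input-queue/doc.arc.gz"))
```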
A minimal example configuration:

```xml
<component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
  <hotFolders>
    <hotFolder match=".*\.arc\.gz">
      <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
      <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
      <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
      <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
    </hotFolder>
  </hotFolders>
  <branches>
    <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
  </branches>
</component>
```
An example configuration with a feeder label, metadata map, and polling intervals:

```xml
<component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
  <feederLabel>CrawlDomain</feederLabel>
  <metadataMap>
    <map from="fileName" to="fileName"/>
    <map from="fullPath" to="fullPath"/>
  </metadataMap>
  <autoStart>${autoFeedArc}</autoStart>
  <loopWait>43200000</loopWait>
  <feedWait>30000</feedWait>
  <hotFolders>
    <hotFolder match=".*\.arc\.gz">
      <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
      <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
      <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
      <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
    </hotFolder>
  </hotFolders>
  <branches>
    <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
  </branches>
</component>
```