The FTP Scanner component performs full and incremental scans of content on an FTP server. It maintains a snapshot of the filesystem and compares it with the current content to establish what has changed. Changed content is then submitted to the configured pipeline in AspireObjects attached to Jobs. As well as the URL of the changed item, the AspireObject also contains metadata extracted from the repository. Changed content is split into three types: add, update and delete. Each type is published as a different event so that it may be handled by different Aspire pipelines.

The scanner reacts to an incoming job. This job may instruct the scanner to start, stop, pause or resume. Typically the start job contains all the information the scanner requires to perform the crawl. However, the scanner can also be configured with default values via the application.xml file. When pausing or stopping, the scanner waits until all the jobs it published have completed before completing itself.

Note: The FTP Scanner component was written during an Extreme Programming exercise at the company kick-off. We made it available as a proof-of-concept connector. If you find bugs, or if it doesn't do what you want, let us know and we'll try to fix it.

FTP Scanner
| Property | Value |
|---|---|
| Factory Name | com.searchtechnologies.aspire:aspire-ftp-scanner |
| subType | default |
| Inputs | AspireObject from a content source submitter holding all the information required for a crawl |
| Outputs | Jobs from the crawl |

Configuration

This section lists all configuration parameters available to configure the FTP Scanner component.

General Scanner Component Configuration

Basic Scanner Configuration

| Element | Type | Default | Description |
|---|---|---|---|
| snapshotDir | String | snapshots | The directory for snapshot files. |
| numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
| waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
| maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. Whichever happens first between this property and maxOutstandingUpdatesStatistics triggers an update to the statistics file. |
| maxOutstandingUpdatesStatistics | long | 1000 | The maximum number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics triggers an update to the statistics file. |
| usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domain in the group expander). |
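
As a rough sketch, the basic scanner elements listed above could be set on the component as follows. The values are simply the defaults from the table, except for the snapshotDir path, which is borrowed from the configuration example later in this section. Placing every element directly under the component element is an assumption, based on how snapshotDir and waitForSubJobsTimeout appear in that example.

```xml
<component factoryName="aspire-ftp-scanner" name="Scanner" subType="scanner">
  <!-- Directory for snapshot files (path borrowed from the configuration example) -->
  <snapshotDir>${data.dir}/FTP_Connector/snapshots</snapshotDir>
  <!-- Number of snapshots to keep after processing (documented default) -->
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- Wait up to 10 minutes for published jobs to complete -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <!-- Update the statistics file after 1m or 1000 files, whichever comes first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <!-- Use the domain\user format in group expansion requests -->
  <usesDomain>true</usesDomain>
  <!-- branches omitted for brevity -->
</component>
```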

Branch Handler Configuration

This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.

| Element | Type | Description |
|---|---|---|
| branches/branch/@event | string | The event to configure: onAdd, onDelete or onUpdate. |
| branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
| branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
| branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
| branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
| branches/branch/@batchSize | int | The maximum size of the batches that the branch handler creates. |
| branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
| branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that the branch handler handles. |

Configuration Example

  <component factoryName="aspire-ftp-scanner" name="Scanner" subType="scanner">
    <debug>false</debug>
    <snapshotDir>${data.dir}/FTP_Connector/snapshots</snapshotDir>
    <fileNamePatterns>
      <include pattern=".*"/>
      <exclude pattern=".*tmp$"/>
    </fileNamePatterns>
    <emitCrawlStartJobs>true</emitCrawlStartJobs>
    <emitCrawlEndJobs>true</emitCrawlEndJobs>
    <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
    <enableAuditing>true</enableAuditing>
    <failedDocumentsService/>
    <branches>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000"
        batching="true" event="onAdd" pipeline="addUpdatePipeline"
        pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000"
        batching="true" event="onUpdate" pipeline="addUpdatePipeline"
        pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" batchSize="50" batchTimeout="60000"
        batching="true" event="onDelete" pipeline="deletePipeline"
        pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
      <branch allowRemote="true" event="onCrawlStart" pipeline="crawlStartEndPipeline"
        pipelineManager="../ProcessPipelineManager"/>
      <branch allowRemote="true" event="onCrawlEnd" pipeline="crawlStartEndPipeline"
        pipelineManager="../ProcessPipelineManager"/>
    </branches>
  </component>

Source Configuration

Scanner Control Configuration

The following table lists the attributes that the AspireObject of the incoming scanner job needs in order to execute and control a scan.

| Element | Type | Options | Description |
|---|---|---|---|
| @action | string | start, stop, pause, resume, abort | Control command that tells the scanner which operation to perform. Use the start option to launch a new crawl. |
| @actionProperties | string | full, incremental | When a start @action is received, tells the scanner whether to run a full or an incremental crawl. |
| @normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
| displayName | string | | Display or friendly name for the content source that will be crawled. |

Header Example

  <doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector"
   scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
    ...
    <displayName>testSource</displayName>
    ...
  </doc>
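
For comparison, the other control commands might look like the sketch below. Each doc element is a separate job, shown together only for brevity. The source does not state which attributes beyond @action and @normalizedCSName are needed for non-start commands, so the minimal attribute sets here are an assumption, and FTP_Connector is just the content source name reused from the examples on this page.

```xml
<!-- Job 1: start an incremental crawl instead of a full one -->
<doc action="start" actionProperties="incremental" normalizedCSName="FTP_Connector">
  <displayName>FTP_Connector</displayName>
</doc>

<!-- Job 2: pause the running crawl for the same content source -->
<doc action="pause" normalizedCSName="FTP_Connector"/>

<!-- Job 3: resume the paused crawl -->
<doc action="resume" normalizedCSName="FTP_Connector"/>
```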

All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.

| Property | Type | Default | Description |
|---|---|---|---|
| server | string | | The name of the FTP server. |
| port | int | | The port on which the FTP server is running. |
| url | string | | The directory on the FTP server to crawl. |
| username | string | | The username to connect with. |
| password | string | | The password for the username to connect with. |
| passive | boolean | false | Connect to the FTP server using passive mode. |
| indexContainers | boolean | false | true if folders (as well as files) should be indexed. |
| scanRecursively | boolean | false | true if subfolders of the given URL should be scanned. |
| fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
| fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
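
A minimal sketch of include and exclude patterns inside connectorSource is shown below. The element syntax follows the fileNamePatterns block in the component configuration example; the first two patterns come from that example, while the `.*\.bak$` pattern is a hypothetical addition to illustrate multiple exclude nodes. How the scanner resolves a file that matches both an include and an exclude pattern is not stated here, so treat the combined behavior as something to verify.

```xml
<connectorSource>
  <server>ftp.searchtechnologies.com</server>
  <port>21</port>
  <url>/test</url>
  <fileNamePatterns>
    <!-- Include every file URL -->
    <include pattern=".*"/>
    <!-- Exclude file URLs ending in "tmp" -->
    <exclude pattern=".*tmp$"/>
    <!-- Hypothetical second exclude: file URLs ending in ".bak" -->
    <exclude pattern=".*\.bak$"/>
  </fileNamePatterns>
</connectorSource>
```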

Scanner Configuration Example

  <doc action="start" actionProperties="full" actionType="manual" normalizedCSName="FTP_Connector" sourceName="FTP_Connector">
    <connectorSource>
      <server>ftp.searchtechnologies.com</server>
      <port>21</port>
      <url>/test</url>
      <username>sd-ftp-user</username>
      <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
      <passive>true</passive>
      <indexContainers>false</indexContainers>
      <scanRecursively>true</scanRecursively>
      <scanExcludedItems>false</scanExcludedItems>
      <fileNamePatterns/>
    </connectorSource>
    <displayName>FTP_Connector</displayName>
  </doc>

Example Output

<doc>
  <url>/test/11/00/0/1.txt</url>
  <id>/test/11/00/0/1.txt</id>
  <fetchUrl>/test/11/00/0/1.txt</fetchUrl>
  <displayUrl>/test/11/00/0/1.txt</displayUrl>
  <snapshotUrl>005 /test/11/00/0/1.txt</snapshotUrl>
  <doFetch>true</doFetch>
  <doPopulate>true</doPopulate>
  <docType>item</docType>
  <lastModified>Thu Jun 25 11:42:00 BST 2015</lastModified>
  <dataSize>4264</dataSize>
  <repItemType>aspire/file</repItemType>
  <sourceName>FTP Connector</sourceName>
  <sourceType>ftp</sourceType>
  <connectorSource type="ftp">
    <server>ftp.searchtechnologies.com</server>
    <port>21</port>
    <url>/test</url>
    <username>sd-ftp-user</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <passive>true</passive>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FTP Connector</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="E392FAB00D12B2340E5BE938C982ABBA" level="5" name="1.txt" url="/test/11/00/0/1.txt">
      <ancestors>
        <ancestor id="0891335A4083FCD65DD995A58E23EF39" level="4" name="0" parent="true" type="aspire/folder" url="/test/11/00/0"/>
        <ancestor id="46FFA5AD9FAD0068CE164E1B5D7917E1" level="3" name="00" type="aspire/folder" url="/test/11/00"/>
        <ancestor id="1C53A23263BBF5898E320F633910B6F6" level="2" name="11" type="aspire/folder" url="/test/11"/>
        <ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="1" name="FTP Connector" type="aspire/folder" url="/test"/>
      </ancestors>
    </item>
  </hierarchy>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">/test/11/00/0/1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled over eastern North America, bringing dangerously low temperatures not seen in decades.

About half of the US population has been placed under a wind chill warning or cold weather advisory.

In Toronto, the temperature dropped to -24C (-11F) before dawn on Tuesday.

Air, rail and road travel remain snarled by high, freezing wind, and residents have been warned to stay indoors to avoid frostbite.

Cold air broke records in Chicago on Monday, where the temperature of -16F (-27C) was the lowest ever seen on that date.

It was one of more than 120 daily temperature records broken in cities across the US since the beginning of 2014, many dating back decades.
Sharp temperature drop

Chicagoans explain how they cope with the extreme weather

The arrival late on Monday of the arctic weather pattern caused temperatures to plummet overnight in New York and Washington DC by as much as 45 degrees in a matter of hours, from unseasonably warm highs a day earlier.

New York Governor Andrew Cuomo closed parts of major highways across his state in preparation for the extreme weather.

Adding to the misery, forecasters say the areas on the eastern shores of the Great Lakes could again be blanketed by snow, as the cold air moved over the water.

In Canada, 4,000 residents of Quebec and 1,000 in Newfoundland were still without power on Tuesday amid the freezing temperatures and snow.

The polar blast was threatening crops and livestock across the American farm belt, even in the usually temperate Deep South. The freeze was expected to reach as far south as Texas and central Florida, the National Weather Service said.

Meteorologists said some 187 million people in all would feel the effects of the cold by Tuesday.
Transport trouble

The frigid temperatures have been widely blamed on a shift in the weather pattern known as the "polar vortex".

What can you wear to help cope with extreme cold weather?

On Tuesday, the extreme weather caused the cancellation of 2,500 flights, along with widespread road and rail delays.

JetBlue Airways operations, which had been suspended at airports in Boston and around New York City, were returning to normal.

More than 500 passengers on their way to Chicago were stuck overnight in northern Illinois on three Amtrak passenger trains after drifting snow and ice covered the tracks.

And in Indianapolis, Indiana, it has temporarily been made illegal to drive except in an emergency or to seek shelter, in order to keep the roads free for emergency vehicles.

Cold temperatures reached deep into the US south-east.

The weather has been blamed for at least 16 deaths in recent days, including:

    A one-year-old boy in Missouri who was killed in a car collision with a snowplough
    A worker at a Philadelphia salt storage facility who died when a 100-ft (30-m) pile of road salt collapsed on him
    Four men across Illinois who suffered fatal heart attacks while shovelling snow

Frostbite graphic

The state of Minnesota and the city of Chicago, Illinois, have ordered all schools closed.

It was so cold that even the polar bear at Chicago's Lincoln Park Zoo was kept indoors, CNN reports.

In Kentucky, an inmate who escaped a minimum security prison turned himself in to get out of the cold, the Associated Press reported.

Some relief was in sight in the Midwest, as the cold air pattern moved eastward, the National Weather Service said.
A pedestrian walks past a mural depicting a winter scene in Montreal, Quebec, on 7 January 2014 A pedestrian walks past a mural depicting a winter scene in Montreal, Quebec
A man warms himself near a fire in Indianapolis, Indiana, on 7 January 2014 A man warms himself before a fire in Indianapolis, Indiana
Passengers wait for a train in below-zero temperatures in Chicago, Illinois, on 7 January 2014 Passengers wait for a train in below-zero temperatures in Chicago, Illinois
A man walks past a snow encrusted bicycle in Chicago on 7 January 2014 A frozen bicycle in downtown Chicago on Tuesday
A salesmen at a car dealer digs out cars covered in snow in Indianapolis, Indiana, on 7 January 2014 A salesmen digs out cars at a dealership in Indianapolis, Indiana 
]]></content>
</doc>
 