Configuration
This section lists the configuration parameters available for the RSS Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m | The maximum amount of time to wait before updating the statistics file. The update is triggered by whichever limit is reached first: this property or maxOutstandingUpdatesStatistics. |
maxOutstandingUpdatesStatistics | long | 1000 | The max number of files to process before updating the statistics file. Whichever happens first between this property and maxOutstandingTimeStatistics will trigger an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support domains in the group expander). |
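As a rough illustration of the table above, the fragment below writes each element out with its documented default value. It assumes these elements sit as direct children of the scanner component, as snapshotDir and waitForSubJobsTimeout do in the configuration example on this page; the placement of the other four elements is an assumption, not something this page confirms.

```xml
<!-- Sketch only: every value is the default from the table above.
     Placing all six elements directly under the component is an assumption. -->
<component factoryName="aspire-rss-scanner" name="Scanner" subType="scanner">
  <snapshotDir>snapshots</snapshotDir>
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- 600000 ms = 10 minutes -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <!-- The statistics file is updated when either limit below is reached first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <usesDomain>true</usesDomain>
</component>
```

Since these are the defaults, omitting any of the elements should presumably give the same behaviour; they are spelled out here only to show where each setting would go.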
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler. |
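A minimal sketch of the three required branches, using only the event and pipelineManager attributes. It assumes the remaining attributes are optional, which this page does not state explicitly; the pipeline manager path is borrowed from the configuration example on this page.

```xml
<!-- Sketch only: assumes every attribute except event and pipelineManager may be omitted -->
<branches>
  <!-- No pipeline attribute: jobs go to the default pipeline of the pipeline manager -->
  <branch event="onAdd" pipelineManager="../ProcessPipelineManager"/>
  <branch event="onUpdate" pipelineManager="../ProcessPipelineManager"/>
  <branch event="onDelete" pipelineManager="../ProcessPipelineManager"/>
</branches>
```

Batching is left out here; adding batching, batchSize, batchTimeout and simultaneousBatches to a branch turns on the batch behaviour described in the table.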
Configuration Example
<component factoryName="aspire-rss-scanner" name="Scanner" subType="scanner">
  <debug>false</debug>
  <fullRecovery>incremental</fullRecovery>
  <incrementalRecovery>incremental</incrementalRecovery>
  <snapshotDir>${data.dir}/RSS_Connector/snapshots</snapshotDir>
  <emitCrawlStartJobs>true</emitCrawlStartJobs>
  <emitCrawlEndJobs>true</emitCrawlEndJobs>
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <enableAuditing>true</enableAuditing>
  <failedDocumentsService/>
  <branches>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onAdd" pipeline="addUpdatePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onUpdate" pipeline="addUpdatePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" batchSize="50" batchTimeout="60000" batching="true" event="onDelete" pipeline="deletePipeline" pipelineManager="../ProcessPipelineManager" simultaneousBatches="2"/>
    <branch allowRemote="true" event="onCrawlStart" pipeline="crawlStartEndPipeline" pipelineManager="../ProcessPipelineManager"/>
    <branch allowRemote="true" event="onCrawlEnd" pipeline="crawlStartEndPipeline" pipelineManager="../ProcessPipelineManager"/>
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the list of attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, it will tell the scanner to either run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source that will be crawled. |
displayName | string | | Display or friendly name for the content source that will be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
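By analogy with the header example above, a job that starts an incremental crawl might look like the sketch below. Only the attributes listed in the control table are shown; whether the scheduler-related attributes from the header example (crawlId, jobNumber and so on) can be left out is an assumption.

```xml
<!-- Hypothetical control job: start an incremental crawl of an existing content source.
     Only the attributes from the control table are included; omitting the others is an assumption. -->
<doc action="start" actionProperties="incremental" normalizedCSName="FeedOne_Connector">
  <displayName>testSource</displayName>
</doc>
```

The other control commands (stop, pause, resume, abort) would presumably be sent the same way with a different action value. This page describes actionProperties only for the start action.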
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.
Property | Type | Default | Description |
---|---|---|---|
feedUrls/feedUrl (Multiple allowed) | string | | The URL of the RSS feed to crawl. |
indexContainers | boolean | false | true if folders (as well as files) should be indexed. |
Scanner Configuration Example
<doc action="start" actionProperties="full" actionType="manual" normalizedCSName="FTP_Connector" sourceName="FTP_Connector">
  <connectorSource>
    <feedUrls>
      <feedUrl>http://feeds.bbci.co.uk/news/rss.xml?edition=uk</feedUrl>
      <feedUrl>http://feeds.skynews.com/feeds/rss/world.xml</feedUrl>
      <feedUrl>http://feeds.skynews.com/feeds/rss/us.xml</feedUrl>
      <feedUrl>http://www.telegraph.co.uk/news/uknews/rss</feedUrl>
      <feedUrl>http://www.telegraph.co.uk/news/rss</feedUrl>
      <feedUrl>http://www.telegraph.co.uk/sport/rss</feedUrl>
    </feedUrls>
    <indexContainers>false</indexContainers>
  </connectorSource>
  <displayName>FTP_Connector</displayName>
</doc>
Example Output
<doc>
  <url>http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</url>
  <id>http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</id>
  <fetchUrl>http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</fetchUrl>
  <displayUrl>http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</displayUrl>
  <snapshotUrl>001 http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</snapshotUrl>
  <doFetch>true</doFetch>
  <doPopulate>true</doPopulate>
  <docType>item</docType>
  <repItemType>aspire/document</repItemType>
  <createdBy/>
  <connectorSpecific>
    <field name="comments"/>
    <field name="link">http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</field>
    <field name="description">One man decapitated and several hurt in suspected Islamist attack on factory near Lyon, French sources say.</field>
    <field name="title">'Man decapitated' in French attack</field>
    <field name="uri">http://www.bbc.co.uk/news/world-europe-33284937</field>
  </connectorSpecific>
  <lastModified>2015-06-26T09:39:05Z</lastModified>
  <connectorSource type="rss">
    <feedUrls>
      <feedUrl>http://feeds.bbci.co.uk/news/rss.xml?edition=uk</feedUrl>
      <feedUrl>http://feeds.skynews.com/feeds/rss/uk.xml</feedUrl>
    </feedUrls>
    <indexContainers>false</indexContainers>
    <displayName>RSS Connector</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="12FEEA65F246F754CD14CDE5AD5029D6" level="1" url="http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa"/>
  </hierarchy>
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse>
  <protocol source="FetchURLStage/protocol">http</protocol>
  <host source="FetchURLStage/host">www.bbc.co.uk</host>
  <mimeType source="FetchURLStage/mimeType">text/html</mimeType>
  <encoding source="FetchURLStage/encoding">utf-8</encoding>
  <extension source="FetchURLStage">
    <field name="status">HTTP/1.1 200 OK</field>
    <field name="Server">Apache</field>
    <field name="Content-Type">text/html; charset=utf-8</field>
    <field name="X-News-Data-Centre">cwwtf</field>
    <field name="Content-Language">en-GB</field>
    <field name="X-PAL-Host">pal071.back.live.cwwtf.local:80</field>
    <field name="X-News-Cache-Id">97231</field>
    <field name="Content-Length">115141</field>
    <field name="Date">Fri, 26 Jun 2015 09:59:47 GMT</field>
    <field name="Connection">keep-alive</field>
    <field name="Set-Cookie">BBC-UID=9535987d82a209d365620e344160f72abcd7544074e4c1aeba21b4b40e9538d00Java/1.7.0_67; expires=Tue, 25-Jun-19 09:59:47 GMT; path=/; domain=.bbc.co.uk</field>
    <field name="Cache-Control">private, max-age=60, stale-while-revalidate</field>
    <field name="X-Cache-Action">HIT</field>
    <field name="X-Cache-Hits">54</field>
    <field name="X-Cache-Age">49</field>
    <field name="X-LB-NoCache">true</field>
    <field name="Vary">X-CDN,X-BBC-Edge-Cache,Accept-Encoding</field>
  </extension>
  <title source="ExtractTextStage/title">'Man decapitated' in French attack - BBC News</title>
  <contentType source="ExtractTextStage/Content-Type">text/html; charset=UTF-8</contentType>
  <description source="ExtractTextStage/description">One man decapitated and several hurt in suspected Islamist attack on factory near Lyon, French sources say.</description>
  <extension source="ExtractTextStage">
    <field name="robots">NOODP,NOYDIR</field>
    <field name="og:type">article</field>
    <field name="twitter:title">'Man decapitated' in French attack - BBC News</field>
    <field name="x-audience">Domestic</field>
    <field name="twitter:domain">www.bbc.co.uk</field>
    <field name="og:locale">en_GB</field>
    <field name="msapplication-TileImage">http://static.bbci.co.uk/news/1.75.0364/windows-eight-icon-144x144.png</field>
    <field name="x-country">gb</field>
    <field name="cleartype">on</field>
    <field name="og:article:author">BBC News</field>
    <field name="resourceName">http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</field>
    <field name="CPS_AUDIENCE">Domestic</field>
    <field name="dc:title">'Man decapitated' in French attack - BBC News</field>
    <field name="viewport">width=device-width, initial-scale=1, user-scalable=1</field>
    <field name="twitter:creator">@BBCNews</field>
    <field name="theme-color">#bb1919</field>
    <field name="og:title">'Man decapitated' in French attack - BBC News</field>
    <field name="og:article:section">Europe</field>
    <field name="application-name">BBC News</field>
    <field name="mobile-web-app-capable">yes</field>
    <field name="og:description">One man decapitated and several hurt in suspected Islamist attack on factory near Lyon, French sources say.</field>
    <field name="X-UA-Compatible">IE=edge,chrome=1</field>
    <field name="apple-mobile-web-app-title">BBC News</field>
    <field name="twitter:card">summary_large_image</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="og:site_name">BBC News</field>
    <field name="og:url">http://www.bbc.co.uk/news/world-europe-33284937</field>
    <field name="og:image">http://ichef.bbci.co.uk/news/1024/media/images/77623000/png/_77623460_breaking_image_large-3.png</field>
    <field name="msapplication-TileColor">#bb1919</field>
    <field name="Content-Location">http://www.bbc.co.uk/news/world-europe-33284937#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</field>
    <field name="Content-Encoding">UTF-8</field>
    <field name="twitter:description">One man decapitated and several hurt in suspected Islamist attack on factory near Lyon, French sources say.</field>
    <field name="twitter:site">@BBCNews</field>
    <field name="twitter:image:src">http://ichef.bbci.co.uk/news/560/media/images/77623000/png/_77623460_breaking_image_large-3.png</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[ 'Man decapitated' in French attack 26 June 2015 From the section Europe A man has been beheaded and at least one other person injured in a suspected Islamist attack on a factory near the French city of Lyon. Several small explosive devices were also set off at the Air Products factory in Saint-Quentin-Fallavier, sources said. The alleged attacker is said to have been carrying an Islamist flag, which was found nearby. A man has been arrested, officials say. Interior Minister Bernard Cazeneuve is said to be on his way to the scene. Copyright © 2015 BBC. The BBC is not responsible for the content of external sites. Read about our approach to external linking. ]]></content>
</doc>