Configuration
This section lists all parameters available for configuring the Heritrix Scanner component.
General Scanner Component Configuration
Basic Scanner Configuration
Element | Type | Default | Description |
---|---|---|---|
snapshotDir | String | snapshots | The directory for snapshot files. |
numOfSnapshotBackups | int | 2 | The number of snapshots to keep after processing. |
waitForSubJobsTimeout | long | 600000 (=10 mins) | Scanner timeout while waiting for published jobs to complete. |
maxOutstandingTimeStatistics | long | 1m (1 minute) | The maximum amount of time to wait before updating the statistics file. Whichever occurs first between this property and maxOutstandingUpdatesStatistics triggers an update to the statistics file. |
maxOutstandingUpdatesStatistics | long | 1000 | The maximum number of files to process before updating the statistics file. Whichever occurs first between this property and maxOutstandingTimeStatistics triggers an update to the statistics file. |
usesDomain | boolean | true | Indicates if the group expansion request will use a domain\user format (useful for connectors that do not support a domain in the group expander). |
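The general scanner properties above are set as child elements of the scanner component definition. A minimal sketch (element names from the table above; all values illustrative, showing the defaults):

```xml
<component name="Scanner" subType="default" factoryName="aspire-heritrix-connector">
  <!-- directory for snapshot files -->
  <snapshotDir>snapshots</snapshotDir>
  <!-- number of snapshots to keep after processing -->
  <numOfSnapshotBackups>2</numOfSnapshotBackups>
  <!-- timeout (ms) while waiting for published jobs to complete -->
  <waitForSubJobsTimeout>600000</waitForSubJobsTimeout>
  <!-- statistics file update triggers: whichever occurs first -->
  <maxOutstandingTimeStatistics>1m</maxOutstandingTimeStatistics>
  <maxOutstandingUpdatesStatistics>1000</maxOutstandingUpdatesStatistics>
  <!-- use domain\user format for group expansion requests -->
  <usesDomain>true</usesDomain>
</component>
```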
Branch Handler Configuration
This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.
Element | Type | Description |
---|---|---|
branches/branch/@event | string | The event to configure - onAdd, onDelete or onUpdate. |
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative. |
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager. |
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details). |
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing). |
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create. |
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached. |
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler. |
Heritrix Specific Configuration
Element | Type | Default | Description |
---|---|---|---|
jobsFolder | string | ${app.data.dir}/heritrixJobs | Directory where heritrix engine will find jobs to execute. |
jdbmDir | string | ${app.data.dir}/incremental | Directory path where the Heritrix Scanner will store incremental crawl data. The scanner creates one database folder per source being crawled. |
checkpointIntervalMinutes | int | 15 | Interval in minutes between checkpoints made by the Heritrix engine. |
branches/branch/@event | string | none | Accepts the same events as all scanner branches, plus two more: processUncrawled and checkForDeletion, which receive jobs when a crawl completes and there are documents that must be checked for deletion. If checkNotCrawlableContent is set in the input job, old uncrawled documents are sent to the processUncrawled branch so you can verify whether they are still accessible; if that flag is not set, possibly deleted documents are branched to checkForDeletion. For an example implementation, see the application.xml inside the Heritrix Application Bundle. |
Configuration Example
<component name="Scanner" subType="default" factoryName="aspire-heritrix-connector">
  <debug>true</debug>
  <jdbmDir>C:\heritrixDB</jdbmDir>
  <jobsFolder>${app.data.dir}/heritrixJobs</jobsFolder>
  <checkpointIntervalMinutes>15</checkpointIntervalMinutes>
  <branches>
    <branch event="onAdd" pipelineManager="../ProcessPipelineManager" pipeline="processPagePipeline"
            batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
    <branch event="onUpdate" pipelineManager="../ProcessPipelineManager" pipeline="processPagePipeline"
            batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
    <branch event="onDelete" pipelineManager="../ProcessPipelineManager" pipeline="processDeletePipeline"
            batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
    <branch event="processUncrawled" pipelineManager="../ProcessPipelineManager" pipeline="processUncrawledPipeline"
            batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
    <branch event="checkForDeletion" pipelineManager="../ProcessPipelineManager" pipeline="checkDeleteLimitsPipeline"
            batching="true" batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
  </branches>
</component>
Source Configuration
Scanner Control Configuration
The following table describes the attributes that the AspireObject of the incoming scanner job requires to correctly execute and control the flow of a scan process.
Element | Type | Options | Description |
---|---|---|---|
@action | string | start, stop, pause, resume, abort | Control command to tell the scanner which operation to perform. Use start option to launch a new crawl. |
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner whether to run a full or an incremental crawl. |
@normalizedCSName | string | | Unique identifier name for the content source to be crawled. |
displayName | string | | Display (friendly) name for the content source to be crawled. |
Header Example
<doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0"
     jobNumber="0" normalizedCSName="FeedOne_Connector" scheduleId="0"
     scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
  ...
  <displayName>testSource</displayName>
  ...
</doc>
All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming job.
Property | Type | Default | Description |
---|---|---|---|
defaultConfigFile | Boolean | true | Specifies which Heritrix job configuration to use for the source. Standard configuration uses a default configuration file with some user-specific parameters (see the following properties); custom configuration uses a user-supplied configuration file. |
url | string | none | A list of seed URLs, one per line, from which to start crawling. |
crawlScope | string | All | Selects the crawl scope for the job: All, Stay within Domain or Stay within Host. |
maxHops | int | 3 | Specifies the number of allowed hops to crawl. |
millisecondsPerRequest | int | 3000 | The number of milliseconds to wait between each request made by the Heritrix Crawl Engine during the crawl. |
seedsRetry/@maxRetries | int | 5 | Number of retries for failed seeds. |
seedsRetry/@retryDelay | int | 20 | Time in seconds to wait between retries for failed seeds. |
crawlPatterns/accept/@pattern | regex | none | Optional. A regular expression pattern to evaluate URLs against; if the URL matches the pattern, the URL is accepted by the crawler. |
crawlPatterns/reject/@pattern | regex | none | Optional. A regular expression pattern to evaluate URLs against; if the URL matches the pattern, the URL is rejected by the crawler. |
configFileLocation | string | none | Location of a custom Heritrix job configuration file (crawler-beans.cxml). This file requires the AspireHeritrixProcessor to be configured in the disposition chain. |
cleanupRegex | string | none | Optional. Regular expression used to clean the content of a web page by removing all matches of the regex from the content before it reaches the Extract Text stage. It can be used to exclude dynamic content from the index. |
defaultIncrementalIndexing | boolean | false | Determines whether custom values are used for incremental indexing: daysToDelete, maxFailuresToDelete, checkNotCrawlableContent and uncrawledAccessDelay. |
daysToDelete | integer | 2 | Number of days to wait before deleting an uncrawled/not accessible URL. |
maxFailuresToDelete | integer | 5 | Number of incremental iterations to wait before deleting an uncrawled/not accessible URL. |
checkNotCrawlableContent | boolean | false | Determines whether the Heritrix Scanner should verify URLs that are no longer reachable from other URLs (for example, when a referring site was deleted). Otherwise, such URLs are marked as failed (and eventually deleted). The first time a URL is detected as not crawlable but still available, the scanner sends an UPDATE action for it; when it becomes crawlable again, another UPDATE action is sent. |
uncrawledAccessDelay | integer | 2000 | Time in milliseconds to wait between checks (for old and failed URLs) from the same host. |
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added. |
fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added. |
Scanner Configuration Example
<doc action="start" actionProperties="full" normalizedCSName="ST_Web_Site">
  <connectorSource>
    <defaultConfigFile>true</defaultConfigFile>
    <url>http://www.searchtechnologies.com</url>
    <crawlScope>all</crawlScope>
    <maxHops>3</maxHops>
    <seedsRetry maxRetries="5" retryDelay="20"/>
    <millisecondsPerRequest>3000</millisecondsPerRequest>
    <crawlPatterns>
      <accept pattern="\.html$"/>
      <reject pattern="\.js$"/>
      <reject pattern="\.css$"/>
    </crawlPatterns>
    <defaultIncrementalIndexing>true</defaultIncrementalIndexing>
    <fetchDelay>500</fetchDelay>
    <daysToDelete>2</daysToDelete>
    <maxFailuresToDelete>5</maxFailuresToDelete>
    <checkNotCrawlableContent>true</checkNotCrawlableContent>
    <uncrawledAccessDelay>2000</uncrawledAccessDelay>
    <cleanupRegex><!--googleoff: all-->[\s\S]*<!--googleon: all--></cleanupRegex>
    <fileNamePatterns>
      <include pattern=".*"/>
      <exclude pattern=".*robots.txt.*"/>
    </fileNamePatterns>
  </connectorSource>
</doc>
or using a custom crawler-beans file:
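A sketch of such a job, assuming a custom configuration file is used instead of the standard one (the configFileLocation path is hypothetical; the element names come from the table above):

```xml
<doc action="start" actionProperties="full" normalizedCSName="ST_Web_Site">
  <connectorSource>
    <!-- false selects the custom (user-supplied) Heritrix job configuration -->
    <defaultConfigFile>false</defaultConfigFile>
    <!-- hypothetical path to a custom crawler-beans.cxml; it must include the AspireHeritrixProcessor -->
    <configFileLocation>${app.data.dir}/config/crawler-beans.cxml</configFileLocation>
    <defaultIncrementalIndexing>false</defaultIncrementalIndexing>
    <fileNamePatterns>
      <include pattern=".*"/>
      <exclude pattern=".*robots.txt.*"/>
    </fileNamePatterns>
  </connectorSource>
</doc>
```

With a custom file, the seed URLs, crawl scope and politeness settings are taken from the crawler-beans file itself rather than from the connectorSource elements.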
Note: To launch a crawl, the job should be sent (processed/enqueued) to the "/FileSystemConnector/Main" pipeline.
Output
<doc>
  <docType>item</docType>
  <url>http://www.searchtechnologies.com/</url>
  <id>http://www.searchtechnologies.com/</id>
  <fetchUrl>http://www.searchtechnologies.com/</fetchUrl>
  <displayUrl>http://www.searchtechnologies.com/</displayUrl>
  <snapshotUrl>001 http://www.searchtechnologies.com/</snapshotUrl>
  <sourceType>heritrix</sourceType>
  <sourceName>ST_Web_Site</sourceName>
  <connectorSpecific type="heritrix">
    <field name="md5">IFAUKQSBGFCUCNJVGMYTOQ2FHA4EEMRWIJDDCMJWII4TQM2CGIZA</field>
    <field name="xslt">false</field>
    <field name="discoveredBy"/>
    <field name="pathFromSeed"/>
  </connectorSpecific>
  <connectorSource>
    <defaultConfigFile>true</defaultConfigFile>
    <url>http://www.searchtechnologies.com</url>
    <crawlScope>all</crawlScope>
    <maxHops>3</maxHops>
    <seedsRetry maxRetries="5" retryDelay="20"/>
    <millisecondsPerRequest>3000</millisecondsPerRequest>
    <crawlPatterns/>
    <defaultIncrementalIndexing>false</defaultIncrementalIndexing>
    <cleanupRegex><!--googleoff: all-->[\s\S]*<!--googleon: all--></cleanupRegex>
    <fileNamePatterns>
      <include pattern=".*"/>
      <exclude pattern=".*robots.txt.*"/>
    </fileNamePatterns>
    <displayName>ST Web Site</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="ADDFF324E6D09222031F87DA77854D50" level="1" name="ST_Web_Site" url="http://www.searchtechnologies.com/"/>
  </hierarchy>
</doc>
Heritrix Configuration File
Standard Configuration
- Sets the Seed URLs to the TextSeedModule bean.
- Uses the following decide rules to configure the crawl scope (in this order):
- RejectDecideRule
- REJECT
- SurtPrefixedDecideRule
- ACCEPT
- MatchesListRegexDecideRule
- ACCEPT all URLs that match a regex in the list of accept patterns configured by the user.
- FetchStatusMatchesRegexDecideRule
- ACCEPT all URLs with a fetch status in the 200-300 range.
- FetchStatusMatchesRegexDecideRule
- REJECT all URLs with a fetch status in the 400-500 range.
- TooManyHopsDecideRule
- REJECT all URLs beyond the maximum number of hops defined by the user.
- TransclusionDecideRule
- ACCEPT
- NotOnDomainsDecideRule/NotOnHostsDecideRule
- Depending on the user's choice:
- All (NotOnDomainsDecideRule -> ACCEPT)
- Stay within Domain (NotOnDomainsDecideRule -> REJECT)
- Stay within Host (NotOnHostsDecideRule -> REJECT)
- SurtPrefixedDecideRule
- REJECT those configured in the negative-surts.dump file (initially empty).
- MatchesListRegexDecideRule
- REJECT all URLs that match a regex in the list of reject patterns configured by the user.
- PathologicalPathDecideRule
- REJECT
- TooManyPathSegmentsDecideRule
- REJECT
- PrerequisiteAcceptDecideRule
- ACCEPT (robots.txt for example)
- SchemeNotInSetDecideRule
- REJECT
- RejectDecideRule
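In a crawler-beans.cxml file, these rules appear as Spring beans inside the scope DecideRuleSequence. A minimal sketch showing a few of the rules listed above (class names from Heritrix's org.archive.modules.deciderules package; the exact property values and the full rule list are illustrative, not the connector's actual configuration):

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- start by rejecting everything; later rules accept back in -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
      <!-- accept URLs under the SURT prefixes derived from the seeds -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"/>
      <!-- reject URLs beyond the configured hop count (illustrative value) -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="3"/>
      </bean>
      <!-- accept embedded/transcluded resources such as images and CSS -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule"/>
      <!-- reject pathological repeating path segments -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule"/>
      <!-- always accept prerequisites such as robots.txt -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule"/>
    </list>
  </property>
</bean>
```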
- Uses the AspireHeritrixProcessor on the disposition chain:
<bean id="aspireProcessor" class="com.searchtechnologies.aspire.components.heritrixconnector.AspireHeritrixProcessor"/>
Custom Configuration
The custom configuration file can be configured to use any Heritrix feature available for the standalone version, but instead of using the WARCWriterProcessor as the first step on the DispositionChain, it requires the AspireHeritrixProcessor:
<bean id="aspireProcessor" class="com.searchtechnologies.aspire.components.heritrixconnector.AspireHeritrixProcessor"/>
The following digest properties must also be configured, since they are used for incremental indexing:
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="digestContent" value="true"/>
  <property name="digestAlgorithm" value="md5"/>
</bean>
Example configuration file for Aspire Heritrix Connector: File:Crawler-beans.xml.
The Aspire Heritrix Connector uses a custom Heritrix engine that can handle NTLM authentication. For details on how to configure NTLM authentication, see Using a Custom Heritrix Configuration File.
The custom Heritrix engine can also (if desired) apply XSL transformations to extract links from XSLT-generated HTML, performing data extraction from the generated HTML rather than the original XML. For details on how to enable XSLT, see Using a Custom Heritrix Configuration File.