This section describes in detail each of the components involved in the Connector Framework.

Component's Bundle

The main bundle JAR the framework uses is aspire-connector-framework. This bundle contains all the Stages and components, and also provides the interfaces for the specific connector implementations.

Connector AppBundle

In Aspire, every component needs to be referenced from an AppBundle (an application.xml file), which describes the job execution flow. The Connector Framework has one common AppBundle called app-rap-connector.

This AppBundle is automatically loaded when Aspire detects it needs to load a connector. It contains all the PipelineManagers, Pipelines and references to the Connector Framework Components and Stages from the aspire-connector-framework bundle.

PipelineManagers and Pipelines

Main (PipelineManager)

  • controlPipeline

QueuePipelineManager

  • jobStartEndPipeline
  • crawlEndPipeline

ScanPipelineManager

  • scanControlPipeline1
  • scanControlPipeline2
  • scanErrorPipeline

ScanChildrenPipelineManager

  • scannedItemsPipeline

ProcessPipelineManager

  • crawlStartEndPipeline
  • crawlStartEndErrorPipeline
  • processControlPipeline1
  • processControlPipeline2
  • addUpdatePipeline
  • fetchAndExtractPipeline
  • addUpdateWorkflowPipeline
  • publishWorkflowPipeline
  • deletePipeline
  • deleteWorkflowPipeline
  • errorPipeline
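
As a quick reference, the layout listed above can be written down as a simple lookup table. This is only an illustrative sketch: the names are copied from the list, but the PIPELINES dictionary and the manager_for helper are hypothetical and are not part of Aspire or of app-rap-connector.

```python
# PipelineManager -> pipelines, copied from the list above.
# PIPELINES and manager_for are hypothetical helpers, not Aspire APIs.
PIPELINES = {
    "Main": ["controlPipeline"],
    "QueuePipelineManager": ["jobStartEndPipeline", "crawlEndPipeline"],
    "ScanPipelineManager": [
        "scanControlPipeline1",
        "scanControlPipeline2",
        "scanErrorPipeline",
    ],
    "ScanChildrenPipelineManager": ["scannedItemsPipeline"],
    "ProcessPipelineManager": [
        "crawlStartEndPipeline",
        "crawlStartEndErrorPipeline",
        "processControlPipeline1",
        "processControlPipeline2",
        "addUpdatePipeline",
        "fetchAndExtractPipeline",
        "addUpdateWorkflowPipeline",
        "publishWorkflowPipeline",
        "deletePipeline",
        "deleteWorkflowPipeline",
        "errorPipeline",
    ],
}


def manager_for(pipeline):
    """Return the name of the PipelineManager that owns the pipeline, or None."""
    for manager, pipelines in PIPELINES.items():
        if pipeline in pipelines:
            return manager
    return None
```

For example, manager_for("scannedItemsPipeline") looks up which PipelineManager a job would be running under when it reaches that pipeline.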


Components

CrawlController

The CrawlController is the main entry point for incoming crawl start signals. It also controls the ConnectionPool and manages the NoSQLConnections to Mongo used by the rest of the components. In addition, it handles the distributed crawl start and synchronizes the crawl status with Mongo so that all Aspire servers see the same status.

ScanQueueLoader

It is an instance of the QueueLoader class, a component configured to claim items from the scanQueue collection in Mongo. It marks each item as in-progress ("P") in Mongo (so no other server claims the same item for scanning) and enqueues the items as jobs into the ScanPipelineManager. All the items claimed by this component are containers that should be scanned.

ProcessQueueLoader

It is an instance of the QueueLoader class, a component configured to claim items from the processQueue collection in Mongo. It marks each item as in-progress ("P") in Mongo (so no other server claims the same item for processing) and enqueues the items as jobs into the ProcessPipelineManager. All the items claimed by this component are items that should be considered for workflow processing.
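
The claim-and-enqueue behaviour shared by the two queue loaders can be illustrated with a minimal sketch. It is not the Aspire implementation: InMemoryQueue is a hypothetical in-memory stand-in for a Mongo collection, the "A" (available) status code and the claimedBy field are assumptions, and a plain callable stands in for the target PipelineManager. Only the "P" in-progress mark and the rule that no two servers claim the same item come from the descriptions above.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class InMemoryQueue:
    """Hypothetical stand-in for a Mongo queue collection (not an Aspire API)."""
    items: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def claim_next(self, server_id):
        # Find one available item and mark it in-progress ("P") under a lock,
        # so that a second caller cannot claim the same item.
        with self._lock:
            for item in self.items:
                if item["status"] == "A":  # "A" (available) is an assumed code
                    item["status"] = "P"
                    item["claimedBy"] = server_id  # assumed bookkeeping field
                    return item
        return None


class QueueLoader:
    """Claims items from one queue and enqueues them as jobs for one target."""

    def __init__(self, queue, enqueue_job, server_id):
        self.queue = queue
        self.enqueue_job = enqueue_job  # callable standing in for a PipelineManager
        self.server_id = server_id

    def load_available(self):
        """Claim every available item, enqueue each one, return the count."""
        claimed = 0
        while (item := self.queue.claim_next(self.server_id)) is not None:
            self.enqueue_job(item)
            claimed += 1
        return claimed


# Two servers share one scan queue; each item is claimed by exactly one of them.
scan_queue = InMemoryQueue([
    {"id": "folder-1", "status": "A"},
    {"id": "folder-2", "status": "A"},
])
scan_jobs = []
server_a = QueueLoader(scan_queue, scan_jobs.append, "server-a")
server_b = QueueLoader(scan_queue, scan_jobs.append, "server-b")
server_a.load_available()
server_b.load_available()
```

In this sketch the ScanQueueLoader and the ProcessQueueLoader would be two QueueLoader instances that differ only in the queue they read and the target they enqueue into. Against a real Mongo collection the lock would be replaced by a single atomic find-and-update operation on the item's status.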

ProcessDeletes

CrawlEnd

ScanReleaseController

Scan

FlagContainer

IncludeExclude

CheckSnapshot

GenerateHierarchy

EnqueueScan

AddUpdateSnapshot

EnqueueProcess

ProcessReleaseController

ProcessCrawlRoot

PopulateOrDelete

FetchUrl

EnqueueScan

AddUpdateSnapshot

MarkProcessComplete