The Aspire 3.0 Connectors implement an interface called RepositoryAccessProvider that specifies the minimum required methods to access, fetch, and scan a given Repository. 

The Aspire 3.0 Connector Framework layer provides common control code for

  • Full / incremental crawling
  • Distributed processing
  • Group expansion
  • Schedules
  • Link between the Aspire Admin User Interface and the crawls

by calling the RepositoryAccessProvider methods whenever it needs to access the Repository.

The separation of the Connector Framework from the Connector Implementation makes it straightforward to use the Connector Implementation outside of the Connector Framework (and outside of Aspire altogether).

There are five main tasks that the RepositoryAccessProvider is responsible for:

  1. Initialize the crawl configuration, or SourceInfo, from the user configuration properties
    • Method: newSourceInfo()
  2. Extract the initial or root crawl items
    • Method: processCrawlRoot()
  3. Populate the metadata of the extracted items
    • Method: populate()
  4. Scan a container item
    • Method: scan()
  5. Fetch the content stream for an item
    • Open an input stream of the content of each of the items if available
    • Method: getFetcher()
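
The five tasks can be pictured with a toy provider. The sketch below is not the Aspire interface: `SimpleAccessProvider`, `InMemoryProvider`, `crawlRoots`, `fetch` and the in-memory "repository" are hypothetical names chosen for illustration, and the real RepositoryAccessProvider methods take Aspire types such as SourceInfo and SourceItem.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RapTasksSketch {

  /** Hypothetical, simplified view of the five tasks; NOT the Aspire interface. */
  interface SimpleAccessProvider {
    Map<String, String> newSourceInfo(Map<String, String> userProperties); // task 1
    List<String> crawlRoots(Map<String, String> sourceInfo);               // task 2
    Map<String, String> populate(String itemId);                           // task 3
    List<String> scan(String containerId);                                 // task 4
    InputStream fetch(String itemId);                                      // task 5
  }

  /** Toy provider backed by an in-memory "repository" of paths. */
  static final class InMemoryProvider implements SimpleAccessProvider {
    private final Map<String, List<String>> children = new LinkedHashMap<>();
    private final Map<String, String> content = new LinkedHashMap<>();

    InMemoryProvider() {
      children.put("/", List.of("/docs", "/readme.txt"));
      children.put("/docs", List.of("/docs/a.txt"));
      content.put("/readme.txt", "read me");
      content.put("/docs/a.txt", "document a");
    }

    public Map<String, String> newSourceInfo(Map<String, String> userProperties) {
      Map<String, String> info = new LinkedHashMap<>(userProperties);
      info.putIfAbsent("root", "/"); // default when the user gave no root
      return info;
    }

    public List<String> crawlRoots(Map<String, String> sourceInfo) {
      return List.of(sourceInfo.get("root"));
    }

    public Map<String, String> populate(String itemId) {
      Map<String, String> metadata = new LinkedHashMap<>();
      metadata.put("id", itemId);
      metadata.put("container", String.valueOf(children.containsKey(itemId)));
      return metadata;
    }

    public List<String> scan(String containerId) {
      return children.getOrDefault(containerId, List.of());
    }

    public InputStream fetch(String itemId) {
      String text = content.get(itemId);
      // Containers have no content, so there is no stream to open
      return text == null ? null : new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
  }

  public static void main(String[] args) throws Exception {
    SimpleAccessProvider rap = new InMemoryProvider();
    Map<String, String> info = rap.newSourceInfo(Map.of());
    for (String root : rap.crawlRoots(info)) {
      for (String child : rap.scan(root)) {
        InputStream in = rap.fetch(child);
        String body = in == null ? "(no content)" : new String(in.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(rap.populate(child) + " -> " + body);
      }
    }
  }
}
```

The point of the split is that each task can be called on its own, which is what allows the same implementation to be driven either by the Connector Framework or by a stand-alone program.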

Putting the Steps Together

Putting it all together is the responsibility of either the Connector Framework or, when a connector implementation is used stand-alone, you as its developer.

The following procedure is based on the project:

Step 1 

Create a Java Maven project and import the following dependencies into its pom.xml file:

Pom dependencies

Step 2

Initialize the crawl and configuration objects.

Crawl Initialization
public static void main(String[] args) throws Exception {
    //Get the scan properties from the arguments
    String scanPropertiesFile = args[0];
    //Instantiate a new RepositoryAccessProvider from the connector implementation
    ComponentImpl component = new SP2013RAP();
    RepositoryAccessProvider rap = (RepositoryAccessProvider) component;
    //Create a CrawlControllerImpl, as some connectors depend on it
    StandaloneCrawlController crawlCtrlImpl = new StandaloneCrawlController(rap);
    //Load the content-source.xml configuration into an AspireObject
    AspireObject scanProps = new AspireObject("doc");
    AspireObject scanPropsFile = AspireObject.createFromXML(new File(scanPropertiesFile));
    //Create and initialize a new SourceInfo from the RAP
    SourceInfo info = rap.newSourceInfo(scanPropsFile);
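
To make the "load the XML configuration" step concrete without the Aspire libraries, here is a minimal sketch that reads a flat XML document into name/value pairs with the JDK parser. It is only an illustration: the element names (`connectorSource`, `url`, `indexContainers`) are hypothetical, and the real code relies on AspireObject.createFromXML rather than on this helper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class ScanPropertiesSketch {

  /** Reads the direct child elements of the root as name/value pairs. */
  static Map<String, String> load(String xml) {
    try {
      Element root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
      Map<String, String> props = new LinkedHashMap<>();
      NodeList children = root.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        // Skip whitespace and comments; keep only real elements
        if (child.getNodeType() == Node.ELEMENT_NODE) {
          props.put(child.getNodeName(), child.getTextContent().trim());
        }
      }
      return props;
    } catch (Exception e) {
      throw new IllegalArgumentException("Invalid scan properties XML", e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical property names, for illustration only
    String xml = "<connectorSource><url>http://example.com/site</url>"
        + "<indexContainers>true</indexContainers></connectorSource>";
    System.out.println(load(xml));
  }
}
```

A stand-alone program would typically also check that `args[0]` is present and points at a readable file before parsing it.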

Step 3

Start the crawl.

Start crawl call
    //Execute the crawl
    crawl(rap, info, Main::downloadStream);
}
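
The last argument of the call above is a method reference used as a `Consumer`, so the per-item action can be swapped without touching the crawl code. A minimal sketch of that idea, using plain strings in place of SourceItem (the class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ProcessorSketch {

  /** Stand-in for the crawl: hands every item to the processor. */
  static void crawl(List<String> items, Consumer<String> processor) {
    for (String item : items) {
      processor.accept(item);
    }
  }

  /** A processor in the style of downloadStream: one static method per item. */
  static void printItem(String item) {
    System.out.println("Item: " + item);
  }

  public static void main(String[] args) {
    List<String> items = List.of("a.txt", "b.txt");
    // Same crawl, two different per-item actions
    crawl(items, ProcessorSketch::printItem);
    List<String> seen = new ArrayList<>();
    crawl(items, seen::add);
    System.out.println(seen.size() + " items collected");
  }
}
```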

crawl() method

crawl method
  private static void crawl(RepositoryAccessProvider rap, SourceInfo info, Consumer<SourceItem> processor) throws AspireException {
    //The ScanListener, which maintains the local processQueue and listens for new items to crawl
    Scanner scanner = new Scanner(rap, info);
    //Create the crawlRoot item to initialize the crawl from the RAP
    SourceItem crawlRoot = new SourceItem("crawlRoot");
    rap.processCrawlRoot(crawlRoot, info, scanner);
    //Crawl the local processQueue while it is not empty;
    //when it is empty, the crawl has finished
    do {
      scanner.processQueue(processor);
    } while (!scanner.isQueueEmpty());
  }
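
One pass over the queue is not enough, because scanning a container puts new items into it; the loop therefore repeats until a pass discovers nothing new. The following sketch shows that termination logic on a toy tree. The tree contents and names are hypothetical and no Aspire classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DrainLoopSketch {

  /** Toy repository: container name -> children. Leaves have no entry. */
  static final Map<String, List<String>> TREE = Map.of(
      "root", List.of("folderA", "file1"),
      "folderA", List.of("file2", "file3"));

  /** Crawls from "root", one batch of the queue per pass, and returns the visit order. */
  static List<String> crawl() {
    List<String> queue = new ArrayList<>(List.of("root"));
    List<String> visited = new ArrayList<>();
    do {
      List<String> batch = new ArrayList<>(queue); // items found so far
      queue.clear();
      for (String item : batch) {
        visited.add(item);
        queue.addAll(TREE.getOrDefault(item, List.of())); // "scan" containers
      }
    } while (!queue.isEmpty());
    return visited;
  }

  public static void main(String[] args) {
    System.out.println(crawl()); // prints [root, folderA, file1, file2, file3]
  }
}
```

Each pass handles one level of the tree, so the items come out in breadth-first order.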

downloadStream() method

downloadStream method
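  private static void downloadStream(SourceItem item) {
    System.out.println("Item: " + item.getName());
    try {
      InputStream is = item.getContentStream();
      if (is != null) {
        //Write the content into the "output" folder, closing the file when done
        try (FileOutputStream fos = new FileOutputStream(new File("output/" + item.getName()))) {
          copyStream(is, fos);
        }
      }
    } catch (Exception e) {
      //Report the failure and continue with the next item
      e.printStackTrace();
    }
  }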
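
The `copyStream` helper called above is not shown on this page. A minimal sketch of what such a helper could look like, assuming it simply copies every byte from the input to the output (the `long` return value is an addition for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class CopyStreamSketch {

  /** Copies everything from in to out through a fixed buffer; returns the bytes copied. */
  static long copyStream(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    long total = 0;
    int read;
    while ((read = in.read(buffer)) != -1) {
      out.write(buffer, 0, read);
      total += read;
    }
    out.flush();
    return total;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello repository".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    long copied = copyStream(new ByteArrayInputStream(data), sink);
    System.out.println(copied + " bytes copied");
  }
}
```

On Java 9 and later, `in.transferTo(out)` does the same job without a hand-written loop.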

Step 4

The crawl itself takes place in the Scanner class. Each call to its processQueue() method does the following:

  1. The current items are moved to another queue to be iterated.
  2. For each item in the queue:
    1. Checks if the item is a container
    2. Calls the populate method on the items that need to be processed
    3. Calls the fetcher to process the item
    4. If the item is a container, calls the scan() method with it (this adds more items to the queue)

Scanner class
  static class Scanner implements ScanListener {

    /**
     * The queue that receives all the new items discovered
     */
    ArrayList<SourceItem> queue;
    /**
     * Temporary queue used to iterate over the original queue
     */
    ArrayList<SourceItem> safeQueue;
    private RepositoryAccessProvider rap;
    private SourceInfo info;
    private FetchURL fetcher;

    public Scanner(RepositoryAccessProvider rap, SourceInfo info) throws AspireException {
      this.rap = rap;
      this.info = info;
      queue = new ArrayList<SourceItem>();
      safeQueue = new ArrayList<SourceItem>();
      fetcher = new StandaloneFetchURL(rap);
    }

    public void close() {
    }

    public void addItem(SourceItem item) {
      //This gets called when the RAP scanner adds an item to crawl
      queue.add(item);
    }

    public void addItems(List<SourceItem> items) {
      //This gets called when the RAP scanner adds items to crawl
      queue.addAll(items);
    }

    public void processQueue(Consumer<SourceItem> processor) throws AspireException {
      RepositoryConnection conn = rap.newConnection(info);
      //Move the items from the original queue into the safeQueue
      //And clear the original
      safeQueue.addAll(queue);
      queue.clear();
      for (SourceItem item : safeQueue) {
        boolean container = false;
        if (rap.isContainer(item, conn)) {
          container = true;
        }
        if (info.indexContainers() || !container) {
          rap.populate(item, info, conn);
          //Call the fetcher
          Job job = JobFactory.newInstance(item.generateJobDocument());
          job.put("sourceInfo", info);
          job.put("crawlController", info.getCrawlController());
          fetcher.process(job);
          item.setContentStream((InputStream) job.get("contentStream"));
          //Hand the fetched item to the caller's processor
          processor.accept(item);
        }
        if (container) {
          rap.scan(item, info, conn, this);
        }
      }
      //All the items in the safeQueue have been processed
      safeQueue.clear();
    }

    public boolean isQueueEmpty() {
      return queue.size() + safeQueue.size() == 0;
    }
  }
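
The second queue exists because scan() adds items to the queue while the queue's contents are still being processed, and an ArrayList cannot be modified while it is being iterated. The sketch below demonstrates both the failure and the snapshot approach with plain strings; the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

final class SafeQueueSketch {

  /** Adds to the list while iterating it directly; ArrayList's iterator fails fast. */
  static boolean failsWhenIteratingLiveQueue() {
    List<String> queue = new ArrayList<>(List.of("folder", "file"));
    try {
      for (String item : queue) {
        if (item.equals("folder")) {
          queue.add("child"); // what a scan() callback would do
        }
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  /** Iterates a snapshot so that additions land in the live queue safely. */
  static List<String> addedWhileIteratingSnapshot() {
    List<String> queue = new ArrayList<>(List.of("folder", "file"));
    List<String> safeQueue = new ArrayList<>(queue);
    queue.clear();
    for (String item : safeQueue) {
      if (item.equals("folder")) {
        queue.add("child");
      }
    }
    return queue; // holds only the newly discovered items
  }

  public static void main(String[] args) {
    System.out.println("live iteration fails: " + failsWhenIteratingLiveQueue());
    System.out.println("next batch: " + addedWhileIteratingSnapshot());
  }
}
```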

Step 5

Provide the Fetcher class. FetchURL has to be extended so that its getComponent method returns the correct RAP object.

Fetcher class
  private static class StandaloneFetchURL extends FetchURL {

    public static final String RAP = "rap";
    RepositoryAccessProvider rap;

    public StandaloneFetchURL(RepositoryAccessProvider rap) {
      this.rap = rap;
    }

    public Component getComponent(String path) {
      //Return the RAP when the "rap" component is requested
      if (RAP.equals(path)) {
        return (Component) rap;
      }
      return null;
    }
  }
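
The technique here is to override a lookup method so that a component that normally finds its collaborators by path receives an object handed to it directly. The sketch below illustrates the pattern with hypothetical classes; it assumes, as the description above suggests, that the fetcher asks for its provider through `getComponent("rap")`, and it does not reproduce FetchURL itself.

```java
import java.util.Objects;

final class ComponentLookupSketch {

  /** Hypothetical stage that finds its collaborators by path at run time. */
  static class Fetcher {
    Object getComponent(String path) {
      return null; // outside a container there is nothing to look up
    }

    String fetch(String item) {
      Object provider = getComponent("rap");
      Objects.requireNonNull(provider, "no component registered under 'rap'");
      return provider + " fetched " + item;
    }
  }

  /** Stand-alone variant: answers the lookup with an object it was given. */
  static final class StandaloneFetcher extends Fetcher {
    private final Object rap;

    StandaloneFetcher(Object rap) {
      this.rap = rap;
    }

    @Override
    Object getComponent(String path) {
      return "rap".equals(path) ? rap : null;
    }
  }

  public static void main(String[] args) {
    Fetcher fetcher = new StandaloneFetcher("myProvider");
    System.out.println(fetcher.fetch("item-1")); // prints "myProvider fetched item-1"
  }
}
```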

StandaloneCrawlController class

This class exists only to override the getNoSQLConnection method, which some components call, so that it returns a dummy NoSQLConnection object.

StandaloneCrawlController class
  private static class StandaloneCrawlController extends CrawlControllerImpl {

    RepositoryAccessProvider rap;

    public StandaloneCrawlController(RepositoryAccessProvider rap) {
      this.rap = rap;
    }

    public RepositoryAccessProvider getRAP() {
      return rap;
    }

    public NoSQLConnection getNoSQLConnection(String name, AspireObject properties) {
      //Stand-alone crawls have no database, so hand out a do-nothing connection
      return new EmptyNoSQLConnection();
    }
  }

EmptyNoSQLConnection class

As the name suggests, this is a dummy NoSQLConnection that doesn't really do anything.

public class EmptyNoSQLConnection implements NoSQLConnection {

  public String getDatabase() {
    return null;
  }

  public String getCollection() {
    return null;
  }

  public void add(AspireObject item) throws AspireException {
  }

  public void update(AspireObject item, String id) throws AspireException {
  }

  public void update(AspireObject item, AspireObject filter)
      throws AspireException {
  }

  public void updateAll(AspireObject item, AspireObject filter)
      throws AspireException {
  }

  public void updateOrAdd(AspireObject item, AspireObject filter)
      throws AspireException {
  }

  public boolean delete(String id) throws AspireException {
    return false;
  }

  public boolean delete(AspireObject filter) throws AspireException {
    return false;
  }

  public boolean deleteAll(AspireObject filter) throws AspireException {
    return false;
  }

  public AspireObject getOneAndUpdate(AspireObject filter,
      AspireObject update) throws AspireException {
    return null;
  }

  public AspireObject getOne(AspireObject filter) throws AspireException {
    return null;
  }

  public NoSQLIterable<AspireObject> getAll(AspireObject filter)
      throws AspireException {
    return null;
  }

  public NoSQLIterable<AspireObject> getAll(AspireObject filter, int skip)
      throws AspireException {
    return null;
  }

  public NoSQLIterable<AspireObject> getAll() throws AspireException {
    return null;
  }

  public NoSQLIterable<AspireObject> getAll(int skip)
      throws AspireException {
    return null;
  }

  public long size() throws AspireException {
    return 0;
  }

  public long size(AspireObject filter) throws AspireException {
    return 0;
  }

  public void clear() throws AspireException {
  }

  public void close() throws AspireException {
  }

  public AspireObject getAspireObject(Object obj) throws AspireException {
    return null;
  }

  public void flush() {
  }

  public void setBulkTimeout(long timeout) {
  }

  public void setBulkSize(int size) {
  }

  public void useBulk(boolean useBulk) {
  }

  public AspireObject getOneAndUpdateOrAdd(AspireObject update,
      AspireObject filter) throws AspireException {
    return null;
  }

  public AspireObject getOneAndDelete(AspireObject filter)
      throws AspireException {
    return null;
  }
}
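
A dummy of this kind (every method does nothing and returns false, zero or null) can also be produced with a JDK dynamic proxy instead of being written out by hand. The sketch below shows the idea on a small hypothetical `Store` interface; whether it is a suitable replacement for EmptyNoSQLConnection in a real crawl is an assumption that has not been verified here.

```java
import java.lang.reflect.Proxy;

final class NoOpProxySketch {

  /** Hypothetical store interface, much smaller than NoSQLConnection. */
  interface Store {
    void add(String item);
    boolean delete(String id);
    long size();
    String getOne(String id);
  }

  /** Builds an implementation of any interface whose methods do nothing. */
  @SuppressWarnings("unchecked")
  static <T> T noOp(Class<T> type) {
    return (T) Proxy.newProxyInstance(
        type.getClassLoader(),
        new Class<?>[] {type},
        (proxy, method, methodArgs) -> defaultValue(method.getReturnType()));
  }

  /** Default for each return type: false, zero, or null. */
  static Object defaultValue(Class<?> returnType) {
    if (returnType == boolean.class) return false;
    if (returnType == char.class) return '\0';
    if (returnType == byte.class) return (byte) 0;
    if (returnType == short.class) return (short) 0;
    if (returnType == int.class) return 0;
    if (returnType == long.class) return 0L;
    if (returnType == float.class) return 0f;
    if (returnType == double.class) return 0d;
    return null; // void and all reference types
  }

  public static void main(String[] args) {
    Store store = noOp(Store.class);
    store.add("ignored");
    System.out.println(store.delete("42") + " " + store.size() + " " + store.getOne("42"));
  }
}
```

The hand-written class has the advantage of being explicit and easy to adjust method by method; the proxy keeps working unchanged when the interface gains methods.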

For Legacy connector standalone crawls, see Connector Scanner Stage Test Harness.