Since the 3.1 release, Aspire connectors can crawl in distributed mode automatically. Because all crawl control data is stored in MongoDB, adding more Aspire servers configured to use the same MongoDB is enough for the common connectors to distribute the crawl across them.
Each connector is responsible for talking to the repository and scanning through all of its items, fetching each item's ID and storing it in MongoDB so that the item can be processed later by the same server or by any other server in the cluster.
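The scan-then-process pattern described above can be pictured with a small, purely conceptual sketch. It is not Aspire code: `SharedQueue` is a hypothetical in-memory stand-in for the shared MongoDB store, the lock stands in for an atomic claim operation in the database, and the server and item names are invented for illustration.

```python
import threading


class SharedQueue:
    """Hypothetical in-memory stand-in for the shared MongoDB crawl queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # item ID -> name of the server that claimed it, or None

    def add(self, item_id):
        # Scanning stores only the item ID; processing happens later.
        with self._lock:
            self._owners.setdefault(item_id, None)

    def claim(self, server):
        # Atomically hand one unclaimed item to the calling server.
        with self._lock:
            for item_id, owner in self._owners.items():
                if owner is None:
                    self._owners[item_id] = server
                    return item_id
        return None


def scan(queue, repository):
    """Walk the repository and store every item ID in the shared queue."""
    for item_id in repository:
        queue.add(item_id)


def process_all(queue, server, processed):
    """Keep claiming items until the shared queue has nothing left."""
    while True:
        item_id = queue.claim(server)
        if item_id is None:
            break
        processed.append((server, item_id))


def run_demo():
    queue = SharedQueue()
    scan(queue, [f"doc-{i}" for i in range(20)])
    processed = []
    workers = [
        threading.Thread(target=process_all, args=(queue, name, processed))
        for name in ("server-a", "server-b")
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return processed


if __name__ == "__main__":
    for server, item_id in run_demo():
        print(server, item_id)
```

Because every claim goes through the shared store, each item is processed exactly once regardless of how many servers take part, which is why adding servers that point at the same MongoDB is enough to spread the work.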
To set up an Aspire cluster for distributed processing, follow these steps:
Configure all Aspire servers to use the same MongoDB installation by editing the config/settings.xml file on every server:
<!-- noSql database provider for the 3.1 connector framework -->
<noSQLConnectionProvider connectionsPerHost="10" sslEnabled="false" sslInvalidHostNameAllowed="false">
  <implementation>com.searchtechnologies.aspire:aspire-mongodb-provider</implementation>
  <dropOnClear>false</dropOnClear>
  <servers>mongodb-host:27017</servers>
</noSQLConnectionProvider>
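Since distributed crawling depends on every server pointing at the same MongoDB, it can help to verify the setting across the cluster. The following is a minimal sketch, assuming you have collected the contents of each server's config/settings.xml yourself; the function names and the mapping of server names to file contents are hypothetical helpers, not part of Aspire.

```python
import xml.etree.ElementTree as ET


def mongo_servers(settings_xml: str) -> str:
    """Return the <servers> value of the noSQLConnectionProvider element."""
    root = ET.fromstring(settings_xml)
    # Search the whole tree so the element's position in the file does not matter.
    provider = next(root.iter("noSQLConnectionProvider"), None)
    if provider is None:
        raise ValueError("no <noSQLConnectionProvider> element found")
    servers = provider.findtext("servers")
    if servers is None:
        raise ValueError("no <servers> element inside <noSQLConnectionProvider>")
    return servers.strip()


def distinct_mongo_targets(settings_by_server: dict) -> set:
    """Collect the distinct MongoDB targets configured across all servers."""
    return {mongo_servers(xml_text) for xml_text in settings_by_server.values()}


def cluster_is_consistent(settings_by_server: dict) -> bool:
    """True when every server is configured with the same <servers> value."""
    return len(distinct_mongo_targets(settings_by_server)) == 1
```

For example, load each file with `open(path).read()`, key the result by server name, and pass the dictionary to `cluster_is_consistent`. A `False` result means at least one server would crawl against a different MongoDB.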
Remember to replicate any changes made to the content sources to the rest of the servers; otherwise the changes only take effect if the crawl is started from the server where you made them. If you don't want to replicate the changes manually, use Failover for Aspire using Zookeeper to handle content source configuration replication among the servers.
Controlling distributed processing is simple: if you start a crawl from any of the Aspire servers, the crawl starts on all the servers. The same applies when you pause, stop or resume a crawl.
It is safe to shut down all Aspire servers once the crawl is paused.
If you need to shut down one or more servers for maintenance:
If you need to shut down all the servers in the cluster, you must pause the crawl first.