Introduction

As a content processing framework, Aspire can coordinate work with other Aspire servers to balance resource utilization and provide high availability.

This section describes how Aspire works in a distributed environment and how to configure it.

When configured to work in a cluster, Aspire interacts with two different external systems simultaneously:

NoSQL database (Mongo, Elasticsearch, HBase) for crawl state
- Handles the distribution of documents across the Aspire servers during crawls.
- Handles the information needed for incremental crawls.
Zookeeper
- Configuration synchronization across the cluster
  - Content sources, services, workflow configuration, and other configuration files (Groovy transformations, XSL, etc.).
- Enforces scheduled crawl atomicity (no more than one crawl per content source at any given time)
- Detects when a server in the cluster is no longer available and notifies the rest of the servers so they can resume any outstanding work from that server.

All servers in an Aspire cluster are equal; there is no "master" server.

Configuration

By default the Aspire distributions are configured to work in standalone mode:

  <configAdministration>
    <zookeeper enabled="false" root="/aspire">
      <!-- <externalServer>127.0.0.1:2182,127.0.0.1:2183,127.0.0.1:2181</externalServer> -->
    </zookeeper>
  </configAdministration>

The first step is to enable ZooKeeper and point Aspire at the ZooKeeper cluster. This example assumes a three-server ZooKeeper cluster: zooA.dev.com, zooB.dev.com and zooC.dev.com.

  <configAdministration>
    <zookeeper enabled="true" root="/aspire">
      <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
    </zookeeper>
  </configAdministration>
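
Before starting the servers, it can help to sanity-check the edited fragment. The following is a minimal sketch, not part of Aspire: the function name `read_zookeeper_settings` is hypothetical, and the sample fragment assumes that `<externalServer>` is nested inside a properly closed `<zookeeper>` element. It parses the fragment, reports whether cluster mode is enabled, and validates each `host:port` entry.

```python
import xml.etree.ElementTree as ET

# Sample fragment mirroring the cluster configuration above.
# Assumption: <externalServer> is a child of a closed <zookeeper> element.
SETTINGS_FRAGMENT = """
<configAdministration>
  <zookeeper enabled="true" root="/aspire">
    <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
  </zookeeper>
</configAdministration>
"""


def read_zookeeper_settings(xml_text):
    """Return (enabled, root, servers) from a configAdministration fragment.

    Hypothetical helper for illustration only; it is not an Aspire API.
    """
    admin = ET.fromstring(xml_text)
    zk = admin.find("zookeeper")
    if zk is None:
        raise ValueError("no <zookeeper> element found")

    enabled = zk.get("enabled", "false").strip().lower() == "true"
    root = zk.get("root", "")

    servers = []
    external = zk.find("externalServer")
    if external is not None and external.text:
        for entry in external.text.split(","):
            # Split on the last colon so the host part stays intact.
            host, sep, port = entry.strip().rpartition(":")
            if not sep or not host or not port.isdigit():
                raise ValueError(f"malformed host:port entry: {entry!r}")
            servers.append((host, int(port)))
    return enabled, root, servers


if __name__ == "__main__":
    enabled, root, servers = read_zookeeper_settings(SETTINGS_FRAGMENT)
    print("cluster mode:", enabled, "root:", root)
    for host, port in servers:
        print(f"  {host}:{port}")
```

A malformed fragment (for example, an unclosed `<zookeeper>` tag) makes `ET.fromstring` raise `xml.etree.ElementTree.ParseError`, which surfaces the problem before Aspire reads the file.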
