Introduction

Aspire, as a content processing framework, can coordinate work with other Aspire servers to balance resource utilization and provide high availability.

This section describes how Aspire works in a distributed environment and how to configure it.

When configured to work in a cluster, Aspire interacts with two different external systems simultaneously:

  1. NoSQL database (Mongo, Elasticsearch, HBase) for crawl state
    • Handles the distribution of documents across the Aspire servers during crawls.
    • Handles incremental crawl information.
  2. ZooKeeper
    • Synchronizes configuration across the cluster
      • Content sources, services, workflow configuration, other config files (Groovy transformations, XSL, etc.).
    • Enforces scheduled crawl atomicity (no more than one crawl per content source at any given time).
    • Detects when a server in the cluster is no longer available, and notifies the rest of the servers so they can resume any outstanding work from that server.


All servers in an Aspire cluster are equal; there is no "master" server.



Prerequisites

  1. Configure a ZooKeeper cluster (version 3.5.5)
    1. ZooKeeper standalone configuration
    2. ZooKeeper cluster configuration (Recommended for Production)

  2. Make sure the NoSQL provider settings point to a database instance (or database cluster) that is accessible to all servers:
    1. MongoDB Provider Settings
    2. HBase Provider Settings
    3. Elasticsearch Provider Settings
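
Before enabling clustering, it can help to confirm that each Aspire server can open a connection to every ZooKeeper node. The following is a minimal sketch and not part of Aspire: it assumes the comma-separated host:port connection string format used in the configuration in this section, and the function names parse_ensemble and reachable are hypothetical. It only checks that a TCP connection can be opened; it does not verify that ZooKeeper itself is healthy.

```python
import socket


def parse_ensemble(connection_string):
    """Split 'hostA:2181,hostB:2181' into a list of (host, port) tuples."""
    servers = []
    for entry in connection_string.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        host, _, port = entry.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Running parse_ensemble on the connection string and passing each resulting pair to reachable, from every Aspire server, shows which ZooKeeper nodes that server can see.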

Configuration

By default, the Aspire distributions are configured to work in standalone mode:

  <configAdministration>
    <zookeeper enabled="false" root="/aspire">
      <!-- <externalServer>127.0.0.1:2182,127.0.0.1:2183,127.0.0.1:2181</externalServer> -->
    </zookeeper>
  </configAdministration>


  1. The first step is to enable ZooKeeper and point Aspire to the ZooKeeper cluster. This example assumes a three-server ZooKeeper cluster: zooA.dev.com, zooB.dev.com and zooC.dev.com.

      <configAdministration>
        <zookeeper enabled="true" root="/aspire">
          <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
        </zookeeper>
      </configAdministration>
  2. If the Aspire servers were started now, they would not interact with each other as a cluster, because they do not share the same cluster ID. Define the cluster ID as "dev" by uncommenting the <clusterId> element:

      <!-- By default all Aspire servers start in their own cluster. To make servers work together, set a common
           cluster id across multiple instances that are connected to a common zooKeeper instance and database
           provider (for example "dev" or "prod") -->
      <clusterId>dev</clusterId>
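
The two steps above amount to three conditions: ZooKeeper is enabled, a connection string is set, and a cluster ID is defined. As a rough illustration, the sketch below checks a settings document for those conditions. It is not an Aspire tool: the function name cluster_problems is hypothetical, and the enclosing <settings> element and the position of <clusterId> in the sample are assumptions, which is why the check searches the whole document instead of relying on a fixed nesting.

```python
import xml.etree.ElementTree as ET


def cluster_problems(settings_xml):
    """Return the reasons why this settings text would not join a cluster."""
    root = ET.fromstring(settings_xml)
    problems = []

    # Search anywhere in the document; the exact nesting is an assumption.
    zookeeper = next(root.iter("zookeeper"), None)
    if zookeeper is None or zookeeper.get("enabled") != "true":
        problems.append("zookeeper is not enabled")

    # A commented-out <externalServer> is skipped by the parser.
    server = next(root.iter("externalServer"), None)
    if server is None or not (server.text or "").strip():
        problems.append("no externalServer connection string")

    cluster_id = next(root.iter("clusterId"), None)
    if cluster_id is None or not (cluster_id.text or "").strip():
        problems.append("no clusterId")

    return problems


# Hypothetical sample combining both steps; the <settings> wrapper is assumed.
SAMPLE = """
<settings>
  <clusterId>dev</clusterId>
  <configAdministration>
    <zookeeper enabled="true" root="/aspire">
      <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
    </zookeeper>
  </configAdministration>
</settings>
"""
```

Calling cluster_problems(SAMPLE) returns an empty list, while the default standalone configuration yields one entry for each missing condition. Running the same check on every server is a quick way to spot an instance that was left with a different or missing cluster ID.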