Introduction

Aspire is a content processing framework that can coordinate work with other Aspire servers to balance resource utilization and provide high availability.

This section describes how Aspire works in a distributed environment and how to configure it.

When configured to work in a cluster, Aspire interacts with two different external systems simultaneously:

  1. NoSQL database (MongoDB, Elasticsearch, or HBase) for crawl state
    • Handles the distribution of documents across the Aspire servers during crawls.
    • Handles incremental crawl information.
  2. Zookeeper
    • Configuration synchronization across the cluster
      • Content sources, services, workflow configuration, other config files (groovy transformations, xsl, etc).
    • Enforces scheduled crawl atomicity (no more than one crawl per content source at any given time)
    • Detects when a server in the cluster is no longer available, and notifies the rest of the servers so they can resume any outstanding work from that server.

All servers in an Aspire cluster are equal; there is no "master" server.
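The two coordination guarantees listed above (at most one crawl per content source, and resuming work when a server disappears) can be illustrated with a toy in-memory model. This is only a sketch: the class `ToyCoordinator`, its method names, and the server and content source names are hypothetical, and it does not reflect how Aspire or ZooKeeper implement these guarantees internally.

```python
class ToyCoordinator:
    """In-memory stand-in for the coordination role described above.

    Hypothetical illustration only; not Aspire's or ZooKeeper's implementation.
    """

    def __init__(self):
        self.live_servers = set()
        self.crawl_owner = {}  # content source -> server currently crawling it

    def join(self, server):
        """Register a server as a live member of the cluster."""
        self.live_servers.add(server)

    def try_start_crawl(self, server, content_source):
        """Grant the crawl only if the server is live and nobody else holds it."""
        if server not in self.live_servers:
            return False
        if content_source in self.crawl_owner:
            return False
        self.crawl_owner[content_source] = server
        return True

    def server_lost(self, server):
        """Drop a lost server and return the content sources it was working on,
        so the remaining servers know which work to resume."""
        self.live_servers.discard(server)
        orphaned = sorted(cs for cs, owner in self.crawl_owner.items() if owner == server)
        for cs in orphaned:
            del self.crawl_owner[cs]
        return orphaned


zk = ToyCoordinator()
zk.join("aspire1")
zk.join("aspire2")
print(zk.try_start_crawl("aspire1", "web-source"))  # True
print(zk.try_start_crawl("aspire2", "web-source"))  # False, already being crawled
print(zk.server_lost("aspire1"))                    # ['web-source']
print(zk.try_start_crawl("aspire2", "web-source"))  # True, another server takes over
```

Because every server talks to the same coordinator, none of them needs to act as a master; any server can request a crawl, and any server can pick up orphaned work.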






Prerequisites

  1. Configure a ZooKeeper cluster (version 3.5.5)
    1. ZooKeeper standalone configuration
    2. ZooKeeper cluster configuration (Recommended for Production)

  2. Make sure the NoSQL provider settings point to a database instance (or database cluster) that is accessible from all servers:
    1. MongoDB Provider Settings
    2. HBase Provider Settings
    3. Elasticsearch Provider Settings
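Before configuring Aspire, it can help to confirm that every ZooKeeper node is reachable from each Aspire server. The following is a minimal sketch using only the Python standard library. It sends ZooKeeper's `srvr` four-letter command, which is the only one whitelisted by default in ZooKeeper 3.5.x (others require `4lw.commands.whitelist`). The function names are hypothetical helpers, and the host names in the usage note are the example hosts used later on this page.

```python
import socket


def parse_connect_string(connect_string, default_port=2181):
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    servers = []
    for entry in connect_string.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        servers.append((host, int(port) if sep else default_port))
    return servers


def zookeeper_srvr(host, port, timeout=3.0):
    """Send the 'srvr' command; return the raw reply, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # ZooKeeper closes the connection after replying
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        return None


def zookeeper_mode(reply):
    """Extract the 'Mode:' value (leader, follower, standalone) from a srvr reply."""
    for line in (reply or "").splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

To use it, loop over `parse_connect_string("zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181")`, call `zookeeper_srvr(host, port)` for each pair, and treat a `None` reply as an unreachable node. In a healthy three-node ensemble, `zookeeper_mode` should report one leader and two followers.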

Configuration

By default, the Aspire distributions are configured to work in standalone mode:

Code Block
languagexml
  <configAdministration>
    <zookeeper enabled="false" root="/aspire">
      <!-- <externalServer>127.0.0.1:2182,127.0.0.1:2183,127.0.0.1:2181</externalServer> -->
    </zookeeper>
  </configAdministration>


  1. The first step is to enable ZooKeeper and point Aspire to the ZooKeeper cluster. (This example assumes a three-server ZooKeeper cluster: zooA.dev.com, zooB.dev.com and zooC.dev.com.)

    Code Block
    languagexml
      <configAdministration>
        <zookeeper enabled="true" root="/aspire">
          <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
        </zookeeper>
      </configAdministration>
  2. If we started the Aspire servers now, they would not interact with each other as a cluster, because they do not share the same cluster ID. Define the cluster ID as "dev" by uncommenting the <clusterId> field:

    Code Block
    languagexml
      <!-- By default all Aspire servers start in their own cluster. To make servers work together, set a common
           cluster id across multiple instances that are connected to a common zooKeeper instance and database
           provider (for example "dev" or "prod") -->
      <clusterId>dev</clusterId>
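The two configuration steps above are easy to get half right (for example, enabling ZooKeeper but leaving the server list commented out), so a quick sanity check of the settings can be useful. The sketch below uses Python's standard `xml.etree.ElementTree`. The function name `check_cluster_settings`, the `<settings>` wrapper element, and the placement of `<clusterId>` in the sample are assumptions made for illustration; the check searches the whole fragment, so it does not depend on where those elements sit in the real file.

```python
import xml.etree.ElementTree as ET


def check_cluster_settings(xml_text):
    """Return a list of problems; an empty list means the fragment looks cluster-ready."""
    root = ET.fromstring(xml_text)  # XML comments are dropped by the default parser
    zk = next(root.iter("zookeeper"), None)
    if zk is None:
        return ["no <zookeeper> element found"]

    problems = []
    if zk.get("enabled", "false").strip().lower() != "true":
        problems.append('zookeeper enabled is not "true"')

    servers = next(root.iter("externalServer"), None)
    if servers is None or not (servers.text or "").strip():
        problems.append("no <externalServer> list (is it still commented out?)")

    cluster = next(root.iter("clusterId"), None)
    if cluster is None or not (cluster.text or "").strip():
        problems.append("no <clusterId> set; the server would start in its own cluster")
    return problems


# Hypothetical sample; the wrapper element and clusterId position are assumed.
SAMPLE = """
<settings>
  <configAdministration>
    <zookeeper enabled="true" root="/aspire">
      <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
    </zookeeper>
  </configAdministration>
  <clusterId>dev</clusterId>
</settings>
"""

print(check_cluster_settings(SAMPLE))  # []
```

Running the same check against the default standalone fragment reports three problems: ZooKeeper disabled, no server list, and no cluster ID.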

Starting the cluster

  1. Copy the license file to one of the Aspire servers (under config/license).
  2. From that server, execute:

    Code Block
    languagebash
    $ pushLicense.sh
    2019-06-03T18:10:35Z INFO [BOOTLOADER]: Connection state is CONNECTED
    2019-06-03T18:10:36Z INFO [BOOTLOADER]: License copied to /aspire/dev/license/AspireLicense.lic
    Info

    Any new Aspire server that connects to this ZooKeeper cluster with the same cluster ID will now use the same license, so you don't need to copy it to each server.

  3. Execute the startup script on all servers (aspire.bat or aspire.sh)
  4. Browse to http://<aspireserver>:50505/aspire and verify that the health status is "green".
  5. Once all servers are up, choose any server and browse to http://<aspireserver>:50505/aspire/admin/ui/files/#/server
    1. Verify you can see all your servers
  6. (Optional verification) Add an Aspider Web Crawler content source and point it to any website you like.
    1. Click Save.
    2. Wait until the content source is loaded.
    3. Verify that you can see the same content source on the other servers.

From now on, any configuration change made on one server will be synchronized to the others.
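With several servers, opening each admin page by hand gets tedious. The following sketch, using only the Python standard library, checks that the Aspire URL from the steps above answers on every server. It only tests that the page returns HTTP 200; it does not read the "green" health indicator, whose format is not described on this page. The function names and the server names in the usage note are hypothetical.

```python
import urllib.error
import urllib.request

ASPIRE_PORT = 50505  # port used in the URLs on this page


def aspire_url(server, path="/aspire"):
    """Build the Aspire URL for a server host name."""
    return f"http://{server}:{ASPIRE_PORT}{path}"


def is_reachable(server, timeout=5.0):
    """True if the Aspire page on this server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(aspire_url(server), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def unreachable_servers(servers, probe=is_reachable):
    """Return the servers for which the probe fails, preserving input order."""
    return [server for server in servers if not probe(server)]
```

For example, `unreachable_servers(["aspire1.dev.com", "aspire2.dev.com"])` returns an empty list when both servers respond. The `probe` parameter lets you swap in a different check, such as one that inspects the health page once its format is known.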