Introduction

As a content processing framework, Aspire can coordinate work with other Aspire servers to balance resource utilization and provide high availability.

This section describes how Aspire works in a distributed environment and how to configure it.

When configured to work in a cluster, Aspire interacts with two different external systems simultaneously:

  1. NoSQL database (Mongo, Elasticsearch, HBase) for crawl state
    • Handles the distribution of documents across the Aspire servers during crawls.
    • Handles incremental information
  2. Zookeeper
    • Configuration synchronization across the cluster
      • Content sources, services, workflow configuration, other config files (Groovy transformations, XSL, etc.).
    • Enforces scheduled crawl atomicity (no more than one crawl per content source at any given time)
    • Detects when a server in the cluster is no longer available and notifies the rest of the servers so they can resume any outstanding work from that dead server.

All servers are equal in an Aspire cluster; there is no "master" server.

Prerequisites

  1. Configure a ZooKeeper cluster (version 3.5.5)
    1. ZooKeeper standalone configuration
    2. ZooKeeper cluster configuration (Recommended for Production)

  2. Make sure the NoSQL provider settings point to a database instance (or database cluster) that is accessible from all servers:
    1. MongoDB Provider Settings
    2. HBase Provider Settings
    3. Elasticsearch Provider Settings
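
Before enabling cluster mode, it can help to confirm that every Aspire server can open a TCP connection to each ZooKeeper node. The following is a minimal sketch, not part of the Aspire distribution: the function names are hypothetical, the host names are the example ones used on this page, and it assumes bash and the GNU `timeout` command are available.

```shell
#!/usr/bin/env bash
# Sketch: check that every ZooKeeper node in a connect string accepts TCP connections.
# Function names are illustrative; they are not Aspire or ZooKeeper tools.

# Split "hostA:2181,hostB:2181" into one host:port entry per line.
split_connect_string() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Return 0 if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp redirection and the GNU timeout command (assumptions).
can_connect() {
  local host="${1%%:*}" port="${1##*:}"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print OK or FAIL for every node in the connect string.
check_zookeeper() {
  split_connect_string "$1" | while IFS= read -r entry; do
    if can_connect "$entry"; then
      echo "OK   $entry"
    else
      echo "FAIL $entry"
    fi
  done
}

# Example (run from each Aspire server, with your own hosts and ports):
# check_zookeeper "zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181"
```

An open port only shows that something is listening there. It does not prove that the ZooKeeper ensemble is healthy.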

Configuration

By default the Aspire distributions are configured to work in standalone mode:

Code Block
languagexml
  <configAdministration>
    <zookeeper enabled="false" root="/aspire">
      <!-- <externalServer>127.0.0.1:2181,127.0.0.1:2181,127.0.0.1:2181</externalServer> -->
    </zookeeper>
  </configAdministration>

To configure the Aspire cluster, modify config/settings.xml on each Aspire server:

  1. Enable ZooKeeper and point to the ZooKeeper cluster. (This example assumes a three-server ZooKeeper cluster: zooA.dev.com, zooB.dev.com and zooC.dev.com.)

    Code Block
    languagexml
      <configAdministration>
        <zookeeper enabled="true" root="/aspire">
          <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
        </zookeeper>
      </configAdministration>
  2. If the Aspire servers were started now, they would not interact with each other as a cluster, because they do not share the same cluster ID. Define the cluster ID as "dev" by uncommenting the <clusterId> element:

    Code Block
    languagexml
      <!-- By default all Aspire servers start in their own cluster. To make servers work together, set a common
           cluster id across multiple instances that are connected to a common zooKeeper instance and database
           provider (for example "dev" or "prod") -->
      <clusterId>dev</clusterId>
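
Because every server must carry the same ZooKeeper settings and cluster ID, a quick way to compare servers is to print those values from each settings file. This is a rough sketch under stated assumptions: the function names are hypothetical, and the matching is line-based, so it also picks up an element that is still commented out on a single line.

```shell
#!/usr/bin/env bash
# Sketch: print the cluster-related values from an Aspire settings file.
# Naive line-based matching; a <clusterId> that is commented out on one line is matched too.

# Print the text of the first <clusterId> element found in the file.
cluster_id_of() {
  sed -n 's:.*<clusterId>\([^<]*\)</clusterId>.*:\1:p' "$1" | head -n 1
}

# Print the value of the enabled attribute on the <zookeeper> element.
zookeeper_enabled_of() {
  sed -n 's:.*<zookeeper[^>]*enabled="\([^"]*\)".*:\1:p' "$1" | head -n 1
}

# Print both values on one line so the output of several servers can be compared.
summarize_settings() {
  echo "clusterId=$(cluster_id_of "$1") zookeeper.enabled=$(zookeeper_enabled_of "$1")"
}

# Example (run from the Aspire home directory on each server):
# summarize_settings config/settings.xml
```

Each server in the cluster should report the same line. A differing or empty clusterId suggests that the server would start in its own cluster.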

Starting the cluster

  1. Copy the license file to one of the Aspire servers (under config/license)
  2. From that server, execute:

    Code Block
    languagebash
    $ bin/pushLicense.sh
    Settings File: /aspire/config/settings.xml
    License File: /aspire/config/license/AspireLicense.lic
    License copied to /aspire/dev/license/AspireLicense.lic

    This will upload the license into the ZooKeeper cluster, so all new servers connecting to it will automatically download it.

  3. Execute the startup script on all servers (aspire.bat or aspire.sh)
  4. Browse to http://<aspireserver>:50505/aspire and verify that the health is "green"



  5. Once all servers are up, choose any server and browse to http://<aspireserver>:50505/aspire/admin/ui/files/#/server
    1. Verify you can see all your servers



  6. (Optional verification) Add an Aspider Web Crawler content source and point it to any website you like
    1. Click on save
    2. Wait until the content source is loaded
    3. Verify you can see the same content source in the other servers.

From now on, any configuration done on one server will be synchronized into the others.
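
The reachability part of the steps above can be scripted. The sketch below only checks that each server answers HTTP on the admin port. It does not read the health indicator shown in the UI, which still needs to be verified in the browser. The host names and function names are hypothetical, and it assumes `curl` is installed.

```shell
#!/usr/bin/env bash
# Sketch: confirm that each Aspire server answers on the admin port.
# This checks HTTP reachability only, not the "green" health status.

# Build the admin URL for one server (port 50505 and the /aspire path are taken from this page).
admin_url() {
  printf 'http://%s:50505/aspire\n' "$1"
}

# Print the HTTP status code returned by a URL; curl prints 000 if the connection fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Print one line per server: host name followed by the status code.
check_servers() {
  local host
  for host in "$@"; do
    echo "$host $(http_status "$(admin_url "$host")")"
  done
}

# Example with hypothetical host names:
# check_servers aspire1.dev.com aspire2.dev.com aspire3.dev.com
```

Any status other than 000 means the server responded. Depending on how the installation handles authentication, that response may be a redirect or a login challenge rather than 200.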

FAQ

I just connected my distribution to ZooKeeper for the first time, and all my content sources and configuration are gone

When starting in cluster mode, Aspire pulls its configuration from ZooKeeper and discards any local configuration. If you were using Aspire in standalone mode, all your configuration was stored locally in the config folder of your distribution. When you connected Aspire to ZooKeeper for the first time, it discarded that local configuration in favor of the configuration in ZooKeeper (which was empty).

But don't worry: every time Aspire starts, it creates a backup of the system configuration and stores it under config/backups. Once you have started your distribution with ZooKeeper, you can import the backup to restore your system configuration.

                                      


How do I upload a new license to my cluster?


If you need to upload the license again, use the -force argument like this:

Code Block
languagebash
$ bin/pushLicense.sh -force

Then, to force all servers to download the new license, delete all files under config/license on all servers.
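
With several servers, clearing config/license by hand gets repetitive. The sketch below prints the cleanup command for each server so it can be reviewed before being run. It assumes SSH access to every server and an install path of /aspire (the path shown in the pushLicense.sh output earlier on this page). The host names and the ASPIRE_HOME variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print (not execute) the command that clears the cached license on each server.
# ASPIRE_HOME is an illustrative variable, defaulting to /aspire; adjust it to your install path.

# Print the ssh command that removes all files under config/license on one host.
clear_license_cmd() {
  local home="${ASPIRE_HOME:-/aspire}"
  printf 'ssh %s "rm -f %s/config/license/*"\n' "$1" "$home"
}

# Dry run: list the commands for every server in the cluster (hypothetical host names).
for host in aspire1.dev.com aspire2.dev.com aspire3.dev.com; do
  clear_license_cmd "$host"
done
```

Once the printed commands look right, they can be pasted into a terminal or piped to `sh`.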