Introduction

Aspire, as a content processing framework, can coordinate work with other Aspire servers to balance resource utilization and provide high availability.

This section describes how Aspire works in a distributed environment and how to configure it.

When configured to work in a cluster, Aspire interacts with two different external systems simultaneously:

  1. NoSQL database (MongoDB, Elasticsearch, or HBase) for crawl state
    • Handles the distribution of documents across the Aspire servers during crawls.
    • Stores incremental crawl information.
  2. Zookeeper
    • Configuration synchronization across the cluster
      • Content sources, services, workflow configuration, other config files (Groovy transformations, XSL, etc.).
    • Enforces scheduled crawl atomicity (no more than one crawl per content source at any given time)
    • Detects when a server in the cluster is no longer available, and notifies the rest of the servers to resume any outstanding work from the dead server.

All servers in an Aspire cluster are equal; there is no "master" server.

Prerequisites

  1. Configure a ZooKeeper cluster (version 3.4.14); a minimal zoo.cfg sketch is shown after this list
    1. ZooKeeper standalone configuration
    2. ZooKeeper cluster configuration (Recommended for Production)

  2. Make sure the NoSQL provider settings are properly configured to a database instance (or database cluster) accessible to all servers:
    1. MongoDB Provider Settings
    2. HBase Provider Settings
    3. Elasticsearch Provider Settings
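
As referenced in prerequisite 1, here is a minimal zoo.cfg sketch for a three-node ZooKeeper 3.4.x ensemble. The hostnames match the example cluster used in the Configuration section below; the data directory, ports, and server IDs are illustrative assumptions, so adapt them to your environment (each node also needs a myid file in its dataDir containing its server ID).

  # zoo.cfg (sketch): one copy per ZooKeeper node; paths and ports are examples
  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  # clientPort must match the port used for this host in Aspire's <externalServer> list
  clientPort=2181
  # Ensemble members: server.<id>=<host>:<peer-port>:<leader-election-port>
  server.1=zooA.dev.com:2888:3888
  server.2=zooB.dev.com:2888:3888
  server.3=zooC.dev.com:2888:3888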

Configuration

By default, Aspire distributions are configured to work in standalone mode:

  <configAdministration>
    <zookeeper enabled="false" root="/aspire">
      <!-- <externalServer>127.0.0.1:2181,127.0.0.1:2181,127.0.0.1:2181</externalServer> -->
    </zookeeper>
  </configAdministration>

Let's configure our Aspire cluster. On each Aspire server, modify config/settings.xml:

  1. The first step is to enable ZooKeeper and point to our ZooKeeper cluster (let's assume we have a three-server ZooKeeper cluster: zooA.dev.com, zooB.dev.com, and zooC.dev.com):

      <configAdministration>
        <zookeeper enabled="true" root="/aspire">
          <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
        </zookeeper>
      </configAdministration>
  2. It's recommended that you give each server a unique ID; otherwise, Aspire will use the server's IP address as its unique ID:


      <configAdministration>
        <zookeeper enabled="true" root="/aspire">
          <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
          <serverId>aspireA</serverId>
        </zookeeper>
      </configAdministration>
  3. If we were to start our Aspire servers now, they would not interact with each other as a cluster, because they do not share the same cluster ID. Let's define our cluster ID to be "dev" by uncommenting the <clusterId> field:

      <!-- By default all Aspire servers start in their own cluster. To make servers work together, set a common
           cluster id across multiple instances that are connected to a common zooKeeper instance and database
           provider (for example "dev" or "prod") -->
      <clusterId>dev</clusterId>
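
Putting the three steps together, the relevant pieces of config/settings.xml on server aspireA would look roughly like the sketch below. Treat it as illustrative: the exact position of <clusterId> in the file (and whether it sits inside or outside <configAdministration>) should follow the commented-out field already present in your distribution.

  <configAdministration>
    <zookeeper enabled="true" root="/aspire">
      <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
      <serverId>aspireA</serverId>
    </zookeeper>
  </configAdministration>

  <!-- Same value ("dev") on every server that should join this cluster -->
  <clusterId>dev</clusterId>

On the other servers, only <serverId> changes (for example, aspireB and aspireC); everything else stays identical.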

Starting the cluster

  1. Copy the license file to one of the Aspire servers (under config/license)
  2. From that server, execute:

    $ bin/pushLicense.sh
    Settings File: /aspire/config/settings.xml
    License File: /aspire/config/license/AspireLicense.lic
    License copied to /aspire/dev/license/AspireLicense.lic

    This will upload the license into the ZooKeeper cluster, so all new servers connecting to it will automatically download it.

  3. Execute the startup script on all servers (aspire.bat or aspire.sh)
  4. Browse to http://<aspireserver>:50505/aspire and verify that the health indicator is "green"

  5. Once all servers are up, choose any server and browse to http://<aspireserver>:50505/aspire/admin/ui/files/#/server
    1. Verify you can see all your servers

  6. (Optional verification) Add an Aspider Web Crawler content source and point it to any website you like
    1. Click Save
    2. Wait until the content source is loaded
    3. Verify you can see the same content source on the other servers.

From now on, any configuration change made on one server will be synchronized to the others.
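
If you want to confirm that the shared configuration is actually being stored in ZooKeeper, a quick sketch using the ZooKeeper CLI is shown below. The exact child nodes you will see under /aspire depend on the Aspire version, so they are not listed here.

  # Connect to one of the ZooKeeper ensemble members (port as configured in <externalServer>)
  $ bin/zkCli.sh -server zooA.dev.com:2182
  # List the Aspire root node configured in settings.xml (root="/aspire")
  [zk: zooA.dev.com:2182(CONNECTED) 0] ls /aspire

Once at least one server has started with ZooKeeper enabled, you should see child nodes created by the cluster under that root.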

Advanced properties

  • sharedFolder (default: config/shared): Path where the shared folder is located. This folder is used to share configuration files between servers in the cluster (seed files, mappings, transformation files, etc.).
  • connectionTimeout (default: 3000): Connection timeout (in ms); how long to wait for a server response.
  • sessionTimeout (default: 6000): The session timeout (in ms) to negotiate with the ZooKeeper servers.
  • syncTime (default: 10000): How often non-cluster mode should check for updates. In cluster mode, this only controls the update frequency for the sharedFolder.
  • releaseTimeoutSeconds (default: 30): Time (in seconds) to wait for a server to become available again after leaving the cluster. If a server that was detected as down does not come back online within this timeout, all of its resources are "released" from the NoSQL database.
  • maxRetries (default: 50): Maximum number of retries for the ZooKeeper connections.
  • maxBackoffSleepTime (default: 5000): Maximum time (in ms) to wait between retries. The wait increases exponentially for successive failures, starting at 500 ms.
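
As a sketch of how these properties could be applied, the snippet below adds a few of them to the <zookeeper> block of config/settings.xml. The element placement shown here is an assumption; check the commented defaults in your distribution's settings.xml for the exact element names and location.

  <zookeeper enabled="true" root="/aspire">
    <externalServer>zooA.dev.com:2182,zooB.dev.com:2183,zooC.dev.com:2181</externalServer>
    <serverId>aspireA</serverId>
    <!-- Advanced tuning (placement is assumed; verify against your distribution) -->
    <connectionTimeout>3000</connectionTimeout>
    <sessionTimeout>6000</sessionTimeout>
    <releaseTimeoutSeconds>30</releaseTimeoutSeconds>
    <maxRetries>50</maxRetries>
    <maxBackoffSleepTime>5000</maxBackoffSleepTime>
  </zookeeper>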

FAQ

I just connected my distribution to ZooKeeper for the first time, and all my content sources and configuration are gone

As mentioned above, when starting in cluster mode, Aspire pulls the configuration from ZooKeeper, which means it discards any local configuration it has. If you were using Aspire in standalone mode, all your configuration was stored locally in the config folder of your distribution, so when you connected Aspire to ZooKeeper for the first time, it discarded all of it in favor of pulling the configuration from ZooKeeper (which had nothing).

But don't worry: every time Aspire starts, it creates a backup of the system configuration and stores it under config/backups. Once you have started your distribution with ZooKeeper, you can import the backup and restore your system configuration!

How do I upload a new license to my cluster?


If you need to upload the license again, use the -force argument like this:

$ bin/pushLicense.sh -force

Then, to force all servers to download the new license, delete all files under config/license on every server.
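
A minimal sketch of that refresh, run on each server in the cluster (the path below is the default license folder; adjust if yours differs):

  # Run on every Aspire server after pushLicense.sh -force
  $ rm -f config/license/*
  # The servers then download the new license that was uploaded to ZooKeeper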

