This feature allows jobs to be sent to remote Aspire distributions (called remote nodes). This can improve performance for resource-intensive pipelines by load balancing the work across all available remote nodes.
Communications between remote nodes are tightly coupled: nodes must be on the same intranet, in the same geographic location, and (recommended) with no firewalls or other security mechanisms between nodes in the cluster.
Since the 4.0 release, Aspire connectors can crawl in distributed mode automatically. Because all crawl control data is stored in MongoDB, simply adding more Aspire servers configured to use the same MongoDB makes the common connectors crawl in a distributed fashion.
To set up an Aspire cluster for distributed processing, you need to configure each Aspire server to use the same MongoDB instance:
If you need to connect to a multi node MongoDB installation, check: Connect to a Multi-node MongoDB Installation
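For a multi-node MongoDB installation, the connection string must list every replica-set member so crawl control keeps working across a failover. A minimal sketch of building such a string (the host addresses, database name, and `replica_set_uri` helper are illustrative, not part of Aspire's API; the actual MongoDB settings live in Aspire's configuration files):

```python
def replica_set_uri(hosts, db="aspire", replica_set=None):
    """Build a MongoDB connection URI that lists every replica-set
    member, so the client can fail over if the primary goes down."""
    uri = "mongodb://" + ",".join(hosts) + "/" + db
    if replica_set:
        # Naming the replica set lets the driver verify membership.
        uri += "?replicaSet=" + replica_set
    return uri

print(replica_set_uri(["10.10.30.1:27017", "10.10.30.2:27017"],
                      replica_set="rs0"))
```

Every Aspire server in the cluster would point at the same URI, which is what makes the distributed crawl possible.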
Once you have configured each instance, you need to configure distributed communications and a Discovery Manager, as described below.
A Discovery Manager is a component that handles different methods of discovery for remote nodes. There are three different Discovery Managers: Default, Amazon EC2 and ZooKeeper.
This Discovery Manager has the basic functionality for remote nodes and resource discovery.
Example configuration:
<discoveryManager type="default">
</discoveryManager>
One of the following discovery methods must be configured inside the <discoveryManager type="default"> tag.
Useful for debugging or in well-known cluster setups (with static IP address configurations). Reads a list of remote nodes and registers them for remote branching.
Each node is identified by its IP Address and its distributed communications port.
The available options (checkTimeout and remoteNodes) are shown in the following example configuration:
<discoveryManager type="default">
<discovery type="static">
<checkTimeout>45000</checkTimeout>
<remoteNodes>
<remoteNode portNumber="51510">10.10.30.122</remoteNode>
<remoteNode portNumber="51515">10.10.20.139</remoteNode>
</remoteNodes>
</discovery>
</discoveryManager>
You must specify each remote node's IP address and port as shown above. Note that the port number is the distributed communications port, not Aspire's ordinary HTTP port.
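Conceptually, static discovery amounts to a fixed node list plus a periodic liveness check. The sketch below illustrates that idea; the class and method names are hypothetical, not Aspire's implementation, and `check_timeout_ms` mirrors the `<checkTimeout>` setting above:

```python
import time

class StaticNodeRegistry:
    """Sketch of static discovery: a fixed list of (ip, port) nodes,
    re-checked for liveness every check_timeout_ms milliseconds."""

    def __init__(self, nodes, check_timeout_ms=45000, clock=time.monotonic):
        # nodes mirrors <remoteNode portNumber="...">IP</remoteNode> entries.
        self.nodes = {(ip, port): True for ip, port in nodes}
        self.check_timeout = check_timeout_ms / 1000.0
        self.clock = clock
        self.last_check = clock()

    def due_for_check(self):
        return self.clock() - self.last_check >= self.check_timeout

    def run_checks(self, is_alive):
        # is_alive: callable (ip, port) -> bool, e.g. a TCP connect probe
        # against the distributed communications port.
        for node in self.nodes:
            self.nodes[node] = is_alive(*node)
        self.last_check = self.clock()

    def alive_nodes(self):
        return [n for n, ok in self.nodes.items() if ok]
```

Because the list is static, a node that goes down is only marked dead, never removed; it rejoins automatically once its probe succeeds again.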
Intercepts discovery messages sent by other nodes. If the incoming message is from a new node, that node is registered; otherwise, the node's information is updated.
You must enable this discovery method if you want a node to broadcast information about itself.
Available options are:
Element | Type | Default | Description
---|---|---|---
broadcastPort | int | none | (required) Port used to listen for messages from other nodes and to broadcast information about the current node.
multicastAddressGroup | | none | Multicast address group used to listen for and send broadcast messages. This group must be the same on all nodes in the same cluster.
Example configuration:
<discoveryManager type="default">
<discovery type="broadcast">
<broadcastPort>10324</broadcastPort>
<multicastAddressGroup>230.0.0.1</multicastAddressGroup>
</discovery>
</discoveryManager>
A complete distributed communications configuration can combine both discovery methods:
<distributedCommunications enabled="true">
<checkpointJobRequests>true</checkpointJobRequests>
<connectionIdleTimeout>120000</connectionIdleTimeout>
<port>51510</port>
<pollTimeout>100</pollTimeout>
<tcp>
<keepAlive>false</keepAlive>
<trafficClass>2</trafficClass>
<reuseAddress>false</reuseAddress>
<readTimeout>10000</readTimeout>
<tcpNoDelay>false</tcpNoDelay>
</tcp>
<discoveryManager type="default">
<discovery type="static">
<checkTimeout>45000</checkTimeout>
<remoteNodes>
<remoteNode portNumber="51515">192.168.0.122</remoteNode>
<remoteNode portNumber="51515">10.10.20.139</remoteNode>
</remoteNodes>
</discovery>
<discovery type="broadcast">
<broadcastPort>10324</broadcastPort>
<multicastAddressGroup>230.0.0.1</multicastAddressGroup>
</discovery>
</discoveryManager>
</distributedCommunications>
Because network restrictions prevent the broadcast discovery method from working on Amazon Elastic Compute Cloud (EC2), you can use the Amazon EC2 Discovery Manager for dynamic discovery of remote nodes.
Example configuration:
<discoveryManager type="amazonec2">
<implementation>com.searchtechnologies.aspire:aspire-amazonec2-dm</implementation>
</discoveryManager>
Example configuration:
<discoveryManager type="amazonec2">
<implementation>com.searchtechnologies.aspire:aspire-amazonec2-dm</implementation>
<discovery type="amazonec2">
<accessKey>encrypted:ENCRYPTED_ACCESS_KEY</accessKey>
<secretKey>encrypted:ENCRYPTED_SECRET_KEY</secretKey>
</discovery>
</discoveryManager>
Advanced configuration:
<discoveryManager type="amazonec2">
<implementation>com.searchtechnologies.aspire:aspire-amazonec2-dm</implementation>
<discovery type="amazonec2">
<accessKey>encrypted:ENCRYPTED_ACCESS_KEY</accessKey>
<secretKey>encrypted:ENCRYPTED_SECRET_KEY</secretKey>
<usePublicIP>false</usePublicIP>
<securityGroup>MySecurityGroup</securityGroup>
<pollFrequency>1000</pollFrequency>
</discovery>
</discoveryManager>
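Conceptually, EC2 discovery polls the EC2 API (every `pollFrequency` milliseconds) and extracts node addresses from the instance list. The sketch below works over a DescribeInstances-style response shape; in practice the data would come from the AWS SDK, which this sketch deliberately avoids, and `node_addresses` is a hypothetical helper, not Aspire's implementation:

```python
def node_addresses(reservations, security_group=None, use_public_ip=False):
    """Extract candidate node IPs from a DescribeInstances-style
    response: Reservations -> Instances, each with State,
    SecurityGroups, and Private/PublicIpAddress fields."""
    ips = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst.get("State", {}).get("Name") != "running":
                continue  # only running instances can host nodes
            groups = {g["GroupName"] for g in inst.get("SecurityGroups", [])}
            if security_group and security_group not in groups:
                continue  # <securityGroup> narrows discovery to one group
            # <usePublicIP> selects which address family to register.
            key = "PublicIpAddress" if use_public_ip else "PrivateIpAddress"
            if key in inst:
                ips.append(inst[key])
    return ips
```

This mirrors the advanced options above: `securityGroup` filters which instances count as cluster members, and `usePublicIP` chooses between private and public addresses.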
This Discovery Manager uses ZooKeeper as a centralized registry for discovering remote nodes and their resources. See the ZooKeeper documentation for details about installation and configuration.
Example configuration:
<discoveryManager type="zookeeper">
<implementation>com.searchtechnologies.aspire:aspire-zk-dm</implementation>
<zookeeperConnection>127.0.0.1:2182,127.0.0.1:2183,127.0.0.1:2181</zookeeperConnection>
</discoveryManager>
Advanced configuration:
<discoveryManager type="zookeeper">
<implementation>com.searchtechnologies.aspire:aspire-zk-dm</implementation>
<zookeeperConnection>127.0.0.1:2182,127.0.0.1:2183,127.0.0.1:2181</zookeeperConnection>
<zookeeperPath>/aspire/nodes</zookeeperPath>
<zookeeperTimeout>3000</zookeeperTimeout>
<resourceUpdateTime>2000</resourceUpdateTime>
</discoveryManager>
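The ZooKeeper approach registers each node under a shared path (the `zookeeperPath` above), typically as an ephemeral znode that ZooKeeper deletes automatically when the node's session expires, so other nodes see departures without polling each other. The helpers below sketch only the pure parts of that scheme (connection-string parsing and znode path layout); the function names are hypothetical, and a real implementation would use a ZooKeeper client library such as Curator or kazoo:

```python
def parse_connection(conn):
    """Split a 'host:port,host:port,...' ensemble string, as in
    <zookeeperConnection>, into (host, port) pairs so the client
    can fail over across ensemble members."""
    pairs = []
    for part in conn.split(","):
        host, port = part.rsplit(":", 1)
        pairs.append((host, int(port)))
    return pairs

def node_znode(base_path, ip, port):
    """Path under which a node would register itself as an ephemeral
    znode; the znode vanishing (after the session timeout, cf.
    <zookeeperTimeout>) signals that the node is gone."""
    return "%s/%s:%d" % (base_path.rstrip("/"), ip, port)
```

With this layout, discovering the cluster is just listing the children of the base path, and watching that path yields join/leave notifications.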