One of the manager nodes is designated as the main manager. This manager is responsible for:
When a manager starts, if a main manager is already running, the manager will assume a “non-main” role.
A single main manager node must always exist, so when a manager starts, or when a manager node fails, the managers co-ordinate to ensure that an active main manager exists. On start-up, a manager checks whether an active main manager exists. If it does, the manager simply joins the cluster. If an active main manager does not exist, the managers collaborate to elect a new one: the oldest manager is elected as main manager.
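The start-up decision above can be sketched as follows. This is a hypothetical illustration, not the actual implementation; the `Manager` record, its fields, and the `choose_main` helper are all assumptions. It assumes "oldest" means the manager with the earliest start time.

```python
from dataclasses import dataclass

@dataclass
class Manager:
    node_id: str
    started_at: float  # epoch seconds; a lower value means an older manager

def choose_main(managers, active_main_id=None):
    """Return the node_id that should act as main manager."""
    # If an active main manager already exists, a starting manager
    # simply joins the cluster under it.
    if active_main_id is not None and any(m.node_id == active_main_id for m in managers):
        return active_main_id
    # Otherwise the oldest manager (earliest start time) is elected.
    return min(managers, key=lambda m: m.started_at).node_id
```

For example, if no active main exists among managers started at times 100, 50, and 200, the one started at 50 is elected.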
Node failover is split into three areas: identification of a failed main manager, identification of failed manager nodes, and identification of failed worker nodes. Monitoring is done by examining a node's heartbeat entry in the settings NoSQL database and "detecting" a failure when that heartbeat is out of date by more than a given period. Any clean-up work (for instance, marking in-progress items as available again) is executed by the main manager.
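The heartbeat-staleness check can be sketched like this. The timeout value and the function name are assumptions for illustration; the real period is whatever the system is configured with.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value, the real period is configurable

def is_failed(last_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    """A node is 'detected' as failed when its heartbeat entry in the
    settings database is out of date by more than the allowed period."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > timeout
```

Note that this only *detects* a suspected failure; the clean-up itself is performed by the main manager, as described above.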
All non-main manager nodes monitor the main manager, and if it fails, each tries to become the main manager itself. Once the new main manager has been "elected", its operation changes to reflect the new role.
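Because several non-main managers may detect the failure at roughly the same time, the takeover needs to be atomic so that only one contender wins. A minimal sketch, assuming the settings database supports an atomic compare-and-set on the main-manager record (the `SettingsStore` class and method names here are hypothetical stand-ins):

```python
import threading

class SettingsStore:
    """Toy stand-in for the settings NoSQL database; assumes the real store
    offers an atomic compare-and-set on the main-manager record."""
    def __init__(self, main_id=None):
        self._lock = threading.Lock()
        self.main_id = main_id

    def compare_and_set_main(self, expected, new):
        # Atomically replace the main-manager record only if it still
        # names the manager we believe has failed.
        with self._lock:
            if self.main_id == expected:
                self.main_id = new
                return True
            return False

def try_become_main(store, failed_main_id, my_id):
    """Called by a non-main manager once it detects the main manager's
    heartbeat is stale; at most one contender's compare-and-set succeeds."""
    return store.compare_and_set_main(failed_main_id, my_id)
```

The winner then switches its operation to the main-manager role; the losers see the compare-and-set fail and remain non-main.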
Only the main manager monitors for failed manager nodes. If a manager fails, the main manager does the following:
Only the main manager node monitors for failed worker nodes. If a worker node is detected as failed, the main manager does the following: