Some content sources need to be throttled so that a crawl does not “overrun” the source and make it unavailable to its end users. The manager nodes handle all allocation of work, including throttling.

Allocation of seeds to managers

A seed (and all the documents it contains) is managed by exactly one manager node, even where the system has more than one manager. A manager node is responsible for the seeds actively being crawled. The manager groups the items to be processed into batches, and a worker that needs more work requests a batch from the manager.

To allow requests to a content server to be throttled, all seeds for that server are managed on a single manager node. This avoids having to throttle in a distributed manner. If a content server needs the connector to be throttled, the manager node can slow the rate at which batches of items are released to the workers. Seeds that are controlled by the same throttle policy are assigned to the same manager node, so that they can be throttled together.

Seeds are allocated to managers at the time a crawl starts. Where there is more than one manager in a system, responsibility for seeds is distributed across the managers as evenly as possible, within the constraints imposed by the throttle policies.

When a seed starts, it is allocated to a manager. The main manager maintains a list of those allocations. If the starting seed has a throttle policy, and any in-progress crawl has the same policy, the main manager allocates the seed to the same manager as the seed already being crawled (to maintain the rule that seeds with the same throttle policy run on the same manager). If the seed does not have a throttle policy, or no other seed with the same policy is running, the main manager chooses a manager, trying to balance the number of seeds across managers.
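The allocation rule above can be sketched roughly as follows. This is only an illustration, not the product's implementation: the class `SeedAllocator`, its method names, and the use of plain strings for managers, seeds, and policies are all assumptions made for the example, as is breaking ties in favour of the first manager listed.

```python
from collections import Counter
from typing import Dict, List, Optional


class SeedAllocator:
    """Hypothetical bookkeeping kept by the main manager (names are assumptions)."""

    def __init__(self, managers: List[str]) -> None:
        self.managers = list(managers)
        self.seed_to_manager: Dict[str, str] = {}
        self.seed_to_policy: Dict[str, Optional[str]] = {}

    def allocate(self, seed: str, policy: Optional[str] = None) -> str:
        # Reuse the manager of a running seed with the same throttle policy;
        # otherwise pick the manager that currently holds the fewest seeds.
        manager = self._manager_for_policy(policy) or self._least_loaded()
        self.seed_to_manager[seed] = manager
        self.seed_to_policy[seed] = policy
        return manager

    def release(self, seed: str) -> None:
        # Called when a seed finishes, so it no longer pins its policy to a manager.
        self.seed_to_manager.pop(seed, None)
        self.seed_to_policy.pop(seed, None)

    def _manager_for_policy(self, policy: Optional[str]) -> Optional[str]:
        if policy is None:
            return None
        for seed, seed_policy in self.seed_to_policy.items():
            if seed_policy == policy:
                return self.seed_to_manager[seed]
        return None

    def _least_loaded(self) -> str:
        load = Counter(self.seed_to_manager.values())
        # Assumption: ties go to the first manager in the list.
        return min(self.managers, key=lambda m: load[m])
```

In this sketch, two seeds started with the same policy always land on the same manager, while seeds without a policy are spread by seed count alone.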

How throttling works

The manager counts the number of documents added to the memory queue for each throttle ID. When the throttle rate is reached, the manager simply stops adding batches for that throttle ID to the memory queue. The queues for that throttle ID then empty, which reduces the rate of processing. Throttling information is held only in memory. If a manager fails, the rate at that point should be at or below the throttle, so a new manager can start without reference to the previous throttle state.

Throttling policies

Throttling is performed by the manager nodes. When seeds are allocated to managers, the algorithm takes account of the throttle policy, using it to group the seeds that share a policy on a single manager. When creating batches to be allocated to workers, the manager tracks the number of items allocated over time.

A throttle policy specifies how many items can be processed over time: it gives a number of items and a period. These items can be related to documents. If the number of items allocated by the manager in the given period exceeds the threshold, the manager stops allocating items.
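One way to picture the counting described above is a per-throttle-ID counter that resets each period. The sketch below is hedged: `ThrottlePolicy`, `ThrottleTracker`, and `try_release` are hypothetical names, and the fixed (non-sliding) window and the choice to hold back a batch that would take the count over the limit are assumptions, since the text does not specify either. State lives only in memory, as the text describes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ThrottlePolicy:
    max_items: int          # number of items allowed per period
    period_seconds: float   # length of the period


class ThrottleTracker:
    """In-memory item counts per throttle ID (hypothetical sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # throttle_id -> (window start time, items released in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def try_release(self, throttle_id: str, policy: ThrottlePolicy, batch_size: int) -> bool:
        """Return True if a batch of batch_size items may be queued now."""
        now = self._clock()
        start, count = self._windows.get(throttle_id, (now, 0))
        if now - start >= policy.period_seconds:
            # The period has elapsed: start a fresh window.
            start, count = now, 0
        if count + batch_size > policy.max_items:
            # Assumption: hold the batch back rather than exceed the limit.
            self._windows[throttle_id] = (start, count)
            return False
        self._windows[throttle_id] = (start, count + batch_size)
        return True
```

A manager loop built on this would call `try_release` before adding each batch to the memory queue and skip the throttle ID when it returns `False`. Note that in this sketch a batch larger than `max_items` is never released, so batch sizes would need to stay at or below the policy limit.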

Throttling policies can be assigned to a seed, a connection, or a credential.

First priority is given to a throttle policy assigned to the seed, second priority goes to the connection, and last to the credential.
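The precedence just described amounts to taking the first policy that is set, in the order seed, connection, credential. A minimal sketch, assuming policies are identified by strings and that an unassigned policy is represented as `None` (the function name `effective_throttle_policy` is hypothetical):

```python
from typing import Optional


def effective_throttle_policy(
    seed_policy: Optional[str],
    connection_policy: Optional[str],
    credential_policy: Optional[str],
) -> Optional[str]:
    """Pick the policy to apply: seed first, then connection, then credential."""
    for policy in (seed_policy, connection_policy, credential_policy):
        if policy is not None:
            return policy
    # No policy assigned at any level: the seed is not throttled.
    return None
```

For example, a seed with no policy of its own, crawled over a connection that has one, would be throttled by the connection's policy regardless of what the credential specifies.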