
Crawls of some content sources need to be throttled so that they do not “overrun” the source and make it unavailable to its end users. The manager nodes handle all allocation of work, including throttling.

Allocation of seeds to managers

A manager node is responsible for the seeds actively being crawled. A seed (and all the documents it contains) is managed by only one manager node, even where there is more than one manager. The manager groups the items to be processed into batches, and when a worker needs more work it requests a batch from the manager.

To allow requests to a content server to be throttled, all seeds for that server are managed on a single manager node, which avoids having to throttle in a distributed manner. If a content server needs the connector to be throttled, the manager node can slow the rate at which batches of items are released to the workers. Seeds that are controlled by the same throttle policy are assigned to the same manager node so that they can be throttled together.

Allocation of seeds to managers is performed at the time a crawl starts. Where there is more than one manager in a system, responsibility for seeds is distributed across the managers as evenly as possible, within the constraints imposed by the throttle policies.

When a seed starts, it is allocated to a manager. The main manager maintains a list of those allocations. If the starting seed has a throttle policy and any in-progress crawl has the same policy, the main manager allocates the seed to the same manager as the earlier seed (to maintain the rule that seeds with the same throttle policy run on the same manager). If the seed does not have a throttle policy, or no other seed with the same policy is running, the main manager chooses a manager itself, trying to balance the number of seeds across managers.
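The allocation rule described above can be sketched as follows. This is a minimal illustration and not Aspire's actual code: the `SeedAllocator` class, its method names and the "fewest seeds wins" balancing rule are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional


class SeedAllocator:
    """Hypothetical model of the allocation list kept by the main manager."""

    def __init__(self, managers: List[str]) -> None:
        self.managers = list(managers)
        self.seed_to_manager: Dict[str, str] = {}
        self.seed_to_policy: Dict[str, Optional[str]] = {}

    def allocate(self, seed: str, policy: Optional[str] = None) -> str:
        manager = None
        if policy is not None:
            # Reuse the manager of any in-progress seed with the same policy.
            for other, other_policy in self.seed_to_policy.items():
                if other_policy == policy:
                    manager = self.seed_to_manager[other]
                    break
        if manager is None:
            # No policy, or no running seed shares it: balance by seed count.
            manager = self._least_loaded()
        self.seed_to_manager[seed] = manager
        self.seed_to_policy[seed] = policy
        return manager

    def finish(self, seed: str) -> None:
        # A finished crawl no longer constrains later allocations.
        self.seed_to_manager.pop(seed, None)
        self.seed_to_policy.pop(seed, None)

    def _least_loaded(self) -> str:
        load = Counter(self.seed_to_manager.values())
        # Ties go to the manager listed first (an assumption of this sketch).
        return min(self.managers, key=lambda m: load[m])
```

In this sketch, a seed that shares a throttle policy with a running seed always follows that seed, even if this leaves the managers unevenly loaded, which matches the "within the constraints imposed by the throttle policies" caveat.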

How throttling works

The manager counts the number of documents added to the memory queue for each “throttle id”. Once the throttle rate is reached, the manager simply stops adding batches for that throttle id to the memory queue. The queues for that throttle id then empty, which reduces the rate of processing. Throttling information is held only in memory. If a manager fails, the rate at that point should be at or below the throttle, so a new manager can start without reference to the previous manager's throttle state.
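The counting behaviour described above might look like the following sketch. It assumes a fixed counting window per throttle id and an injectable clock. The `ThrottleCounter` name, the window scheme and the parameters are illustrative assumptions, not taken from Aspire.

```python
import time
from typing import Callable, Dict, Tuple


class ThrottleCounter:
    """Hypothetical in-memory counter of documents queued per throttle id."""

    def __init__(self, limit: int, period_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.period = period_seconds
        self._clock = clock
        # throttle id -> (window start time, documents queued in this window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def try_release(self, throttle_id: str, batch_size: int) -> bool:
        """Return True if a batch may be added to the memory queue now."""
        now = self._clock()
        start, count = self._windows.get(throttle_id, (now, 0))
        if now - start >= self.period:
            # The period has elapsed, so start a new counting window.
            start, count = now, 0
        if count >= self.limit:
            # Throttle rate reached: stop adding batches for this id.
            self._windows[throttle_id] = (start, count)
            return False
        self._windows[throttle_id] = (start, count + batch_size)
        return True
```

Because the state lives only in a dictionary, a replacement manager would simply start with empty counters, as the text describes. Note that this sketch checks the count before adding a batch, so the final batch in a window can take the count above the limit.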

Throttling policies

Throttling is performed by the manager nodes. When seeds are allocated to managers, the allocation algorithm takes the throttle policy into account, using it to group the affected seeds onto a single manager. When creating batches to be allocated to workers, the manager tracks the number of items allocated over time.

A throttle policy specifies the number of items that can be processed in a given period. These items may be documents or API calls. The policy states the number of items, their type and the period. If the number of items allocated by the manager in the given period exceeds the threshold, the manager will not allocate further items.

The connector may indicate an “API cost” for a document. For example, retrieving a document from a repository may take 5 API calls, while the throttling threshold may be set at 10 API calls per second. Since the manager allocates documents rather than API calls, it multiplies the “API cost” by the number of documents to decide whether the throttle should be applied. In this example, the manager could release at most two such documents per second.
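The arithmetic in this example can be written out as a short sketch. The function names are hypothetical and, as API-call throttling is described as a planned feature, this only illustrates the calculation and not an existing Aspire interface.

```python
def batch_api_cost(document_count: int, api_cost: int) -> int:
    """API calls a batch is expected to consume (documents x cost per document)."""
    return document_count * api_cost


def max_documents_per_period(api_call_limit: int, api_cost: int) -> int:
    """Whole documents that fit inside the API-call threshold for one period."""
    return api_call_limit // api_cost


def exceeds_throttle(already_allocated_calls: int, document_count: int,
                     api_cost: int, api_call_limit: int) -> bool:
    """True if releasing the batch would take the period over the threshold."""
    total = already_allocated_calls + batch_api_cost(document_count, api_cost)
    return total > api_call_limit
```

With a cost of 5 API calls per document and a threshold of 10 API calls per second, `max_documents_per_period(10, 5)` gives 2, and a batch of three documents would exceed the threshold on its own.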

(Throttling by API calls is not yet supported in the current Aspire version.)


