...

Content sources for some repositories need to be throttled to prevent “overrunning” a source, which would make it unavailable to the end user. The manager nodes handle all allocation of work, including throttling.

Allocation of seeds to managers

A manager node is responsible for the seeds actively being crawled. A seed (and all the documents it contains) is managed by only one manager node; where there is more than one manager, the seed is the sole responsibility of one of them. The manager groups items to be processed into batches, and when a worker needs more to process, it requests a batch from the manager. To allow requests to a content server to be throttled, all seeds for that server are managed on a single manager node, which avoids having to throttle in a distributed manner. If a content server needs the connector to be throttled, the manager node can slow the rate at which batches of items are released to the workers. Seeds that are controlled by the same throttle policy are assigned to the same manager node, so these seeds can be throttled together. Allocation of seeds to managers is performed at the time a crawl starts. Where there is more than one manager in a system, responsibility for seeds is distributed across the managers as evenly as possible, within the constraints imposed by the throttle policies.
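The allocation rule above can be illustrated with a minimal sketch. It assumes a simple greedy strategy (largest group first, placed on the least-loaded manager); the actual allocation algorithm is not described here, and the function name `allocate_seeds` and the data shapes are hypothetical.

```python
from collections import defaultdict


def allocate_seeds(seeds, managers):
    """Assign seeds to managers at crawl start (illustrative sketch only).

    seeds:    dict mapping seed name -> throttle policy ID, or None if unthrottled
    managers: list of manager node names
    Returns a dict mapping manager name -> list of seed names.
    """
    if not managers:
        raise ValueError("at least one manager is required")

    # Seeds that share a throttle policy form one indivisible group,
    # so they always land on the same manager. Unthrottled seeds are
    # each their own group and can be placed freely.
    groups = defaultdict(list)
    for seed, policy in seeds.items():
        key = ("policy", policy) if policy is not None else ("seed", seed)
        groups[key].append(seed)

    assignment = {manager: [] for manager in managers}

    # Assumed heuristic: place the largest groups first, each on the
    # manager that currently holds the fewest seeds.
    for group in sorted(groups.values(), key=len, reverse=True):
        target = min(managers, key=lambda m: len(assignment[m]))
        assignment[target].extend(group)

    return assignment


if __name__ == "__main__":
    example = {"a": "p1", "b": "p1", "c": "p1", "d": None, "e": None, "f": "p2"}
    print(allocate_seeds(example, ["m1", "m2"]))
```

In this sketch the three seeds under policy `p1` stay together on one manager, and the remaining seeds fill the other manager, which is as even as the throttle constraint allows.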

...

The manager counts the number of documents added to the memory queue for each throttle ID. Should the throttle rate be reached, the manager will simply stop adding batches for this throttle ID to the memory queue. This will cause the queues for that throttle ID to empty and reduce the rate of processing. Throttling information will only be held in memory. If a manager fails, the rate should be at or below the throttle, meaning that a new manager can afford to start without reference to the previous throttle.
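A minimal sketch of this counting scheme follows. It assumes the throttle rate is expressed as a maximum number of documents per fixed time window; the class name `ThrottleGate`, its parameters, and the window model are hypothetical and not taken from the product.

```python
import time


class ThrottleGate:
    """In-memory, per-throttle-ID document counter (illustrative sketch only)."""

    def __init__(self, max_docs_per_window, window_seconds=1.0, clock=time.monotonic):
        self.max_docs = max_docs_per_window
        self.window_seconds = window_seconds
        self._clock = clock
        # throttle ID -> (window start time, documents queued in this window).
        # Held only in memory, so a replacement manager starts from zero.
        self._windows = {}

    def try_release(self, throttle_id, batch_size):
        """Return True if a batch may be added to the memory queue now."""
        now = self._clock()
        start, count = self._windows.get(throttle_id, (now, 0))

        # Start a fresh count once the current window has elapsed.
        if now - start >= self.window_seconds:
            start, count = now, 0

        # Once the rate has been reached, stop adding batches for this ID.
        if count >= self.max_docs:
            self._windows[throttle_id] = (start, count)
            return False

        self._windows[throttle_id] = (start, count + batch_size)
        return True
```

Because the check happens before a batch is counted, the last batch in a window may take the count slightly over the limit; a stricter variant could refuse any batch that would exceed it. The injectable `clock` parameter is only there to make the behavior easy to test.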

...

  • seed
  • connection
  • credential

First priority is given to a throttle policy assigned to the seed, second priority goes to the connection, and last to the credential.
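The priority order can be expressed as a short lookup. This is a sketch under the assumption that each level either carries a policy or has none (`None`); the function and parameter names are hypothetical.

```python
def effective_throttle_policy(seed_policy=None, connection_policy=None, credential_policy=None):
    """Return the throttle policy that applies, or None if none is assigned.

    Checks the seed first, then the connection, then the credential.
    """
    for policy in (seed_policy, connection_policy, credential_policy):
        if policy is not None:
            return policy
    return None


if __name__ == "__main__":
    # The seed-level policy wins even when the connection also has one.
    print(effective_throttle_policy(seed_policy="seed-slow", connection_policy="conn-fast"))
    # With no seed or connection policy, the credential policy applies.
    print(effective_throttle_policy(credential_policy="cred-default"))
```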

...