Content sources for some repositories need to be throttled to prevent a crawl from “overrunning” the source and making it unavailable to end users. The manager nodes handle all allocation of work, including throttling.

Allocation of seeds to managers

A seed (and all the documents it contains) is managed by only one manager node. The manager groups items to be processed into batches, and when a worker needs more to process, it requests a batch from the manager. To allow for throttling of requests to a content server, all seeds for that server are managed on a single manager node, which avoids having to throttle in a distributed manner. If a content server needs the connector to be throttled, the manager node can slow the rate at which batches of items are released to the workers.

A manager node is responsible for the seeds actively being crawled. Seeds that are controlled by the same throttle policy are assigned to the same manager node, so these seeds can be throttled together. Allocation of seeds to managers is performed at the time a crawl starts. Where there is more than one manager in a system, responsibility for seeds is distributed across the managers as evenly as possible, within the constraints imposed by the throttle policies.
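The allocation rule described above can be sketched as follows. This is a minimal illustration and not Aspire's actual implementation: the function name `allocate_seeds`, the `(seed_id, policy_id)` tuples, and the "largest group to the least-loaded manager" heuristic are all assumptions chosen to show how seeds sharing a throttle policy stay together while load is spread as evenly as that constraint allows.

```python
from collections import defaultdict


def allocate_seeds(seeds, managers):
    """Assign seeds to managers, keeping each throttle policy on one manager.

    seeds: list of (seed_id, policy_id) tuples; policy_id is None for a seed
           with no throttle policy (hypothetical representation).
    managers: list of manager names.
    Returns a dict mapping manager name -> list of seed ids.
    """
    # Group seeds that share a throttle policy; unthrottled seeds stand alone.
    groups = defaultdict(list)
    for seed_id, policy_id in seeds:
        key = ("policy", policy_id) if policy_id is not None else ("seed", seed_id)
        groups[key].append(seed_id)

    allocation = {m: [] for m in managers}
    # Place the largest groups first, each on the least-loaded manager so far.
    for group in sorted(groups.values(), key=len, reverse=True):
        target = min(managers, key=lambda m: len(allocation[m]))
        allocation[target].extend(group)
    return allocation


seeds = [("s1", "p1"), ("s2", "p1"), ("s3", "p1"),
         ("s4", "p2"), ("s5", None), ("s6", None)]
allocation = allocate_seeds(seeds, ["manager-a", "manager-b"])
```

In this example the three seeds under policy `p1` land together on one manager, and the remaining seeds fill the other, so each manager ends up with three seeds.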

...

The manager counts the number of documents added to the memory queue for each throttle ID. Should the throttle rate be reached, the manager simply stops adding batches for this throttle ID to the memory queue. This causes the queues for that throttle ID to empty and reduces the rate of processing. Throttling information is held only in memory. If a manager fails, the rate should be at or below the throttle, meaning that a new manager can afford to start without reference to the previous throttle.
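The counting behaviour can be illustrated with a small in-memory counter. This is a sketch under stated assumptions: the class name `ThrottleCounter`, the fixed-window scheme, and the explicit `now` argument are hypothetical and were chosen for clarity; the source does not say how the manager measures the period.

```python
class ThrottleCounter:
    """In-memory count of items released per throttle ID in a fixed window."""

    def __init__(self, limit, period_seconds):
        self.limit = limit
        self.period = period_seconds
        self._windows = {}  # throttle_id -> (window_start, count)

    def try_release(self, throttle_id, batch_size, now):
        """Return True and record the batch if it fits in the current window."""
        start, count = self._windows.get(throttle_id, (now, 0))
        if now - start >= self.period:
            start, count = now, 0  # window expired: start a fresh one
        if count + batch_size > self.limit:
            self._windows[throttle_id] = (start, count)
            return False  # hold the batch back; the queue drains meanwhile
        self._windows[throttle_id] = (start, count + batch_size)
        return True


counter = ThrottleCounter(limit=100, period_seconds=1.0)
first = counter.try_release("server-a", 60, now=0.0)   # fits within the limit
second = counter.try_release("server-a", 60, now=0.5)  # would exceed it: held
third = counter.try_release("server-a", 60, now=1.2)   # new window: released
```

Because the state lives only in a dictionary, a replacement manager would simply start with an empty counter, which matches the point above that no throttle history needs to be recovered after a failure.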

...

A throttle policy indicates the number of items that can be processed over a period of time. These items can be related to documents or API calls. The policy indicates the number and type of items and the period. If the number of items allocated by the manager in the given period exceeds the threshold, the manager will not allocate items.

The connector may indicate an “API cost” of a document. For example, a call to retrieve a document from a repository may take 5 API calls. The throttling threshold may be set at 10 API calls per second. Since the manager allocates documents, it will multiply the “API cost” by the number of documents to decide if the throttle should be applied.
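The arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the comparison assumes that a batch is held back only when its total cost exceeds the threshold for the period.

```python
def batch_cost(document_count, api_cost_per_document):
    """Total API calls implied by allocating this many documents."""
    return document_count * api_cost_per_document


def within_threshold(document_count, api_cost_per_document, threshold):
    """True if allocating the documents would not exceed the threshold."""
    return batch_cost(document_count, api_cost_per_document) <= threshold


API_COST = 5    # API calls needed to retrieve one document (from the example)
THRESHOLD = 10  # API calls allowed per second (from the example)

two_docs_ok = within_threshold(2, API_COST, THRESHOLD)    # 2 * 5 = 10 calls
three_docs_ok = within_threshold(3, API_COST, THRESHOLD)  # 3 * 5 = 15 calls
max_docs_per_second = THRESHOLD // API_COST
```

With these figures the manager could allocate at most two documents per second before the throttle applies.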

Throttling policies can be assigned to:

  • seed
  • connection
  • credential

First priority is given to a throttle policy assigned to the seed, second priority goes to the connection, and last to the credential. (API calls are not yet supported in the current Aspire version.)
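The priority order can be expressed as a simple lookup. This is a minimal sketch: the function name `resolve_policy` and the use of `None` for "no policy assigned" are assumptions, not part of Aspire's API.

```python
def resolve_policy(seed_policy=None, connection_policy=None, credential_policy=None):
    """Return the effective throttle policy: seed, then connection, then credential."""
    for policy in (seed_policy, connection_policy, credential_policy):
        if policy is not None:
            return policy
    return None  # no throttle policy applies


# Hypothetical policy names used only for illustration.
effective = resolve_policy(connection_policy="conn-policy",
                           credential_policy="cred-policy")
```

Here no policy is assigned to the seed, so the connection's policy takes effect and the credential's policy is ignored.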