Worker nodes are responsible for processing items. Each node holds in-memory queues of batches requested from the managers. Separate queues are used for batches of items to process and batches of items to scan.

A thread checks the queue periodically, and if the queue size is below a certain threshold, the worker requests further batches. For each batch received, the worker sets each item in the batch to “in progress” in NoSQL to confirm its receipt. Once all items have been marked, the worker sends an “acknowledge batch” to the manager, which causes the manager to remove the batch from its “ready batches” queue. Requests for batches are directed round robin across all active managers. If a request fails, the worker moves on to the next manager, assuming that either the error was transient or the manager has failed and will be marked as such by the main manager.
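
The round-robin request with failover described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name RoundRobinBatchRequester, the representation of managers as address strings, and the requestBatch function are all hypothetical and are not taken from the actual worker code.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: asks managers for batches round robin, skipping failed requests. */
final class RoundRobinBatchRequester<B> {
    private final List<String> managers;             // active manager addresses (assumed shape)
    private final Function<String, B> requestBatch;  // hypothetical remote call; throws on failure
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBatchRequester(List<String> managers, Function<String, B> requestBatch) {
        this.managers = List.copyOf(managers);
        this.requestBatch = requestBatch;
    }

    /** Tries each manager at most once, starting where the previous request left off. */
    Optional<B> requestNext() {
        for (int attempt = 0; attempt < managers.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), managers.size());
            try {
                // Assumes a successful request never returns null.
                return Optional.of(requestBatch.apply(managers.get(index)));
            } catch (RuntimeException e) {
                // Treat as a transient error or a failed manager and try the next one.
            }
        }
        return Optional.empty(); // every manager failed on this round
    }
}
```

In this sketch the worker never marks a manager as failed itself; it only skips it for the current request, which matches the idea that the main manager is responsible for that decision.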


Configuration


Worker nodes can be configured by setting environment variables or JVM properties, or through the settings JSON that is uploaded to the NoSQL database.
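
As an illustration of how the three configuration sources might be combined, here is a minimal sketch. The lookup order (environment variable, then JVM property, then stored settings), the key names, and the mapping of keys to environment variable names are assumptions made for the example; this page does not specify them.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: resolves one setting from three sources in an assumed order. */
final class WorkerSettings {
    private final Function<String, String> env;       // e.g. System::getenv
    private final Function<String, String> jvmProps;  // e.g. System::getProperty
    private final Map<String, String> storedSettings; // parsed settings JSON from NoSQL (assumed flat)

    WorkerSettings(Function<String, String> env,
                   Function<String, String> jvmProps,
                   Map<String, String> storedSettings) {
        this.env = env;
        this.jvmProps = jvmProps;
        this.storedSettings = storedSettings;
    }

    /** Builds an instance backed by the real process environment and JVM properties. */
    static WorkerSettings fromSystem(Map<String, String> storedSettings) {
        return new WorkerSettings(System::getenv, System::getProperty, storedSettings);
    }

    /** Assumed precedence: environment variable, then JVM property, then stored settings. */
    Optional<String> get(String key) {
        String value = env.apply(toEnvName(key));
        if (value == null) value = jvmProps.apply(key);
        if (value == null) value = storedSettings.get(key);
        return Optional.ofNullable(value);
    }

    /** Assumed naming convention: "worker.queue.threshold" becomes "WORKER_QUEUE_THRESHOLD". */
    static String toEnvName(String key) {
        return key.toUpperCase(Locale.ROOT).replace('.', '_');
    }
}
```

Passing the sources in as functions keeps the lookup logic testable without touching the real process environment.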


Loading Components


Seeds can share connectors and workflows. For worker nodes, this means that a single instance of a given connector or workflow is loaded for multiple seeds. These components are loaded on demand, as soon as a worker node needs to process an item of a given seed, and are kept in memory while in use.

When an item is about to be processed and its assigned connector or workflow is not loaded, the worker automatically loads the components before processing the item. These components are kept in a list of loaded components and are removed after a certain idle period. The worker detects configuration changes on these components using a checksum, verifying that the currently loaded configuration is still valid whenever a new crawl for a seed is detected. The API does not allow configuration changes while a seed is running.
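
The on-demand loading, checksum validation, and idle eviction described above might look roughly like the sketch below. The ComponentCache class, its method names, and the loader function are hypothetical. As a simplification, the checksum is compared on every lookup rather than only when a new crawl is detected.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: shared components loaded on demand and evicted after an idle period. */
final class ComponentCache<C> {
    private static final class Entry<T> {
        final T component;
        final String checksum;
        Instant lastUsed;

        Entry(T component, String checksum) {
            this.component = component;
            this.checksum = checksum;
        }
    }

    private final Map<String, Entry<C>> loaded = new HashMap<>();
    private final Duration maxIdle;

    ComponentCache(Duration maxIdle) {
        this.maxIdle = maxIdle;
    }

    /** Returns the loaded component, (re)loading it if absent or if its checksum changed. */
    synchronized C get(String id, String checksum, Function<String, C> loader, Instant now) {
        Entry<C> entry = loaded.get(id);
        if (entry == null || !entry.checksum.equals(checksum)) {
            entry = new Entry<>(loader.apply(id), checksum);
            loaded.put(id, entry);
        }
        entry.lastUsed = now;
        return entry.component;
    }

    /** Removes components idle for longer than maxIdle; returns how many were removed. */
    synchronized int evictIdle(Instant now) {
        int before = loaded.size();
        loaded.values().removeIf(e -> Duration.between(e.lastUsed, now).compareTo(maxIdle) > 0);
        return before - loaded.size();
    }

    synchronized int size() {
        return loaded.size();
    }
}
```

The current time is passed in explicitly so that the eviction behaviour can be exercised without waiting; a real worker would presumably call evictIdle from a periodic task.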


Batching


Batching is now done at the publisher level inside the workflow. This allows a batch to contain items from different seeds that share a destination. When a batch fails, the failure is reported back to all the jobs the batch contained and, depending on the connector configuration, each job is marked appropriately (marked as error, marked as completed, or deleted from the queue).
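
The failure handling for a mixed batch could be sketched as below. All names here (BatchFailureHandler, OnFailure, Job, the state strings) are hypothetical, and falling back to MARK_ERROR when a connector has no configured policy is an assumption made for the example.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: propagates a batch failure to every job the batch contained. */
final class BatchFailureHandler {
    /** The three outcomes named above; the enum itself is an assumption. */
    enum OnFailure { MARK_ERROR, MARK_COMPLETED, DELETE }

    static final class Job {
        final String id;
        final String connectorId;

        Job(String id, String connectorId) {
            this.id = id;
            this.connectorId = connectorId;
        }
    }

    /** Updates jobStates in place for each job of the failed batch, per connector policy. */
    static void onBatchFailed(List<Job> batch,
                              Map<String, OnFailure> policyByConnector,
                              Map<String, String> jobStates) {
        for (Job job : batch) {
            // Assumed default when the connector has no explicit policy.
            OnFailure policy = policyByConnector.getOrDefault(job.connectorId, OnFailure.MARK_ERROR);
            switch (policy) {
                case MARK_ERROR:
                    jobStates.put(job.id, "ERROR");
                    break;
                case MARK_COMPLETED:
                    jobStates.put(job.id, "COMPLETED");
                    break;
                case DELETE:
                    jobStates.remove(job.id);
                    break;
            }
        }
    }
}
```

Because a batch can mix jobs from different seeds, the policy is looked up per job rather than once per batch.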
