The crawl status is controlled by the manager to which the crawl is assigned. This prevents issues that could arise from synchronization between nodes. The crawl status is split into two properties:

  • Crawl Phase
  • Crawl State

To see the crawl control endpoints, check the Seeds API documentation.

Crawl Phase


The crawl phase indicates the action the crawl is currently performing. A crawl transitions to the next phase when a set of conditions is met. Not all crawl types transition through all phases. These are the phases for each crawl type:

  • Full Crawl:
    1. Idle
    2. Crawl Start
    3. Crawl
    4. Post-Reprocess (if enabled)
    5. Crawl End
  • Incremental Crawl:
    1. Idle
    2. Crawl Start
    3. Pre-Reprocess (if enabled)
    4. Crawl
    5. Deletes
    6. Post-Reprocess (if enabled)
    7. Crawl End
  • Identity Crawl:
    1. Idle
    2. Crawl Start
    3. Crawl
    4. Crawl End
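
The phase lists above can be sketched as a small lookup. This is only an illustration, not the product's implementation: the names `Phase`, `PHASES_BY_CRAWL_TYPE`, `phase_sequence` and `next_phase` are hypothetical, and the single `reprocess_enabled` flag is an assumption (the page only says "if enabled" and does not say whether the two reprocess phases are configured separately).

```python
from enum import Enum


class Phase(Enum):
    IDLE = "Idle"
    CRAWL_START = "Crawl Start"
    PRE_REPROCESS = "Pre-Reprocess"
    CRAWL = "Crawl"
    DELETES = "Deletes"
    POST_REPROCESS = "Post-Reprocess"
    CRAWL_END = "Crawl End"


# Ordered phases per crawl type, copied from the lists above.
PHASES_BY_CRAWL_TYPE = {
    "full": [
        Phase.IDLE, Phase.CRAWL_START, Phase.CRAWL,
        Phase.POST_REPROCESS, Phase.CRAWL_END,
    ],
    "incremental": [
        Phase.IDLE, Phase.CRAWL_START, Phase.PRE_REPROCESS, Phase.CRAWL,
        Phase.DELETES, Phase.POST_REPROCESS, Phase.CRAWL_END,
    ],
    "identity": [
        Phase.IDLE, Phase.CRAWL_START, Phase.CRAWL, Phase.CRAWL_END,
    ],
}

# Phases marked "(if enabled)" in the lists above.
OPTIONAL_PHASES = {Phase.PRE_REPROCESS, Phase.POST_REPROCESS}


def phase_sequence(crawl_type, reprocess_enabled=True):
    """Return the ordered phases for a crawl type.

    Assumption: one flag switches both reprocess phases on or off.
    """
    phases = PHASES_BY_CRAWL_TYPE[crawl_type]
    if reprocess_enabled:
        return list(phases)
    return [p for p in phases if p not in OPTIONAL_PHASES]


def next_phase(crawl_type, current, reprocess_enabled=True):
    """Return the phase that follows `current`, or None after the last one."""
    sequence = phase_sequence(crawl_type, reprocess_enabled)
    position = sequence.index(current)
    if position + 1 < len(sequence):
        return sequence[position + 1]
    return None
```

For example, `next_phase("incremental", Phase.CRAWL)` returns `Phase.DELETES`, while the same call for a full crawl returns `Phase.POST_REPROCESS`. What happens after Crawl End is not described on this page, so the sketch simply returns `None` there.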










Crawl State


Each crawl phase has a set of possible states. The allowable states for a given phase are a subset of all possible states, because some states do not make sense in some phases (“pausing” in the idle phase, for instance). The diagram shows the crawl phases and the allowable states.


Crawl Phase and State Transitions


When a seed crawl enters a phase, the state is always "initializing". The manager in charge of monitoring the seed state then creates a "control item", adds it to the queue, and sets the state to "running". Any worker node can pick up the control item and process it. The manager transitions the crawl to the next phase once the "control item" job is done and there are no pending jobs in the queue. The manager also changes the crawl phase or state in response to user interaction through the UI or the REST API (pausing or stopping a crawl). The following table describes each control item and the phase it is part of.


Item Id | Phase | Task
start | Crawl Start | Logs a crawlBegin action in the audit log and then is sent through the onPublishEvent if any workflows are assigned.
preReprocess | Pre-Reprocess | Triggers the failed documents processing stage, processing any failed items from a previous crawl.
identityCrawl | Crawl | Triggers an identity crawl, fetching identities from the given seed.
root | Crawl | Triggers a content crawl, creating new items for the root paths.
delete | Deletes | Triggers the process deletes stages, fetching "untouched" items from the snapshot.
postReprocess | Post-Reprocess | Triggers the failed documents processing stage, processing any failed items from the current crawl.
end | Crawl End | Logs a crawlEnd action in the audit log and then is sent through the onPublishEvent if any workflows are assigned.
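
The control item lifecycle described in this section can be illustrated with a minimal, single-process sketch. It is not the actual manager code: `Crawl`, `manager_tick` and `worker_tick` are hypothetical names, the queue is a plain in-memory `deque`, and the sketch assumes one control item per phase. Only the states "initializing" and "running" and the control item ids come from this page.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Crawl:
    phases: list          # ordered phase names for this crawl
    control_items: dict   # phase name -> control item id
    index: int = 0
    state: str = "initializing"   # state on entering any phase
    queue: deque = field(default_factory=deque)
    control_done: bool = False

    @property
    def phase(self):
        return self.phases[self.index]


def manager_tick(crawl):
    """One monitoring pass by the manager that owns the crawl."""
    if crawl.state == "initializing":
        # Create the control item for this phase, queue it, mark as running.
        crawl.queue.append(crawl.control_items[crawl.phase])
        crawl.control_done = False
        crawl.state = "running"
    elif crawl.state == "running" and crawl.control_done and not crawl.queue:
        # Control item finished and nothing is pending: move to the next phase.
        if crawl.index + 1 < len(crawl.phases):
            crawl.index += 1
            crawl.state = "initializing"


def worker_tick(crawl):
    """Any worker node takes the next queued job and processes it."""
    if not crawl.queue:
        return None
    job = crawl.queue.popleft()
    if job == crawl.control_items[crawl.phase]:
        crawl.control_done = True
    return job
```

With `phases=["Crawl Start", "Crawl End"]` and `control_items={"Crawl Start": "start", "Crawl End": "end"}`, alternating `manager_tick` and `worker_tick` walks the crawl from Crawl Start to Crawl End. The crawl only advances after the worker has processed the `start` item and the queue is empty.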


The following diagram shows all possible transitions during a crawl. Keep in mind that, as mentioned in the Crawl Phase section, not all crawl types transition through all crawl phases.


Stop


When a crawl is stopped, the manager assigned to the seed will trigger the following steps:

  • Clean up batches in memory waiting for a worker to pick them up.
  • Clean up batches waiting to be acknowledged by a worker.
  • Send a release seed request to all workers, which will clean up each worker's in-memory queue.
  • If a container is in the middle of a scan, any item discovered after the stop will be ignored.
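
As an illustration only, the stop steps above might be modeled as follows. All names here (`Manager`, `Worker`, `release_seed`, `on_item_discovered`, the batch dictionaries) are hypothetical and the data structures are simplified stand-ins. The real manager and workers are separate nodes, not objects in one process.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # In-memory queue of (seed_id, item) pairs held by this worker.
    queue: list = field(default_factory=list)

    def release_seed(self, seed_id):
        """Drop every queued item that belongs to the released seed."""
        self.queue = [(s, item) for (s, item) in self.queue if s != seed_id]


@dataclass
class Manager:
    workers: list
    pending_batches: dict = field(default_factory=dict)  # seed_id -> batches waiting for a worker
    unacked_batches: dict = field(default_factory=dict)  # seed_id -> batches waiting for an ack
    stopped: set = field(default_factory=set)

    def stop(self, seed_id):
        # 1. Clean up batches waiting for a worker to pick them up.
        self.pending_batches.pop(seed_id, None)
        # 2. Clean up batches waiting to be acknowledged by a worker.
        self.unacked_batches.pop(seed_id, None)
        # 3. Ask every worker to release the seed.
        for worker in self.workers:
            worker.release_seed(seed_id)
        # 4. Remember the stop so later discoveries can be ignored.
        self.stopped.add(seed_id)

    def on_item_discovered(self, seed_id, item):
        """Return True if the item was queued, False if it was ignored."""
        if seed_id in self.stopped:
            return False
        self.pending_batches.setdefault(seed_id, []).append([item])
        return True
```

Other seeds are untouched by `stop`: only the batches and worker queue entries that belong to the stopped seed are removed.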


Pause and Resume


When a crawl is paused, the manager assigned to the seed will trigger the following steps:

  • Clean up batches in memory waiting for a worker to pick them up.
  • Clean up batches waiting to be acknowledged by a worker.
  • Send a release seed request to all workers, which will clean up each worker's in-memory queue.
  • If a container is in the middle of a scan, any item discovered after the pause will be ignored.
  • All containers in progress will be marked as available, so they can be picked up again when the crawl is resumed (this may cause some items to be discovered twice).

When a crawl is resumed, it continues from the phase it was in when paused, instead of restarting the crawl.
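
The part of pause and resume that differs from a stop can be sketched like this. It is a hypothetical model: `PausableCrawl`, the container status strings and the "paused" state name are assumptions made for illustration, and the batch and worker queue cleanup steps are left out.

```python
from dataclasses import dataclass, field


@dataclass
class PausableCrawl:
    phase: str
    state: str = "running"
    # container id -> "available", "in_progress" or "done" (hypothetical statuses)
    containers: dict = field(default_factory=dict)

    def pause(self):
        # Containers that were mid-scan go back to "available" so they can be
        # picked up again on resume; their items may be discovered twice.
        for container_id, status in self.containers.items():
            if status == "in_progress":
                self.containers[container_id] = "available"
        self.state = "paused"

    def resume(self):
        """Continue in the same phase and return the containers to rescan."""
        self.state = "running"
        return [cid for cid, status in self.containers.items() if status == "available"]
```

The `phase` field is never reset by `pause` or `resume`, which mirrors the statement that a resumed crawl continues from the phase it was in rather than restarting.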



