The crawl phase indicates the action the crawl is currently performing. Each crawl transitions to the next phase when a set of conditions is met. Not all crawl types transition through all phases. These are the phases for each crawl type:
When a seed crawl enters a phase, its state is always "initializing". The manager in charge of monitoring the seed state then creates a "control item", adds it to the queue, and sets the state to "running". Any worker node can pick up the control item and process it. The manager transitions the crawl to the next phase only once the "control item" job is done and there are no pending jobs in the queue. The manager can also change the crawl phase or state in response to user interaction through the UI or the REST API (pausing or stopping a crawl). The following table describes each control item and the phase it is part of.
Item Id | Phase | Task |
---|---|---|
start | Crawl Start | Logs a crawlBegin action in the audit log and then is sent through the onPublishEvent if any workflows are assigned. |
preReprocess | Pre-Reprocess | Triggers the failed documents processing stage, processing any failed items from a previous crawl. |
identityCrawl | Crawl | Triggers an identity crawl, fetching identities from the given seed. |
root | Crawl | Triggers a content crawl, creating new items for the root paths. |
delete | Delete | Triggers the process deletes stages, fetching "untouched" items from the snapshot. |
postReprocess | Post-Reprocess | Triggers the failed documents processing stage, processing any failed items from the current crawl. |
end | Crawl End | Logs a crawlEnd action in the audit log and then is sent through the onPublishEvent if any workflows are assigned. |
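The control item lifecycle described above can be sketched as a small state machine. This is only an illustrative model, not the product's implementation: the `SeedCrawl` class, the `manager_tick` and `worker_tick` functions, the polling style, the terminal `"done"` state, and the example control item sequence are all assumptions. Only the item ids, phase names, and the "initializing" and "running" states come from the text and table above.

```python
from collections import deque
from dataclasses import dataclass, field

# Control item id -> phase, copied from the table above.
CONTROL_ITEM_PHASE = {
    "start": "Crawl Start",
    "preReprocess": "Pre-Reprocess",
    "identityCrawl": "Crawl",
    "root": "Crawl",
    "delete": "Delete",
    "postReprocess": "Post-Reprocess",
    "end": "Crawl End",
}


@dataclass
class SeedCrawl:
    """Hypothetical in-memory model of one seed crawl."""
    control_items: list  # ordered control item ids for this crawl (assumed)
    index: int = 0
    state: str = "initializing"  # state on entering any phase
    queue: deque = field(default_factory=deque)

    @property
    def phase(self) -> str:
        return CONTROL_ITEM_PHASE[self.control_items[self.index]]


def manager_tick(crawl: SeedCrawl) -> None:
    """One monitoring pass by the manager (assumed polling model)."""
    if crawl.state == "initializing":
        # Create the control item, enqueue it, and mark the crawl as running.
        crawl.queue.append(crawl.control_items[crawl.index])
        crawl.state = "running"
    elif crawl.state == "running" and not crawl.queue:
        # Control item processed and no pending jobs: move to the next phase.
        if crawl.index + 1 < len(crawl.control_items):
            crawl.index += 1
            crawl.state = "initializing"
        else:
            crawl.state = "done"  # hypothetical terminal state


def worker_tick(crawl: SeedCrawl):
    """A worker takes one job from the queue; removal stands in for completion."""
    return crawl.queue.popleft() if crawl.queue else None
```

For example, with a hypothetical sequence `["start", "root", "end"]`, the crawl stays in Crawl Start while the `start` item is still queued, and only enters Crawl on the first manager pass after a worker has taken it.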
The following diagram shows all possible transitions during a crawl. Note that, as mentioned in the Crawl Phase section, not all crawl types transition through all crawl phases.
When a crawl is stopped, the manager assigned to the seed will trigger the following steps:
When a crawl is paused, the manager assigned to the seed will trigger the following steps:
When a crawl is resumed, it continues from the phase it was in when paused, instead of restarting the crawl.
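The difference between resuming and restarting can be illustrated with a minimal sketch. The `CrawlStatus` class, the `pause` and `resume` helpers, and the `"paused"` state label are hypothetical names used only for illustration; the phase names are taken from the control item table. The point shown is simply that pausing and resuming change the state while leaving the phase untouched.

```python
from dataclasses import dataclass

# Phase names from the control item table, in table order.
PHASES = [
    "Crawl Start",
    "Pre-Reprocess",
    "Crawl",
    "Delete",
    "Post-Reprocess",
    "Crawl End",
]


@dataclass
class CrawlStatus:
    """Hypothetical record of where a crawl currently is."""
    phase: str = PHASES[0]
    state: str = "running"


def pause(status: CrawlStatus) -> None:
    # Only the state changes; the phase is preserved ("paused" is an assumed label).
    status.state = "paused"


def resume(status: CrawlStatus) -> None:
    # Resuming does not reset the phase to PHASES[0], so the crawl is not restarted.
    if status.state != "paused":
        raise ValueError("only a paused crawl can be resumed")
    status.state = "running"
```

A crawl paused during the Delete phase would therefore still be in the Delete phase after `resume`.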