Aspire 5.0 introduced a major architectural redesign over its predecessors, Aspire 3.x and 4.x, to tackle the most common sources of complexity in managing Aspire deployments: configuration, availability, and coordination of crawl execution.

The biggest change you will notice compared to prior versions is that the content-source no longer exists. Crawl configuration has been split into reusable entities with relationships to one another.

What used to be called a "content-source" is now a collection of related configuration objects:

  • Connector
    • Common connector behavior
  • Credential
    • To authenticate to a specific repository
  • Connection
    • Server IP/host/port
    • Connection properties (timeouts, concurrency, etc.)
  • Throttle and Routing Policies
    • How often documents should be processed
    • Which nodes should process the documents
  • Workflow
    • Sequence of rules to be executed for each document
  • Seed
    • Starting point of a single crawl to execute

With this approach, you configure each object once and reuse it to create multiple seeds for the same source repository. If you need to change the credentials, for example, you update the credential object only, and every seed related to it picks up the change.
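The reuse described above can be illustrated with a minimal Python sketch. The class and field names (`Credential`, `Connection`, `Seed`, `timeout_seconds`) and the example host and paths are hypothetical and chosen only for illustration; they are not Aspire's actual API or configuration schema.

```python
from dataclasses import dataclass

# Hypothetical classes for illustration only; not Aspire's API.


@dataclass
class Credential:
    username: str
    password: str


@dataclass
class Connection:
    host: str
    port: int
    timeout_seconds: int = 30


@dataclass
class Seed:
    url: str
    connection: Connection  # shared object, not a copy
    credential: Credential  # shared object, not a copy


# Configure the credential and connection once...
credential = Credential("crawler", "old-secret")
connection = Connection("files.example.com", 445)

# ...and reuse them for several seeds of the same repository.
seeds = [
    Seed("/share/hr", connection, credential),
    Seed("/share/finance", connection, credential),
]

# Change the password once, on the credential object only.
credential.password = "new-secret"

# Every seed related to that credential sees the new value.
passwords = [seed.credential.password for seed in seeds]
```

Because both seeds hold a reference to the same credential object, a single update is enough, which is the behavior the old per-content-source configuration could not offer.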

Another big change is the introduction of a manager/worker architecture: the manager nodes coordinate configuration, crawls, and failure recovery, while the worker nodes only execute jobs (which represent documents).
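As a rough sketch of this division of responsibilities, the Python below models a manager that hands jobs to workers and re-queues a job when a worker fails. The `Manager` and `Worker` classes, the round-robin assignment, and the `max_attempts` retry limit are all assumptions made for illustration; the source does not describe how Aspire actually distributes or recovers jobs.

```python
from collections import deque

# Hypothetical model for illustration only; not Aspire's implementation.


class Worker:
    """Only executes jobs; knows nothing about crawl coordination."""

    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)  # jobs this worker fails on (simulated)

    def execute(self, job):
        if job in self.fail_on:
            raise RuntimeError(f"{self.name} failed on {job}")
        return f"{job} processed by {self.name}"


class Manager:
    """Coordinates a crawl: assigns jobs and recovers from failures."""

    def __init__(self, workers):
        self.workers = workers

    def run_crawl(self, jobs, max_attempts=3):
        queue = deque((job, 1) for job in jobs)
        results = []
        turn = 0
        while queue:
            job, attempt = queue.popleft()
            # Assumed policy: plain round-robin over the workers.
            worker = self.workers[turn % len(self.workers)]
            turn += 1
            try:
                results.append(worker.execute(job))
            except RuntimeError:
                # Failure recovery: re-queue until attempts run out.
                if attempt < max_attempts:
                    queue.append((job, attempt + 1))
        return results


workers = [Worker("worker-1", fail_on={"doc-3"}), Worker("worker-2")]
manager = Manager(workers)
results = manager.run_crawl(["doc-1", "doc-2", "doc-3"])
```

In this run, `doc-3` first lands on `worker-1`, fails, and is re-queued by the manager, after which `worker-2` processes it. The workers never decide what to retry; that logic lives in the manager only.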

Other features:

  • Chained schedules, which allow crawls to start only after other crawls have finished.
  • Tag-based crawling: jobs of certain crawls can be delegated to certain worker nodes, allowing for geo-located crawls.
  • Out-of-the-box throttling policies, which throttle the execution of jobs across the cluster for certain crawls or for related crawls (those sharing the same connection or credential objects).
  • Brand-new UI
  • Re-designed REST API
  • Optimized for containerization
    • Official Docker image available for download and use
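Two of the features listed above, chained schedules and tag-based crawling, can be sketched in a few lines of Python. The crawl names, worker names, and tags are made up, and the matching rule (a worker is eligible when it carries every tag of the crawl) is an assumption for illustration rather than Aspire's documented behavior.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Chained schedules: each crawl maps to the crawls that must finish first.
# Crawl names are hypothetical.
chained_schedules = {
    "intranet-full": set(),
    "intranet-incremental": {"intranet-full"},
    "archive": {"intranet-incremental"},
}

# A valid start order: every crawl appears after the crawls it waits for.
start_order = list(TopologicalSorter(chained_schedules).static_order())


# Tag-based crawling: route jobs only to workers with matching tags.
def eligible_workers(crawl_tags, worker_tags):
    """Return workers that carry all of the crawl's tags (assumed rule)."""
    return [
        name for name, tags in worker_tags.items() if crawl_tags <= tags
    ]


# Worker names and tags are hypothetical.
worker_tags = {
    "worker-eu-1": {"eu"},
    "worker-us-1": {"us"},
    "worker-eu-2": {"eu", "large-files"},
}

eu_workers = eligible_workers({"eu"}, worker_tags)
```

Here `start_order` lists the full crawl first, then the incremental crawl, then the archive crawl, and `eu_workers` contains only the two workers tagged `eu`, which is how a geo-located crawl would stay within one region.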