Migration guide

Before migrating your Aspire deployments to Aspire 5.0, make sure you understand the architectural changes described in: Aspire 5.0 Architecture

Migrating to Aspire 5.0 changes not only how crawls are configured; the hardware architecture must also be reconsidered.

The following areas must be taken into consideration:

Hardware
Crawl configuration

Hardware

Aspire 5.0 deployments consist of two distinct types of nodes: managers and workers.

In prior versions, adding nodes provided high availability and horizontally scaled crawl capacity at the same time: if you had 2 Aspire nodes, you had twice the capacity of a single server. This was a poor fit for cases where high availability was desired without increasing crawl throughput.

In Aspire 5.0 you can separate high availability requirements from crawl capacity requirements by allocating only the number of worker nodes needed to match your required throughput.

For high availability, at least 2 manager nodes are recommended: if one fails, the other can take over its work.
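
As a rough illustration of sizing workers independently of managers, the sketch below computes a worker count from a throughput target. The function name and the throughput figures are hypothetical; the per-worker rate is something you would have to measure in your own environment, not a number published for Aspire.

```python
import math


def workers_needed(required_docs_per_hour: float,
                   docs_per_hour_per_worker: float) -> int:
    """Smallest worker count whose combined throughput meets the target.

    Both rates are assumptions supplied by the caller. At least one worker
    is always returned, since a deployment needs one to crawl at all.
    """
    if required_docs_per_hour < 0:
        raise ValueError("required throughput cannot be negative")
    if docs_per_hour_per_worker <= 0:
        raise ValueError("per-worker throughput must be positive")
    return max(1, math.ceil(required_docs_per_hour / docs_per_hour_per_worker))


# Hypothetical figures, for illustration only.
MANAGERS_FOR_HA = 2  # recommended manager count for high availability
workers = workers_needed(required_docs_per_hour=250_000,
                         docs_per_hour_per_worker=100_000)
print(f"managers: {MANAGERS_FOR_HA}, workers: {workers}")
```

The manager count is fixed by the availability requirement, while the worker count follows only from throughput, which is the separation described above.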

Resource requirements:

| Node    | Minimum nodes | Recommended nodes | Minimum resources      | Recommended resources   |
| Manager | 1             | 2                 | 4 GB RAM, 2 CPU cores  | 8 GB RAM, 4 CPU cores   |
| Worker  | 1             | 2                 | 8 GB RAM, 4 CPU cores  | 16 GB RAM, 4 CPU cores  |

These recommendations are based on typical workloads. Fine-tuning is recommended, especially if the workload consists of large files (averaging over 100 MB).
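
The resource requirements above can be encoded as data and used to sanity-check a planned deployment. This is a minimal sketch: the `NodeSpec` type and `rate_node` function are hypothetical names introduced here, and only the RAM and CPU figures come from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    ram_gb: int
    cpu_cores: int


# Figures copied from the resource requirements in this guide.
MINIMUM = {"manager": NodeSpec(4, 2), "worker": NodeSpec(8, 4)}
RECOMMENDED = {"manager": NodeSpec(8, 4), "worker": NodeSpec(16, 4)}


def rate_node(role: str, spec: NodeSpec) -> str:
    """Classify a planned node as 'below minimum', 'minimum' or 'recommended'."""
    def meets(target: NodeSpec) -> bool:
        return spec.ram_gb >= target.ram_gb and spec.cpu_cores >= target.cpu_cores

    if not meets(MINIMUM[role]):
        return "below minimum"
    if meets(RECOMMENDED[role]):
        return "recommended"
    return "minimum"


# Hypothetical planned hardware.
planned = {"manager": NodeSpec(8, 4), "worker": NodeSpec(8, 4)}
for role, spec in planned.items():
    print(role, rate_node(role, spec))
```

A check like this does not replace tuning for large-file workloads; it only flags nodes that fall short of the published baseline.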

Crawl Configuration

Configuring Aspire 5.0 is where most of the migration time is likely to be spent. The old "content-source" configurations have been split into separate sections: what used to be 4 XML files per content source can now be more than 7 related entities, depending on the connector.

Each connector determines what goes where, but roughly speaking this is how they should now be configured:

connector instance
- General behavior of the connector application inside the worker. Most properties under "Advanced Configuration" in Aspire 4.0 are found here. See: Connectors API
- A single connector instance can be reused for many different connections.
credential
- All access-related properties: account names, passwords, authentication type, etc. See: Credentials API
- A single credential instance can be reused for many different connections.
connection
- Everything related to the actual connection to the repository, such as server URL, connection timeouts, and proxies. See: Connections API
- Can be associated with 1 credential instance.
- Can be associated with 1 connector instance.
workflow
- The same workflow concept as before, but it must be configured from scratch in the UI or via REST commands, because it is no longer an XML file. See: Workflow API
schedule
- Similar to the "content-source" schedules in Aspire 4.0. It supports time schedules as well as the new "sequence" schedules, which can trigger a crawl after another schedule has completed. See: Schedules API
policies
- New in Aspire 5.0. There are two types of policies. See: Policies API
  - routing
    - Determines which worker nodes can receive jobs flagged with certain tags. Must be applied to seeds.
  - throttle
    - Throttles job batch delivery to worker nodes, allowing the crawl rate to be controlled.
    - Can be applied to seeds, connections or credentials
seed
- Starting point of a crawl. See: Seeds API
- In Aspire 4.0 this was a list of URLs in the same content source, or a file containing all the seed URLs. In Aspire 5.0 each seed must be configured separately and is crawled independently of the others.
- Can be associated with one or more schedules
- Can be associated with 1 connection instance
- Can be associated with 1 throttle policy
- Can be associated with 0 or more routing policies
- Can be associated with 0 or more workflows (will execute sequentially)
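
To make the relationships above easier to picture, here is a minimal sketch that models a connection and a seed as plain Python data classes and splits an Aspire 4.0-style list of seed URLs into one seed per URL. All class, field, and function names are hypothetical and chosen for illustration; they are not the Aspire REST API payloads, and entities are referenced by plain string identifiers as an assumption. Treating the credential as optional is also an assumption, since the list only states that a connection can be associated with 1 credential instance.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Connection:
    name: str
    server_url: str
    connector: str                    # 1 connector instance (reusable across connections)
    credential: Optional[str] = None  # 1 credential instance (assumed optional here)


@dataclass
class Seed:
    url: str
    connection: str                                             # 1 connection instance
    schedules: List[str] = field(default_factory=list)          # one or more schedules
    throttle_policy: Optional[str] = None                       # 1 throttle policy
    routing_policies: List[str] = field(default_factory=list)   # 0 or more
    workflows: List[str] = field(default_factory=list)          # 0 or more, run sequentially


def split_seeds(urls: Iterable[str], connection: str,
                workflows: Iterable[str] = ()) -> List[Seed]:
    """Create one Seed per non-blank URL, all sharing the same connection.

    Each seed gets its own copy of the workflow list so that seeds can be
    changed independently later.
    """
    shared = list(workflows)
    return [Seed(url=u.strip(), connection=connection, workflows=list(shared))
            for u in urls if u.strip()]


# Hypothetical 4.0-style seed list for a single content source.
old_seed_urls = ["https://intranet.example.com/hr", "",
                 "https://intranet.example.com/it "]
conn = Connection(name="intranet", server_url="https://intranet.example.com",
                  connector="web-crawler", credential="svc-account")
for seed in split_seeds(old_seed_urls, conn.name, workflows=["publish"]):
    print(seed.url, "->", seed.connection, seed.workflows)
```

In a real migration each of these objects would be created through the UI or the corresponding REST API; the sketch only shows which entity holds which reference.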

What's next?

Workflows - Migration Guide
