The biggest change in Aspire 4.0 concerns how connectors work: they now use an external database (MongoDB) to hold all crawl information, such as document URLs, status, statistics, snapshots (for incremental crawls), and logs. The idea behind this change is to make distributed operation part of the connectors' architectural design from the start.

All connectors now run under the same principles and share the same logic. Each connector acts as a Repository Access Provider and is kept as simple as possible, rather than being a complex (multi-threaded) crawling application; the complexity of distributed crawling and multi-threading lives in the Connector Framework.

Responsibilities of the connector implementation (what developers have to implement):

Scan the repository document containers to discover new documents to process
Populate document metadata
Fetch document content
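
The three responsibilities above can be pictured as a small interface. The following is only a minimal sketch: the names RepositoryAccessProvider, scan, populate, fetch and InMemoryProvider are hypothetical and are not taken from the Aspire API, and the in-memory maps stand in for a real repository.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout: this is NOT the Aspire API, only an
// illustration of the three responsibilities of a connector implementation.
interface RepositoryAccessProvider {
    /** Scan a container and return the ids of the items found directly inside it. */
    List<String> scan(String containerId);

    /** Populate the metadata of a single item. */
    Map<String, String> populate(String itemId);

    /** Fetch the raw content of a single item. */
    byte[] fetch(String itemId);
}

/** Toy provider backed by in-memory maps instead of a real repository. */
class InMemoryProvider implements RepositoryAccessProvider {
    private final Map<String, List<String>> containers = new LinkedHashMap<>();
    private final Map<String, String> contents = new LinkedHashMap<>();

    /** Register a document under a container (test helper, not part of the interface). */
    void addDocument(String containerId, String itemId, String content) {
        containers.computeIfAbsent(containerId, k -> new ArrayList<>()).add(itemId);
        contents.put(itemId, content);
    }

    @Override
    public List<String> scan(String containerId) {
        // Return a copy so callers cannot modify the provider's internal state.
        return new ArrayList<>(containers.getOrDefault(containerId, new ArrayList<>()));
    }

    @Override
    public Map<String, String> populate(String itemId) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("id", itemId);
        // Unknown items are treated as empty documents in this toy example.
        metadata.put("length", String.valueOf(contents.getOrDefault(itemId, "").length()));
        return metadata;
    }

    @Override
    public byte[] fetch(String itemId) {
        return contents.getOrDefault(itemId, "").getBytes(StandardCharsets.UTF_8);
    }
}
```

Note that nothing in this sketch deals with threads, queues or persistence; in the design described here, those concerns belong to the Connector Framework.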

Responsibilities of the Connector Framework (you don't have to worry about this):

Multi-threaded processing
Distribute the crawl processing
Store and fetch documents from the database
Maintain a snapshot for incremental crawling
Handle statistics
Start, Pause, Stop, Resume the crawl
Send the documents to the respective workflows for processing and search engine indexing
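
To give a feel for what "maintain a snapshot for incremental crawling" can involve, here is a minimal sketch of one common approach: compare the previous snapshot with the current scan and classify each item as added, updated or deleted. This page does not describe how Aspire actually performs the comparison, so the SnapshotDiff class, the Action enum and the idea of a per-item signature (for example a last-modified value or a hash) are assumptions made for illustration. Plain in-memory maps stand in for the snapshot held in the database.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Aspire: classifies items by comparing two
// snapshots, each mapping an item id to a signature string.
class SnapshotDiff {
    enum Action { ADD, UPDATE, DELETE }

    /**
     * Returns the action needed for every item that changed between the two
     * snapshots. Items whose signature is unchanged are omitted.
     */
    static Map<String, Action> diff(Map<String, String> previous,
                                    Map<String, String> current) {
        Map<String, Action> actions = new TreeMap<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldSignature = previous.get(entry.getKey());
            if (oldSignature == null) {
                // Present now but not in the previous snapshot.
                actions.put(entry.getKey(), Action.ADD);
            } else if (!oldSignature.equals(entry.getValue())) {
                // Present in both snapshots with a different signature.
                actions.put(entry.getKey(), Action.UPDATE);
            }
        }
        for (String id : previous.keySet()) {
            if (!current.containsKey(id)) {
                // Present in the previous snapshot but gone now.
                actions.put(id, Action.DELETE);
            }
        }
        return actions;
    }
}
```

With a scheme like this, an incremental crawl only needs to send the changed items to the workflows; unchanged documents are skipped.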

The following diagram illustrates how the Connector Framework interacts with the connector implementation in order to run a crawl:

[Diagram: interaction between the Connector Framework and the connector implementation during a crawl]

If you want to learn more about the Connector Framework, check out NoSQL Connector Framework.

...
