
The biggest change in Aspire 4.0 is in the way the connectors work. They now use an external database (MongoDB, HBase, or Elasticsearch) to hold all of the crawling information, such as document URLs, statuses, statistics, snapshots (for incremental crawls), and logs. This allows the connectors to work in a distributed fashion by architectural design.
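As a concrete illustration, here is a minimal sketch of what one externalized crawl-state record might look like, written with the MongoDB Java driver. The database, collection, and field names (aspire, crawlState, url, status, snapshotSignature) are hypothetical assumptions for illustration, not Aspire's actual schema.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class CrawlStateSketch {
    public static void main(String[] args) {
        // Connect to the external database that holds the crawl state.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> crawlState =
                    client.getDatabase("aspire").getCollection("crawlState");

            // One record per discovered document: its URL, processing status,
            // and a signature kept for snapshot comparison on incremental crawls.
            crawlState.insertOne(new Document("url", "http://repo.example.com/docs/42")
                    .append("status", "QUEUED")
                    .append("snapshotSignature", "a1b2c3"));
        }
    }
}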

All of the connectors run under the same principles, using the same logic, so each connector is more like a Repository Access Provider. We keep them as simple as possible, rather than building each one as a complex (multi-threaded) crawling application. The complexity of distributed crawling and multi-threading lives in the Connector Framework.


Responsibilities that the Connector developers have to implement (a minimal sketch follows this list):

  • Scan the repository document containers to discover new documents to process
  • Populate document metadata
  • Fetch document content
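To make these three responsibilities concrete, here is a minimal Java sketch of the shape such a Repository Access Provider could take. The interface and the DocumentInfo type are illustrative assumptions, not Aspire's actual connector API.

import java.io.InputStream;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only -- not the real Aspire API. */
interface RepositoryAccessProvider {

    /** Scan a container (folder, site, table, ...) and report the items found in it. */
    List<DocumentInfo> scan(DocumentInfo container);

    /** Populate the metadata of a single document. */
    Map<String, Object> populateMetadata(DocumentInfo doc);

    /** Fetch the raw content stream of a single document. */
    InputStream fetchContent(DocumentInfo doc);
}

/** Minimal descriptor for a repository item (hypothetical). */
class DocumentInfo {
    final String url;
    final boolean isContainer;

    DocumentInfo(String url, boolean isContainer) {
        this.url = url;
        this.isContainer = isContainer;
    }
}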

If you want to create your connector right away, go to Write your own from Scratch.

Responsibilities of the Connector Framework (you don't have to worry about these; a sketch of how the framework might drive a connector follows this list):

  • Handle multi-threaded processing
  • Distribute the crawl processing
  • Store and fetch documents from the database
  • Maintain a snapshot for incremental crawling (adding, updating, or deleting documents)
  • Handle statistics
  • Start, pause, stop, and resume the crawl
  • Send the documents to the respective workflows for processing and search engine indexing
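To show how that complexity stays on the framework side, here is a hedged sketch of a framework-style driver that runs the connector callbacks on a thread pool, reusing the RepositoryAccessProvider and DocumentInfo types from the sketch above. The class name and logic are simplified assumptions, not the actual Connector Framework.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the framework side -- not the real Connector Framework. */
class CrawlDriver {
    private final RepositoryAccessProvider connector;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    CrawlDriver(RepositoryAccessProvider connector) {
        this.connector = connector;
    }

    /** Recursively scan containers; process leaf documents on worker threads. */
    void crawl(DocumentInfo root) {
        for (DocumentInfo child : connector.scan(root)) {
            if (child.isContainer) {
                crawl(child); // keep discovering nested containers
            } else {
                pool.submit(() -> {
                    // The framework, not the connector, decides when these run,
                    // compares snapshots, records statistics, and routes the
                    // result to the workflow and the search engine index.
                    connector.populateMetadata(child);
                    connector.fetchContent(child);
                });
            }
        }
    }

    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}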

 

 

The following diagram illustrates how the Connector Framework interacts with the connector implementation in order to run a crawl:

[Diagram: the Connector Framework interacting with the connector implementation during a crawl]

Tip

If you want to learn more about the Connector Framework, check out:

What's next?

[Child pages of this section]

