RSS Connector Developer Information

The RSS connector uses the Rome library to download and parse each RSS feed it is configured to crawl. When content is found, its URL is published as an Aspire job. This job is then processed on a pipeline that downloads the content.

Published jobs are optionally passed through the Apache Tika text extraction stage and then through the normal connector workflow stages.

The connector maintains a timestamp of the most recent content for each feed. This allows it to identify new content and publish only that content on subsequent runs. The timestamps are stored in files in the snapshot directory.

Deletes are NOT processed by this connector. The process is purely additive. Updates happen only if a subsequent feed contains the same URL again with a later timestamp.
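
The timestamp tracking and additive behavior described above can be illustrated with a minimal sketch. This is not the connector's actual code: the class names FeedTimestampStore and FeedItem, the ".timestamp" file naming, and the ISO-8601 file format are all assumptions made for illustration. It uses only the Java standard library and leaves out the Rome parsing and Aspire job publishing steps.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type standing in for a parsed feed entry.
final class FeedItem {
    final String url;
    final Instant published;

    FeedItem(String url, Instant published) {
        this.url = url;
        this.published = published;
    }
}

// Hypothetical helper: keeps one "latest timestamp" file per feed
// in a snapshot directory (file name and format are assumptions).
final class FeedTimestampStore {
    private final Path snapshotDir;

    FeedTimestampStore(Path snapshotDir) {
        this.snapshotDir = snapshotDir;
    }

    private Path fileFor(String feedId) {
        return snapshotDir.resolve(feedId + ".timestamp");
    }

    // Returns the stored timestamp, or the epoch if the feed has never been crawled.
    Instant lastSeen(String feedId) throws IOException {
        Path file = fileFor(feedId);
        if (!Files.exists(file)) {
            return Instant.EPOCH;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        return Instant.parse(text);
    }

    private void save(String feedId, Instant latest) throws IOException {
        Files.createDirectories(snapshotDir);
        Files.write(fileFor(feedId), latest.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Returns the URLs of items newer than the stored timestamp and
    // advances the stored timestamp. Nothing is ever removed (no deletes).
    List<String> selectNew(String feedId, List<FeedItem> items) throws IOException {
        Instant previous = lastSeen(feedId);
        Instant latest = previous;
        List<String> toPublish = new ArrayList<>();
        for (FeedItem item : items) {
            if (item.published.isAfter(previous)) {
                toPublish.add(item.url);
                if (item.published.isAfter(latest)) {
                    latest = item.published;
                }
            }
        }
        if (latest.isAfter(previous)) {
            save(feedId, latest);
        }
        return toPublish;
    }
}
```

In this sketch, a second run over an unchanged feed selects nothing, while a URL that reappears with a later timestamp is selected again, which corresponds to the update case. An item that disappears from the feed has no effect.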

Information about the specific applications and components used in the RSS Connector is shown below:
