As of Aspire 4.0, Elasticsearch is a supported NoSQL database that can be used to maintain the Crawl State.
The Aspire Elasticsearch Provider is the component that is responsible for talking to Elasticsearch on behalf of Aspire. All configuration for the Elasticsearch Provider in Aspire is done in the settings.xml file.
The Elasticsearch NoSQL Provider for Aspire requires Elasticsearch 7.x to run. It does not run with previous versions.
<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
  <implementation>com.searchtechnologies.aspire:aspire-elasticsearch-provider</implementation>
  <url>http://localhost:9200</url>
</noSQLConnectionProvider>
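Because the provider requires Elasticsearch 7.x, it can help to confirm the server version before pointing Aspire at it. The following is a minimal sketch, not part of Aspire: it assumes the server at the configured URL answers `GET /` with the standard Elasticsearch info document, which contains `version.number`. The function names are hypothetical and chosen for illustration.

```python
import json
import urllib.request


def major_version(version: str) -> int:
    """Return the major component of a version string such as '7.10.2'."""
    return int(version.split(".")[0])


def is_supported(version: str) -> bool:
    """True only for Elasticsearch 7.x, the release line the provider requires."""
    return major_version(version) == 7


def fetch_version(url: str = "http://localhost:9200") -> str:
    """Read version.number from the Elasticsearch root endpoint (GET /).

    Assumes an unauthenticated server reachable at `url`.
    """
    with urllib.request.urlopen(url, timeout=5) as response:
        info = json.load(response)
    return info["version"]["number"]
```

With a server running, `is_supported(fetch_version())` reports whether the cluster at the default URL is on a 7.x release.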
Aspire creates one set of Elasticsearch indexes for each configured content source. When a content source is deleted, its indexes are dropped. The index name has the following structure:
Examples of index names: aspire-dev-aspider_web_crawler-processqueue, aspire-dev-aspider_web_crawler-snapshot, aspire-dev-group_expansion_manager-usersandgroups
<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
  <implementation>com.searchtechnologies.aspire:aspire-elasticsearch-provider</implementation>
  <url>http://localhost:9200</url>
  <authentication type="basic">
    <username>admin</username>
    <password>encrypted:password</password>
  </authentication>
</noSQLConnectionProvider>
The Elasticsearch provider can be configured to use Basic authentication if the Elasticsearch server administrator requires it. Both a username and a password must be provided, and the password must be encrypted with the standard Aspire encryption utilities.
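To verify a set of credentials against the Elasticsearch server independently of Aspire, the Basic authentication header can be built by hand. This is a minimal sketch under stated assumptions: it uses the plain-text password (not the `encrypted:` value stored in settings.xml), and the helper names are hypothetical. It shows only the standard HTTP Basic scheme, not how the provider handles credentials internally.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value from plain-text credentials."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Create a request for `url` that carries the Basic Authorization header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Passing the result of `build_request("http://localhost:9200", "admin", "<plain password>")` to `urllib.request.urlopen` sends an authenticated `GET /`; an HTTP 401 response suggests the credentials are not accepted by the server.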
<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
  <implementation>com.searchtechnologies.aspire:aspire-elasticsearch-provider</implementation>
  <url>http://localhost:9200</url>
  <claimPrefetch>300</claimPrefetch>
  <claim>100</claim>
</noSQLConnectionProvider>
The Elasticsearch provider can claim items from queues in larger batches. After the status of the claimed items changes, they are sent back to Elasticsearch as a single bulk unit, which improves performance. The claim batch size parameters can be tuned to the current Aspire installation (for example, standalone or distributed mode). This is how it works:
<!-- noSql database provider for the 4.0 connector framework -->
<noSQLConnectionProvider>
  <implementation>com.searchtechnologies.aspire:aspire-elasticsearch-provider</implementation>
  <url>http://localhost:9200</url>
  <claimPrefetch>300</claimPrefetch>
  <claim>100</claim>
  <keepSearchContextAlive>5m</keepSearchContextAlive>
  <authentication type="basic">
    <username>admin</username>
    <password>encrypted:password</password>
  </authentication>
  <debugOutFile>/tmp/aspire/profile.txt</debugOutFile>
  <maxRetries>3</maxRetries>
</noSQLConnectionProvider>
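The idea of sending status changes for claimed queue items back to Elasticsearch as one bulk unit can be illustrated with the standard `_bulk` request format. This is only a sketch and not the provider's actual code: the `status` field name and both helper functions are assumptions made for illustration, and the batch size of 100 simply mirrors the `<claim>` value shown in the configuration.

```python
import json
from typing import Iterable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split `items` into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def bulk_update_body(index: str, item_ids: Iterable[str], new_status: str) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch _bulk API.

    Each item contributes an action line and a partial-document line.
    The 'status' field is a hypothetical name, not Aspire's real schema.
    """
    lines = []
    for item_id in item_ids:
        lines.append(json.dumps({"update": {"_index": index, "_id": item_id}}))
        lines.append(json.dumps({"doc": {"status": new_status}}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Each chunk yielded by `batches(ids, 100)` would then become one `POST /_bulk` request with `Content-Type: application/x-ndjson`, so 250 claimed items result in three requests instead of 250 individual updates.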