Batching is configured in the connector configuration, and the Publisher Framework (PF) honors that setting. If no batching is defined, PF creates a one-time batch that contains a single document.
At the publisher level, a developer can choose among the following batch types: BUFFER, STREAM, or NONE.
Transformers convert the AspireObjects arriving in jobs into the string representation required by the target repository. For example, publishing to Elasticsearch requires a JSON representation of the Aspire document.
XML, JSON, and simple String transformers are supported.
For lower-level control of the transformation process, use the PublisherInfo.getTransformerFactory method to create transformers, and pass streams to the transformers as parameters.
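As a rough illustration of what a JSON transformer has to do, the sketch below renders a flat field map as a JSON string and writes it to a caller-supplied stream. It deliberately does not use the Aspire API: the class `JsonTransformerSketch`, its methods, and the `Map` that stands in for an AspireObject are all hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for a JSON transformer; this is not the Aspire API.
class JsonTransformerSketch {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Renders a flat field map (standing in for an AspireObject) as one JSON object.
    static String toJson(Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        fields.forEach((k, v) -> json.add("\"" + escape(k) + "\":\"" + escape(v) + "\""));
        return json.toString();
    }

    // Stream-oriented variant: writes the JSON to a stream passed in by the caller.
    static void transform(Map<String, String> fields, OutputStream out) throws IOException {
        out.write(toJson(fields).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("title", "Say \"hi\"");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transform(doc, out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Writing to a stream rather than returning a string lets the caller decide where the output goes, for example into a buffer that collects a whole batch.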
HttpClient (HC) is provided by the HttpConnection object. When developing a publisher for a REST-based target repository, consider using this class.
HC supports REST-based APIs and can execute the GET, PUT, POST, and DELETE methods.
HC also supports streaming, which can be used for batching. For example, the Elasticsearch publisher first writes individual documents to the HC stream and then, on batch close, posts the stream to Elasticsearch.
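The write-then-post pattern described above can be sketched without any HTTP code. In the example below, documents are appended to an in-memory buffer in the newline-delimited format that the Elasticsearch bulk API expects (an action line followed by a source line), and the buffered body is handed over in one piece when the batch closes. The class `BulkBatchSketch` and the `Consumer<byte[]>` that stands in for the actual POST are assumptions made for illustration, not part of HC.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical sketch of stream-based batching; not the Aspire HttpConnection API.
class BulkBatchSketch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final String index;
    private final Consumer<byte[]> poster; // stands in for the HTTP POST to /_bulk

    BulkBatchSketch(String index, Consumer<byte[]> poster) {
        this.index = index;
        this.poster = poster;
    }

    // Appends one document: an action line followed by the source line,
    // each terminated by a newline as the bulk format requires.
    // The index name and id are assumed to need no JSON escaping here.
    void add(String id, String jsonSource) {
        String action = "{\"index\":{\"_index\":\"" + index + "\",\"_id\":\"" + id + "\"}}";
        byte[] bytes = (action + "\n" + jsonSource + "\n").getBytes(StandardCharsets.UTF_8);
        buffer.write(bytes, 0, bytes.length);
    }

    // On batch close, the whole buffered body is sent in a single request.
    void close() {
        if (buffer.size() > 0) {
            poster.accept(buffer.toByteArray());
            buffer.reset();
        }
    }

    public static void main(String[] args) {
        BulkBatchSketch batch = new BulkBatchSketch("docs",
                body -> System.out.print(new String(body, StandardCharsets.UTF_8)));
        batch.add("1", "{\"title\":\"first\"}");
        batch.add("2", "{\"title\":\"second\"}");
        batch.close(); // nothing is "posted" until this point
    }
}
```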
HC can be configured by the HttpProperties object.
HC configuration can be changed even after the object is constructed, which makes it possible to reconfigure objects that have already been created and possibly pooled. For example, the URL parameter of an existing HC can be modified when one URL is needed for normal bulk POST requests and another for actions such as an index clean.
HC supports retry logic, which is configured by specific parameters.
HC can be configured with an HttpErrorHandler. If this handler is provided, the developer receives information about connection errors or other HTTP errors and can react accordingly, either by throwing an exception or by continuing with the retry logic.
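The interplay between retry parameters and an error handler can be sketched generically as follows. This is only an illustration of the idea: `RetrySketch`, its `ErrorHandler` interface, and the `maxAttempts` and `delayMillis` parameters are hypothetical and do not reflect the actual HttpErrorHandler signature or the real HC parameter names.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry loop with an error callback; not the Aspire HttpErrorHandler API.
class RetrySketch {

    // The handler sees every failure. It returns true to continue with the
    // retry logic, returns false to stop, or throws to abort immediately.
    interface ErrorHandler {
        boolean shouldRetry(int attempt, Exception error) throws Exception;
    }

    static <T> T execute(Callable<T> request, int maxAttempts, long delayMillis,
                         ErrorHandler handler) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                last = e;
                if (!handler.shouldRetry(attempt, e)) break;       // handler gave up
                if (attempt < maxAttempts) Thread.sleep(delayMillis); // wait before retrying
            }
        }
        throw new Exception("Request failed; no more retries", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request that fails twice before succeeding.
        String result = execute(() -> {
            if (++calls[0] < 3) throw new IOException("connection refused");
            return "200 OK";
        }, 5, 100L, (attempt, error) -> {
            System.out.println("attempt " + attempt + " failed: " + error.getMessage());
            return true; // keep retrying
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```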
If a document with the action “deleteByQuery” arrives in the publisher, PF takes the appropriate action.
The query document is first transformed automatically by the configured transformer. The developer must support this transformation in the transformation script; for JSON, for example, by adding a section guarded by “if (action == "deleteByQuery")”. If this section is left empty, the document is treated as not transformed.
The developer must then implement the delete-by-query logic in the processDeleteByQuery method of the PAP class by interpreting the syntax of the “deleteByQuery” document.
When the DeleteByQuery object (which wraps the original “deleteByQuery” document) arrives in PAP.processDeleteByQuery(DeleteByQuery), the supported Visitor objects can translate it into a meaningful string representation. The provided visitor classes support the delete-by-query format created by the ArchiveExtractor utility (QueryForArchiveDefaultVisitorImpl). For example, the Elasticsearch publisher can build the part of an Elasticsearch API command that selects all previously published documents with the same “parentId”, and in this way handle all the documents of an archive.
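To make the visitor idea concrete, here is a minimal sketch that walks a tiny query tree and renders it as the body of an Elasticsearch `_delete_by_query` request. The node types (`Term`, `And`), the `Visitor` interface, and `ElasticsearchVisitor` are invented for this example. They are not the PF DeleteByQuery or QueryForArchiveDefaultVisitorImpl classes, whose actual structure is not described here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical query tree and visitor; not the PF DeleteByQuery classes.
class DeleteByQueryVisitorSketch {

    interface Visitor<R> {
        R visitTerm(Term term);
        R visitAnd(And and);
    }

    interface Node {
        <R> R accept(Visitor<R> visitor);
    }

    // Leaf node: field must equal value.
    static final class Term implements Node {
        final String field;
        final String value;
        Term(String field, String value) { this.field = field; this.value = value; }
        public <R> R accept(Visitor<R> visitor) { return visitor.visitTerm(this); }
    }

    // Inner node: all children must match.
    static final class And implements Node {
        final List<Node> children;
        And(Node... children) { this.children = Arrays.asList(children); }
        public <R> R accept(Visitor<R> visitor) { return visitor.visitAnd(this); }
    }

    // Renders the tree as Elasticsearch query DSL.
    // Field names and values are assumed to need no JSON escaping.
    static final class ElasticsearchVisitor implements Visitor<String> {
        public String visitTerm(Term term) {
            return "{\"term\":{\"" + term.field + "\":\"" + term.value + "\"}}";
        }
        public String visitAnd(And and) {
            StringJoiner must = new StringJoiner(",", "{\"bool\":{\"must\":[", "]}}");
            for (Node child : and.children) {
                must.add(child.accept(this));
            }
            return must.toString();
        }
    }

    // Wraps the rendered query in the request body for _delete_by_query.
    static String toRequestBody(Node query) {
        return "{\"query\":" + query.accept(new ElasticsearchVisitor()) + "}";
    }

    public static void main(String[] args) {
        // Select every document that was published from the same archive.
        System.out.println(toRequestBody(new Term("parentId", "archive-42")));
    }
}
```

Keeping the rendering in a visitor means a publisher for a different repository could supply its own visitor over the same query tree without changing the node classes.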
The Simple File (SF) publisher comes as part of PF.
SF publishes all documents into a single file.
SF was developed to help developers who are new to PF and want to learn how to develop and deploy a specific publisher.
When learning PF, it is advisable to build and deploy this publisher first and then, after running a crawl, check the result in the publisher output file.
Resources/dxf/publisher.xml is an example of how to create a DXF file with publisher-specific parameters.
Resources/aspire.properties is an example of how to use parameters to merge the general DXF coming from PF itself with the specific DXF provided by the publisher, and to hide parts of the general DXF.