General scheme

  • Technically, a publisher is an Aspire workflow component made up of two bundles: a developer-provided bundle with the code specific to the target repository to which documents coming from connectors are published, and a generic bundle called the Publisher Framework (PF), which holds the classes, interfaces and configuration shared among all publishers.

  • The publisher developer provides an OSGi bundle whose main content is an implementation of the PublisherAccessProvider (PAP) interface, with the code specific to the target repository (e.g. Elasticsearch, Solr). The developer must implement PAP methods such as processAddUpdate and processDelete, which are used when publishing the documents coming from connectors. When Aspire loads this bundle, it always loads the PF bundle as well, which holds the classes common to all publishers, e.g. the PublisherControllerImpl class. The developer's bundle also contains the DXF file for the publisher-specific configuration, while the PF bundle contains the DXF file with the general configuration that all publishers can use.

  • When an Aspire crawl starts, the PF PublisherControllerImpl object is the first point at which all documents coming from connectors arrive. From there they are propagated to other PF objects, mainly the PAP, to be published with the help of PF connection objects.

  • PublisherControllerImpl holds one PublisherInfo object, created by the PAP when the specific publisher is loaded. PublisherInfo contains the values of all configuration parameters provided by the user in the DXF form.

  • When loaded, the publisher contains one instance of the PublisherControllerImpl class and one instance of the PublisherInfo class, both shared among all threads.

  • PublisherControllerImpl holds one instance of the PAP class, shared among all publishing threads.

  • PublisherControllerImpl manages a connection pool of PublisherRepositoryConnection implementation objects.

  • Connection objects are created by the PublisherConnectionController implementation. PublisherInfo provides the PublisherConnectionController implementation object.

  • Connection objects must be able to authenticate to the target repository with the provided credentials.

  • The PAP uses connection objects when writing data to target repositories.

  • Some very common connection classes, such as HttpClient for REST, are provided by the PF itself; others must be provided by the developer.

  • PublisherControllerImpl also handles batching. Batches are the means of grouping documents into larger units before sending them to the target repository. A batch normally requests a connection object from the pool and releases it after the batch close method is called. Several threads can participate in the same batch, so connections must be thread-safe.

  • PublisherControllerImpl creates standard Aspire ComponentBatch objects based on the information in incoming jobs. The ComponentBatch objects then create PublisherBatch objects, which are passed to the PAP methods.
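
The division of work described above (one shared controller, one shared provider, pooled connections) can be sketched in a few lines of plain Java. This is a minimal illustration and not the PF API: every class name starting with Sketch is hypothetical, documents are reduced to plain strings, and the parameter lists of processAddUpdate and processDelete are assumptions, since the real signatures are not given on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a PublisherRepositoryConnection implementation.
final class SketchConnection {
    private final List<String> sent = new ArrayList<>();

    // Synchronized because several threads may work on the same batch.
    synchronized void send(String payload) { sent.add(payload); }

    synchronized List<String> sent() { return new ArrayList<>(sent); }
}

// Hypothetical stand-in for the PAP: the only repository-specific part.
interface SketchAccessProvider {
    void processAddUpdate(String doc, SketchConnection connection);
    void processDelete(String docId, SketchConnection connection);
}

// Example provider that "publishes" by recording commands on the connection.
final class SketchRecordingProvider implements SketchAccessProvider {
    public void processAddUpdate(String doc, SketchConnection c) { c.send("index:" + doc); }
    public void processDelete(String docId, SketchConnection c) { c.send("delete:" + docId); }
}

// Hypothetical stand-in for the controller: one shared provider, one pool.
final class SketchController {
    private final SketchAccessProvider provider;
    private final BlockingQueue<SketchConnection> pool;

    SketchController(SketchAccessProvider provider, int poolSize) {
        this.provider = provider;
        this.pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new SketchConnection());
        }
    }

    int idleConnections() { return pool.size(); }

    // Publishes one batch: borrow a connection, hand every document to the
    // provider, and always return the connection to the pool afterwards.
    SketchConnection publishBatch(List<String> docs) {
        SketchConnection connection = pool.poll();
        if (connection == null) {
            throw new IllegalStateException("no idle connection in the pool");
        }
        try {
            for (String doc : docs) {
                provider.processAddUpdate(doc, connection);
            }
        } finally {
            pool.add(connection);
        }
        return connection;
    }
}
```

In this sketch the provider never creates or closes connections itself; it only receives one per call, which is what allows the controller to reuse connections across batches.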

Batching

  • Batches are configured in the connector configuration, and the PF respects this. If no batching is defined, the PF creates a one-time batch containing a single document.

  • At the publisher level, the developer can choose among the following batch types: BUFFER, STREAM and NONE.

  • For the STREAM batch type, the PF gets a connection from the pool at batch start and passes that same connection to the PAP methods for the whole duration of the batch. The connection is released when the batch is closed.

  • For the BUFFER batch type, the connection is claimed from the pool at the beginning of batch close, passed to the PAP endBatch method and released afterwards. This means that the developer should buffer all documents during the batch. For this purpose, a so-called batch data buffer is available in the PublisherBatch object.

  • Besides the batch types mentioned above, the PF also supports so-called multi-server batches. The batch factory creates this kind of batch when more than one URL is provided in the configuration. The purpose is to support publishing documents to multiple servers. Broadcasting and round robin are supported.

  • A BatchAdapter object is also available in PublisherBatch. It can be used for reporting errors and other messages to the Aspire framework.
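
The difference between the STREAM and BUFFER batch types is mainly a question of when the connection is claimed. The sketch below records the order of events for both types. It is an illustration under assumptions, not PF code: SketchBatchType, SketchPool and SketchBatch are hypothetical names, connections are represented by plain strings, and a StringBuilder stands in for the batch data buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

enum SketchBatchType { STREAM, BUFFER }

// Hypothetical pool that also logs what happens, so the order is visible.
final class SketchPool {
    private final Deque<String> idle = new ArrayDeque<>();
    final List<String> events = new ArrayList<>();

    SketchPool(String... connectionNames) {
        for (String name : connectionNames) {
            idle.push(name);
        }
    }

    String claim() {
        String connection = idle.pop();
        events.add("claim");
        return connection;
    }

    void release(String connection) {
        idle.push(connection);
        events.add("release");
    }
}

final class SketchBatch {
    private final SketchBatchType type;
    private final SketchPool pool;
    private final StringBuilder buffer = new StringBuilder(); // stand-in for the batch data buffer
    private String connection;

    SketchBatch(SketchBatchType type, SketchPool pool) {
        this.type = type;
        this.pool = pool;
        if (type == SketchBatchType.STREAM) {
            connection = pool.claim(); // STREAM: claimed at batch start
        }
    }

    void add(String doc) {
        if (type == SketchBatchType.STREAM) {
            pool.events.add("send " + doc + " via " + connection);
        } else {
            buffer.append(doc).append('\n'); // BUFFER: nothing is sent yet
            pool.events.add("buffer " + doc);
        }
    }

    void close() {
        if (type == SketchBatchType.BUFFER) {
            connection = pool.claim(); // BUFFER: claimed only at batch close
            pool.events.add("flush " + buffer.length() + " chars via " + connection);
        }
        pool.release(connection);
        connection = null;
    }
}
```

With BUFFER the connection is held only for the flush, which keeps the pool free while documents accumulate; with STREAM it is held for the whole batch, which avoids buffering at the cost of a longer hold time.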

Transformers

  • Transformers convert the AspireObjects arriving in jobs into the String representation required by the target repository. For example, when publishing to Elasticsearch, a JSON structure of the Aspire document has to be created.

  • XML, JSON and simple String transformers are supported.

  • Transformers are configured by specifying a transform file: a Groovy script for the JSON transformer or an XSLT template for the XML transformer.

  • Transform files are typically provided by the developer of the specific publisher. For example, the Elasticsearch publisher bundle comes pre-packaged with a transform.groovy script. At runtime, users can configure the publisher with their own transform file.

  • Transformer functionality is available through the PublisherInfo.transform(AspireObject doc) method, which returns the string result of the transformation.

  • For lower-level handling of the transformation process, the PublisherInfo.getTransformerFactory method lets the developer create transformers and use streams passed to them as parameters.
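
The contract of a transformer (document in, repository-specific string out) can be illustrated as follows. This is a sketch under assumptions: a Map stands in for an AspireObject, the Sketch names are hypothetical, and the conversion is hard-coded in Java, whereas the real JSON and XML transformers are driven by a Groovy script or an XSLT template.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical transformer contract: one document in, one string out.
interface SketchTransformer {
    String transform(Map<String, String> doc);
}

// Renders the document as a flat JSON object.
final class SketchJsonTransformer implements SketchTransformer {
    public String transform(Map<String, String> doc) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> field : doc.entrySet()) {
            json.add(quote(field.getKey()) + ":" + quote(field.getValue()));
        }
        return json.toString();
    }

    // Escapes only backslashes and double quotes; control characters are
    // not handled in this sketch.
    private static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}

// Renders the document as plain "name=value" lines.
final class SketchStringTransformer implements SketchTransformer {
    public String transform(Map<String, String> doc) {
        StringJoiner lines = new StringJoiner("\n");
        for (Map.Entry<String, String> field : doc.entrySet()) {
            lines.add(field.getKey() + "=" + field.getValue());
        }
        return lines.toString();
    }
}
```

Because both implementations share one interface, the publishing code can stay the same while the output format is swapped, which is the role the configurable transform file plays in the PF.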

HttpClient

  • HttpClient (HC) is provided by the HttpConnection object. 

  • Anyone developing a publisher for REST-based target repositories should consider using this class.

  • HC was developed primarily for writing AspireObject documents.

  • If required, HC uses transformers to convert AspireObjects before writing.

  • HC supports REST-based APIs and can execute the GET, PUT, POST and DELETE methods.

  • HC also supports streaming, which can be used in batching. For example, the Elasticsearch publisher first writes single documents to the HC stream, and on batch close this stream is posted to Elasticsearch.

  • HC can be configured by the HttpProperties object. 

  • HC configuration is flexible enough to accept changes even after the object is constructed. This makes it possible to reconfigure objects that are already created and possibly already pooled. For example, the URL parameter value of an existing HC can be modified when one URL is needed for the normal bulk POST and another for actions such as an index clean.

  • HC supports retry logic configured by specific parameters 

  • HC can be configured with an HttpErrorHandler. If this handler is provided, the developer can get information about connection errors or other HTTP errors and react accordingly, either by throwing an exception or by continuing with the retry logic.
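
The interplay of retry logic and an error handler can be sketched as a loop that asks the handler after every failed attempt whether to continue. This is a minimal illustration and not the HC implementation: SketchDecision, SketchErrorHandler and SketchRetryingClient are hypothetical names, the request is reduced to a supplier of HTTP status codes, and there is no delay between attempts.

```java
import java.util.function.IntSupplier;

// Hypothetical outcome of consulting the error handler.
enum SketchDecision { RETRY, FAIL }

// Hypothetical handler: sees the status and the attempt number, then decides.
interface SketchErrorHandler {
    SketchDecision onError(int status, int attempt);
}

final class SketchRetryingClient {
    private final int maxAttempts;
    private final SketchErrorHandler handler;
    private int attemptsMade;

    SketchRetryingClient(int maxAttempts, SketchErrorHandler handler) {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    int attemptsMade() { return attemptsMade; }

    // Runs the request until it returns a 2xx status, the handler asks to
    // fail, or the attempts are used up. A real client would normally wait
    // between attempts; that is left out here.
    int execute(IntSupplier request) {
        attemptsMade = 0;
        while (true) {
            attemptsMade++;
            int status = request.getAsInt();
            if (status >= 200 && status < 300) {
                return status;
            }
            SketchDecision decision = handler.onError(status, attemptsMade);
            if (decision == SketchDecision.FAIL || attemptsMade >= maxAttempts) {
                throw new IllegalStateException(
                        "HTTP " + status + " after " + attemptsMade + " attempt(s)");
            }
        }
    }
}
```

A typical handler in this sketch would return RETRY for server-side errors (5xx) and FAIL for client errors (4xx), since repeating a malformed request is unlikely to help.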

Delete By Query

  • If a document with the action “deleteByQuery” arrives in the publisher, the PF takes the appropriate action.

  • The query document is first transformed automatically by the configured transformer. The developer must support this in the transformation script; for example, in the case of JSON, a dedicated section can be introduced with an “if (action == "deleteByQuery")” statement. If this section is left empty, the document is treated as not transformed.

  • The developer must then implement the delete-by-query logic in the processDeleteByQuery method of the PAP class, by interpreting the syntax of the “deleteByQuery” document.

  • On arrival in PAP.processDeleteByQuery(DeleteByQuery), the DeleteByQuery object (which wraps the original “deleteByQuery” document) can be translated by the supported Visitor objects into a meaningful string representation. The prepared visitor classes support the delete-by-query format created by the ArchiveExtractor utility (QueryForArchiveDefaultVisitorImpl). For example, in the Elasticsearch publisher, part of an Elasticsearch API command can be created that selects all previously published documents with the same “parentId”, and in this way all documents of the archive are handled.
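
The idea of translating a query object into a repository-specific string with a visitor can be sketched as follows. The structure of the real “deleteByQuery” document and the API of the PF visitor classes are not described on this page, so everything here is an assumption: the Sketch names are hypothetical, the query model has only an "equals" and an "and" node, and field names and values are not escaped. The output uses the Elasticsearch Query DSL "term" and "bool" queries.

```java
// Hypothetical visitor over a tiny query model.
interface SketchQueryVisitor<R> {
    R visitEquals(String field, String value);
    R visitAnd(SketchQueryNode left, SketchQueryNode right);
}

interface SketchQueryNode {
    <R> R accept(SketchQueryVisitor<R> visitor);
}

// "field equals value", e.g. parentId equals the id of an archive.
final class SketchEquals implements SketchQueryNode {
    private final String field;
    private final String value;

    SketchEquals(String field, String value) {
        this.field = field;
        this.value = value;
    }

    public <R> R accept(SketchQueryVisitor<R> visitor) {
        return visitor.visitEquals(field, value);
    }
}

// Both sub-queries must match.
final class SketchAnd implements SketchQueryNode {
    private final SketchQueryNode left;
    private final SketchQueryNode right;

    SketchAnd(SketchQueryNode left, SketchQueryNode right) {
        this.left = left;
        this.right = right;
    }

    public <R> R accept(SketchQueryVisitor<R> visitor) {
        return visitor.visitAnd(left, right);
    }
}

// Renders the query as an Elasticsearch Query DSL fragment (no escaping).
final class SketchElasticVisitor implements SketchQueryVisitor<String> {
    public String visitEquals(String field, String value) {
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    public String visitAnd(SketchQueryNode left, SketchQueryNode right) {
        return "{\"bool\":{\"must\":[" + left.accept(this) + "," + right.accept(this) + "]}}";
    }
}
```

A publisher for a different repository would keep the query model and supply only another visitor, which is the benefit of separating the query structure from its string rendering.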

Simple File

  • The Simple File (SF) publisher comes as part of the PF.

  • SF publishes all documents into a single file.

  • SF was developed to help developers who are new to the PF and want to learn how to develop and deploy a specific publisher.

  • When learning the PF, it is advisable to build and deploy this publisher first, run a crawl, and then check the result in the publisher output file.

  • Resources/dxf/publisher.xml is an example of how to create a DXF file with publisher-specific parameters.

  • Resources/aspire.properties is an example of how to use parameters to merge the general DXF coming from the PF itself with the specific DXF provided by the publisher, and to hide parts of the general DXF.


