General Scheme


The publisher is technically an Aspire workflow component consisting of two bundles: one provided by a developer, with the code specific to the target repository where documents coming from connectors are published, and a generic bundle called the Publisher Framework (PF), with classes, interfaces and configurations shared among all publishers.

The following describes how the process works.

  1. The publisher developer provides an OSGi bundle containing mainly an implementation of the PublisherAccessProvider (PAP) interface, with the code specific to the target repository, e.g. Elasticsearch or Solr.
  2. The developer must implement PAP methods such as processAddUpdate and processDelete, which are used when publishing documents coming from connectors.
    • When loading this bundle, Aspire also loads another bundle, called the Publisher Framework, with classes common to all publishers, e.g. the PublisherControllerImpl class.
    • The developer's bundle also contains the DXF file for the specific configuration parts, while the Publisher Framework bundle contains the DXF file with general configuration that can be utilized by all publishers.
  3. When an Aspire crawl starts, the Publisher Framework PublisherControllerImpl object is the first point where all documents coming from connectors arrive. They are then propagated to other Publisher Framework objects, mainly PAP, to be published with the help of Publisher Framework connection objects.
    • PublisherControllerImpl holds one PublisherInfo object created by PAP when the specific publisher is loaded.
    • PublisherInfo contains the values of all configuration parameters provided by the user in the DXF form.
  4. When loaded, the publisher contains one instance of the PublisherControllerImpl class and one instance of the PublisherInfo class, shared among all threads.
    • PublisherControllerImpl holds one instance of the PAP class, shared among all the publishing threads.
    • PublisherControllerImpl handles a connection pool of PublisherRepositoryConnection implementation objects.
  5. Connection objects are created by the PublisherConnectionController implementation.
    • PublisherInfo provides the PublisherConnectionController implementation object.
  6. Connection objects must be able to authenticate to the target repository with the provided credentials.
  7. PAP uses connection objects when writing data into target repositories.
  8. Some very common connection classes are provided by the Publisher Framework, like HttpClient for REST. Others must be provided by the developer.
  9. PublisherControllerImpl also handles batching.
    • Batches are the means of grouping documents into larger units before sending them to the target repository.
    • A batch normally requests a connection object from the pool and releases it after the batch close method is issued.
    • Multiple threads can participate in the same batch, hence connection objects must be thread safe.
  10. PublisherControllerImpl creates standard Aspire ComponentBatch objects based on information in incoming jobs.
  11. PublisherBatch objects are then created by ComponentBatch objects to be passed to PAP methods.
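
The division of work between the framework controller and the developer's PAP class can be sketched as follows. This is a minimal illustration only: SketchAccessProvider, SketchConnection and SketchController are hypothetical stand-ins, the method signatures are simplified assumptions rather than the real PublisherAccessProvider contract, and the action names "add", "update" and "delete" are assumed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the PAP contract; not the real Aspire interface. */
interface SketchAccessProvider {
    void processAddUpdate(String id, String content, SketchConnection connection);
    void processDelete(String id, SketchConnection connection);
}

/** Hypothetical connection to a target repository, here just an in-memory map. */
class SketchConnection {
    final Map<String, String> repository = new LinkedHashMap<>();
}

/** The repository-specific code a publisher developer would supply. */
class InMemoryAccessProvider implements SketchAccessProvider {
    @Override
    public void processAddUpdate(String id, String content, SketchConnection connection) {
        connection.repository.put(id, content); // add a new document or overwrite an old one
    }

    @Override
    public void processDelete(String id, SketchConnection connection) {
        connection.repository.remove(id);
    }
}

/** Plays the role of the framework controller: one shared provider, dispatch by action. */
class SketchController {
    private final SketchAccessProvider provider;
    private final SketchConnection connection = new SketchConnection();

    SketchController(SketchAccessProvider provider) {
        this.provider = provider;
    }

    void publish(String action, String id, String content) {
        switch (action) {
            case "add":
            case "update":
                provider.processAddUpdate(id, content, connection);
                break;
            case "delete":
                provider.processDelete(id, connection);
                break;
            default:
                throw new IllegalArgumentException("Unsupported action: " + action);
        }
    }

    Map<String, String> repositoryView() {
        return connection.repository;
    }
}
```

The controller never touches the repository directly; everything repository-specific stays in the provider, which is why one provider instance can be shared as long as it keeps no per-document state.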

Batching

Batches are configured in the connector configuration, and the Publisher Framework respects this. If no batching is defined, the Publisher Framework creates a one-time batch containing only one document.

On the publisher level, a developer can choose among the following batch types: BUFFER, STREAM and NONE.

• For the STREAM batch type, the Publisher Framework gets a connection from the pool on batch start and keeps passing this connection to PAP methods over the course of the whole batch.
  • The connection is released when closing the batch.
• For the BUFFER batch type, the connection is claimed from the pool at the beginning of batch close, passed to the PAP endBatch method and released afterwards.
  • This means that the developer should buffer all documents over the course of the batch. For this purpose, a so-called batch data buffer is available in the PublisherBatch object.
• The Publisher Framework also supports so-called multi-server batches.
  • The batch factory creates this kind of batch when more than one URL is provided in the configuration.
  • The purpose is to support publishing documents to multiple servers.
  • Broadcasting and round robin are supported.
• There is also a BatchAdapter object available in PublisherBatch.
  • This object can be used for reporting errors and other messages to the Aspire framework.
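
The difference between the STREAM and BUFFER connection lifecycles can be sketched as follows. All class names here (PooledConnection, ConnectionPool, SketchBatch) are hypothetical and the pool is deliberately single-threaded. The sketch only illustrates when the connection is claimed and released, not how the Publisher Framework implements it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical pooled connection that records what was sent through it. */
class PooledConnection {
    final List<String> sent = new ArrayList<>();
    void send(String payload) { sent.add(payload); }
}

/** Minimal single-threaded pool; a real pool would need to be thread safe. */
class ConnectionPool {
    private final Deque<PooledConnection> idle = new ArrayDeque<>();
    int claimed = 0; // number of connections currently in use

    ConnectionPool(int size) {
        for (int i = 0; i < size; i++) idle.push(new PooledConnection());
    }
    PooledConnection claim() { claimed++; return idle.pop(); }
    void release(PooledConnection c) { claimed--; idle.push(c); }
}

enum BatchType { STREAM, BUFFER }

class SketchBatch {
    private final BatchType type;
    private final ConnectionPool pool;
    private final List<String> buffer = new ArrayList<>(); // used by BUFFER only
    private PooledConnection connection;                   // held by STREAM only

    SketchBatch(BatchType type, ConnectionPool pool) {
        this.type = type;
        this.pool = pool;
        if (type == BatchType.STREAM) connection = pool.claim(); // claimed at batch start
    }

    void add(String document) {
        if (type == BatchType.STREAM) connection.send(document); // written immediately
        else buffer.add(document);                               // held until close
    }

    /** Closes the batch and returns the connection that carried the data. */
    PooledConnection close() {
        PooledConnection used;
        if (type == BatchType.STREAM) {
            used = connection;
        } else {
            used = pool.claim();                  // claimed only at batch close
            used.send(String.join("\n", buffer)); // one bulk write of the buffered documents
        }
        pool.release(used);
        return used;
    }
}
```

A STREAM batch occupies a pooled connection for its whole lifetime, while a BUFFER batch holds one only during close, at the cost of keeping all documents in memory until then.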



Transformers

Transformers convert the AspireObjects arriving in jobs into the String representation required by the target repository. For example, when publishing to Elasticsearch we need to create a JSON structure of the Aspire document.

We support XML, JSON and simple String transformers.

• Transformers are configured by specifying a transform file: a Groovy script for the JSON transformer or an XSLT template for the XML transformer.
• Transform files are typically provided by the developer of the specific publisher. For example, the Elasticsearch publisher bundle is pre-packaged with a transform.groovy script. At runtime, the user can configure the publisher with their own transform file.
• Transformer functionality can be used by calling the PublisherInfo.transform(AspireObject doc) method, which produces the string result of the transformation.

For more low-level handling of the transformation process, use the PublisherInfo.getTransformerFactory method to create transformers and use streams passed as parameters to transformers.
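
To illustrate what an XSLT-based transform does, the sketch below applies a template to an XML document using only the JDK's javax.xml.transform API. It does not use the Publisher Framework transformer classes, whose API is not shown on this page. The input document shape, the template and the class name XsltTransformSketch are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltTransformSketch {
    // Illustrative template: maps <doc><id/><title/></doc> to a Solr-style <add> message.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/doc\">"
      + "<add><doc>"
      + "<field name=\"id\"><xsl:value-of select=\"id\"/></field>"
      + "<field name=\"title\"><xsl:value-of select=\"title\"/></field>"
      + "</doc></add>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the template to the given XML and returns the result as a string. */
    static String transform(String documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(documentXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers do not have to declare a checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<doc><id>42</id><title>Hello</title></doc>"));
    }
}
```

In a real publisher the template would live in a transform file rather than a string constant, so that users can replace it at runtime as described above.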




HttpClient

HttpClient (HC) is provided by the HttpConnection object. Whenever developing a publisher for REST-based target repositories, consider using this class.

• HC was primarily developed for writing AspireObject documents.
• If required, HC uses transformers for converting AspireObjects before writing.
• HC supports REST-based APIs and can execute GET, PUT, POST and DELETE methods.
• HC also supports streaming, which can be used in batching. For example, the Elasticsearch publisher writes single documents to the HC stream first, and on batch close this stream is posted to Elasticsearch.
• HC can be configured by the HttpProperties object.
• HC configuration is flexible enough to accept changes even after the object is constructed. This makes it possible to reconfigure already created and possibly already pooled objects. For example, we can modify the URL parameter value in a created HC because we need one URL for the normal bulk POST and another for actions like index clean.
• HC supports retry logic configured by specific parameters.
• HC can be configured with an HttpErrorHandler. If this handler is provided, the developer can get information about connection errors or other HTTP errors and react accordingly, either by throwing an exception or by continuing with the retry logic.
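
The interplay between retry logic and an error handler can be sketched as below. SketchErrorHandler and RetrySketch are hypothetical names. The real HttpErrorHandler interface and the retry parameters of HC are not shown on this page, so the signatures here are assumptions.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical handler: returns true to keep retrying, false to give up. */
interface SketchErrorHandler {
    boolean shouldRetry(int attempt, Exception error);
}

class RetrySketch {
    /** Runs the call up to maxAttempts times, consulting the handler after each failure. */
    static <T> T execute(Callable<T> call, int maxAttempts, SketchErrorHandler handler) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (!handler.shouldRetry(attempt, e)) break; // handler decided to stop
            }
        }
        throw new IllegalStateException("Request failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request: fails twice with an I/O error, then succeeds.
        String result = execute(() -> {
            if (++calls[0] < 3) throw new IOException("simulated outage");
            return "ok";
        }, 5, (attempt, error) -> error instanceof IOException);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Keeping the retry decision in a handler lets one publisher retry on transient connection errors while failing fast on errors that a retry cannot fix, such as a rejected request.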



Delete By Query

• If a document with the action “deleteByQuery” arrives in the publisher, the PF takes the appropriate action.
• The query document is first automatically transformed by the configured transformer. The developer must support this transformation in the transformation script. For example, in the case of JSON, a section for it can be introduced with an “if (action == "deleteByQuery")” command. If this section is left empty, the document is considered not to be transformed.
• The developer must then implement the delete by query logic in the PAP class, in the method processDeleteByQuery, by interpreting the syntax of the “deleteByQuery” document.
• On arrival in PAP.processDeleteByQuery(DeleteByQuery), the DeleteByQuery object (the object in which the original “deleteByQuery” document is wrapped) can be translated by the supported Visitor objects into a meaningful string representation. The prepared visitor classes support the delete by query format created by the ArchiveExtractor utility (QueryForArchiveDefaultVisitorImpl). For example, in the Elasticsearch publisher we can create part of an Elasticsearch API command for getting all documents with the same “parentId” published previously, and this way handle all the documents of the archive.
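
The visitor-based translation can be sketched as follows. The query tree (QueryNode, EqualsNode, AndNode) and the visitor interface are hypothetical and much simpler than the real DeleteByQuery and Visitor classes. Only the generated fragments, a term query and a bool/must query, follow the Elasticsearch query DSL.

```java
/** Hypothetical query node; the real DeleteByQuery structure may differ. */
interface QueryNode {
    <R> R accept(QueryVisitor<R> visitor);
}

/** Hypothetical visitor over the query tree. */
interface QueryVisitor<R> {
    R visitEquals(String field, String value);
    R visitAnd(QueryNode left, QueryNode right);
}

class EqualsNode implements QueryNode {
    final String field;
    final String value;

    EqualsNode(String field, String value) {
        this.field = field;
        this.value = value;
    }

    @Override
    public <R> R accept(QueryVisitor<R> visitor) {
        return visitor.visitEquals(field, value);
    }
}

class AndNode implements QueryNode {
    final QueryNode left;
    final QueryNode right;

    AndNode(QueryNode left, QueryNode right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public <R> R accept(QueryVisitor<R> visitor) {
        return visitor.visitAnd(left, right);
    }
}

/** Renders the tree as an Elasticsearch query DSL fragment. Values are not JSON-escaped here. */
class ElasticsearchQueryVisitor implements QueryVisitor<String> {
    @Override
    public String visitEquals(String field, String value) {
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    @Override
    public String visitAnd(QueryNode left, QueryNode right) {
        return "{\"bool\":{\"must\":[" + left.accept(this) + "," + right.accept(this) + "]}}";
    }
}
```

For instance, `new EqualsNode("parentId", "archive-1").accept(new ElasticsearchQueryVisitor())` yields a term query on parentId, which could be embedded in a delete request covering all documents of one archive. A publisher for a different repository would supply its own visitor and leave the query tree untouched.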




Simple File

• The Simple File (SF) publisher comes as part of the PF.
• SF publishes all documents into a single file.
• SF was developed to help developers who are new to the PF and want to learn how to develop and deploy a specific publisher.
• When learning the PF, it is advisable to always build and deploy this publisher first and then, after running a crawl, check the result in the publisher output file.
• Resources/dxf/publisher.xml is an example of how to create a DXF file with publisher-specific parameters.
• Resources/aspire.properties is an example of how to use parameters for merging and hiding the general DXF coming from the PF itself with the specific DXF provided by the publisher.
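
The idea of publishing every document into one file can be sketched with plain JDK file APIs. This is not the SF publisher's code or its output format. The class name SingleFilePublisherSketch, the tab-separated line layout and the method names are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class SingleFilePublisherSketch {
    private final Path output;

    SingleFilePublisherSketch(Path output) {
        this.output = output;
    }

    /** Appends one document as a line; synchronized because publishing threads share the file. */
    synchronized void publish(String action, String id, String content) {
        String line = action + "\t" + id + "\t" + content + System.lineSeparator();
        try {
            Files.write(output, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back everything published so far, e.g. to check the result after a crawl. */
    synchronized List<String> lines() {
        try {
            return Files.readAllLines(output, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each call appends a complete line, the output file can be inspected at any point during a crawl to see which documents have arrived so far.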
