General

Schema


Technically, a publisher is an Aspire workflow component made up of two bundles. The first is provided by a developer and contains the code specific to the target repository where documents coming from connectors are published. The second is a generic bundle called the Publisher Framework (PF), with classes, interfaces and configurations shared among all publishers. The following describes how the process works.

  1. The publisher developer creates an OSGi bundle containing mainly an implementation of the PublisherAccessProvider (PAP) interface with the code specific to the target repository, e.g. Elasticsearch or Solr.
    • The developer must implement PAP methods like processAddUpdate and processDelete for publishing documents coming from connectors.
    • When loading this bundle, Aspire also loads another bundle called the Publisher Framework, with classes common to all publishers, e.g. the PublisherControllerImpl class.
    • The developer's bundle also contains the DXF file for the specific configuration parts, while the Publisher Framework bundle contains the DXF file with general configuration that can be used by all publishers.
  2. When loaded, the publisher is initialized with DXF properties and contains one instance of the PublisherControllerImpl class and one instance of the PublisherInfo class, both shared among all connector processing threads.
    • PublisherControllerImpl holds one instance of the PAP class, shared among all the publishing threads.
    • PublisherControllerImpl holds one PublisherInfo object, created by PAP when the specific publisher is loaded.
    • PublisherInfo contains the values of all configuration parameters provided by the user in the DXF form.
    • PublisherControllerImpl handles the connection pool of PublisherRepositoryConnection implementation objects.
  3. When an Aspire crawl starts, the Publisher Framework PublisherControllerImpl object is the first point where all documents coming from connectors arrive. From there they are propagated to other Publisher Framework objects, mainly PAP, to be published with the help of Publisher Framework connection objects.
  4. Connection objects are created by the PublisherConnectionController implementation.
    • PublisherInfo provides the PublisherConnectionController implementation object.
    • Connection objects must be able to authenticate to the target repository with the provided credentials.
    • PAP uses connection objects when writing data into target repositories.
    • Some general connection classes are provided by the Publisher Framework, like HttpConnection for REST. Others must be provided by the developer.
  5. PublisherControllerImpl also handles batching.
    • Batches are the means for grouping documents into larger units before sending them to the target repository.
    • The batch normally requests a connection object from the pool and releases it after the batch close method is issued.
    • Several threads can participate in the same batch, so connections must be thread safe.
    • PublisherControllerImpl creates Aspire standard ComponentBatch objects based on information in incoming jobs.
    • PublisherBatch objects are then created by ComponentBatch objects to be passed to PAP methods.
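
The division of work between the controller and the PAP can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: SketchAccessProvider, InMemoryProvider and SketchController are hypothetical stand-ins, documents are plain maps rather than AspireObjects, and the real PublisherAccessProvider and PublisherControllerImpl signatures are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the PAP contract; the real
// PublisherAccessProvider interface has its own signatures.
interface SketchAccessProvider {
    void processAddUpdate(Map<String, String> doc);
    void processDelete(Map<String, String> doc);
}

// Repository-specific part that a publisher developer would supply.
// Methods are synchronized because one instance is shared by all threads.
class InMemoryProvider implements SketchAccessProvider {
    final List<String> log = new ArrayList<>();

    @Override
    public synchronized void processAddUpdate(Map<String, String> doc) {
        log.add("index " + doc.get("id"));
    }

    @Override
    public synchronized void processDelete(Map<String, String> doc) {
        log.add("delete " + doc.get("id"));
    }
}

// Framework-side entry point: one controller holding one shared provider.
class SketchController {
    private final SketchAccessProvider provider;

    SketchController(SketchAccessProvider provider) {
        this.provider = provider;
    }

    // Routes an incoming document to the provider according to its action.
    void process(Map<String, String> doc) {
        String action = doc.getOrDefault("action", "add");
        switch (action) {
            case "add":
            case "update":
                provider.processAddUpdate(doc);
                break;
            case "delete":
                provider.processDelete(doc);
                break;
            default:
                throw new IllegalArgumentException("Unsupported action: " + action);
        }
    }
}
```

The point of the split is that only InMemoryProvider would change from one target repository to another, while the dispatching code stays the same for every publisher.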
Batching

Batches are configured in the connector configuration and the Publisher Framework respects these settings. If no batching is defined, the Publisher Framework creates a one-time batch containing only one document.

At the publisher level, a developer can choose among the batch types BUFFER, STREAM and NONE:

  • For the STREAM batch type, the Publisher Framework gets a connection from the pool on batch start and keeps passing this connection to PAP methods for the whole duration of the batch.
    • The connection is released when the batch is closed.
  • For the BUFFER batch type, the connection is claimed from the pool at the beginning of batch close, passed to the PAP endBatch method and released afterwards.
    • This means that the developer should buffer all documents for the duration of the batch. For this purpose, a batch data buffer is available in the PublisherBatch object.
  • The Publisher Framework also supports so-called multi-server batches.
    • The batch factory creates this kind of batch when more than one URL is provided in the configuration.
    • The purpose is to support publishing documents to several servers.
    • Broadcasting and round robin are supported.
  • A BatchAdapter object is available in PublisherBatch.
    • This object can be used for reporting errors and other messages to the Aspire framework.
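
The difference between the STREAM and BUFFER connection lifecycles can be sketched as follows. This is an illustration under stated assumptions, not Publisher Framework code: Connection, Pool and Batch are hypothetical classes, documents are plain strings, and the pool holds a single connection so that claiming and releasing are easy to observe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BatchLifecycleSketch {
    enum BatchType { STREAM, BUFFER }

    // Hypothetical connection that records what was sent through it.
    static class Connection {
        final List<String> sent = new ArrayList<>();
    }

    // Minimal single-threaded pool, enough to make claim/release visible.
    static class Pool {
        private final Deque<Connection> idle = new ArrayDeque<>();

        Pool(Connection connection) {
            idle.push(connection);
        }

        Connection claim() {
            if (idle.isEmpty()) {
                throw new IllegalStateException("no idle connection");
            }
            return idle.pop();
        }

        void release(Connection connection) {
            idle.push(connection);
        }

        int idleCount() {
            return idle.size();
        }
    }

    static class Batch {
        private final BatchType type;
        private final Pool pool;
        private final List<String> buffer = new ArrayList<>();
        private Connection connection;

        Batch(BatchType type, Pool pool) {
            this.type = type;
            this.pool = pool;
            if (type == BatchType.STREAM) {
                connection = pool.claim(); // STREAM: claimed on batch start
            }
        }

        void add(String doc) {
            if (type == BatchType.STREAM) {
                connection.sent.add(doc);  // written straight to the connection
            } else {
                buffer.add(doc);           // BUFFER: kept until close
            }
        }

        void close() {
            if (type == BatchType.BUFFER) {
                connection = pool.claim(); // BUFFER: claimed only at close
                connection.sent.addAll(buffer);
                buffer.clear();
            }
            pool.release(connection);      // both types release on close
            connection = null;
        }
    }
}
```

With a STREAM batch the pool is empty between start and close, whereas a BUFFER batch leaves the connection available to other batches until its own close begins.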


Transformers

Transformers are used for transforming AspireObjects coming in jobs into the String representation required by the target repository. For example, when publishing to Elasticsearch, you need to create a JSON structure of the Aspire document.

The Publisher Framework supports XML, JSON and simple String transformers.

  • Transformers are configured by specifying a transform file: a Groovy script for the JSON transformer or an XSLT template for the XML transformer.
  • Transform files are typically provided by the developer of the specific publisher.
    • For example, the Elasticsearch publisher bundle is pre-packaged with a transform.groovy script.
    • At run time, users can configure the publisher with their own transform file.
  • Transformer functionality can be used by calling the PublisherInfo.transform(AspireObject doc) method, which produces a string result of the transformation.

Note: For lower-level handling of the transformation process, use the PublisherInfo.getTransformerFactory method to create transformers and use streams passed as parameters to transformers.
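
As a rough picture of what a transformer does, the sketch below turns a flat document into a JSON string and into a plain key=value string. It is an assumption-laden illustration: the Transformer interface and both implementations are hypothetical, a Map stands in for AspireObject, and the hand-written escaping covers only quotes and backslashes. In the Publisher Framework itself the mapping comes from the configured Groovy or XSLT transform file.

```java
import java.util.Map;
import java.util.StringJoiner;

class TransformSketch {
    // Hypothetical transformer contract: document in, string out.
    interface Transformer {
        String transform(Map<String, String> doc);
    }

    // Escapes backslashes and double quotes only; control characters
    // are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Renders the document as a flat JSON object of string fields.
    static final Transformer JSON = doc -> {
        StringJoiner fields = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            fields.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
        }
        return fields.toString();
    };

    // Renders the document as one key=value pair per line.
    static final Transformer PLAIN = doc -> {
        StringJoiner lines = new StringJoiner("\n");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            lines.add(e.getKey() + "=" + e.getValue());
        }
        return lines.toString();
    };
}
```

Keeping the output format behind a single transform call is what lets the same publishing code serve repositories that expect different representations.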


HttpClient

HttpClient is provided by the HttpConnection object.

When developing a publisher for REST-based target repositories, consider using this class.

  • HttpClient was primarily developed for writing AspireObject documents.
  • If required, HttpClient uses transformers for converting AspireObjects before writing.
  • HttpClient supports REST-based APIs and can execute the GET, PUT, POST and DELETE methods.
  • HttpClient also supports streaming.
    • This can be used in batching. For example, the Elasticsearch publisher writes single documents to the HttpClient stream first. Then, on batch close, this stream is posted to Elasticsearch.
  • HttpClient can be configured by the HttpProperties object.
  • HttpClient configuration is flexible enough to accept changes even after the object is constructed.
    • This opens possibilities for reconfiguring already created and possibly already pooled objects.
    • For example, we may need to modify a URL parameter value in an already created configuration because we need different URLs for bulk POST and for other actions such as index clean.
  • HttpClient supports retry logic configured by specific parameters.
  • HttpClient can be configured by HttpErrorHandler.
    • If this handler is provided, the developer can get information about possible connection errors or other HTTP-related errors and act accordingly, either by throwing an exception or by continuing with the retry logic.
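
The interplay between retry logic and an error handler can be sketched like this. The sketch is hypothetical throughout: ErrorHandler, Decision and execute are illustrative names, not the HttpErrorHandler API, and a Supplier stands in for the HTTP request so that the example stays self-contained.

```java
import java.util.function.Supplier;

class RetrySketch {
    enum Decision { RETRY, FAIL }

    // Hypothetical handler: inspects the error and decides how to proceed.
    interface ErrorHandler {
        Decision onError(int attempt, RuntimeException error);
    }

    // Runs the request up to maxAttempts times. After each failure the
    // handler chooses between continuing with the retry logic and failing
    // immediately by rethrowing the error.
    static <T> T execute(Supplier<T> request, int maxAttempts, ErrorHandler handler) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (RuntimeException e) {
                last = e;
                if (handler.onError(attempt, e) == Decision.FAIL) {
                    throw e;
                }
            }
        }
        throw last; // all attempts used up
    }
}
```

A handler might, for instance, return RETRY for connection-level failures and FAIL for errors that a second attempt cannot fix, such as a rejected request.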



Delete By Query

If a document with the action “deleteByQuery” arrives in a publisher, the Publisher Framework takes the appropriate action.

  1. The query document is first automatically transformed by the configured transformer, if transformation is configured for the publisher.
  2. You must support the transformation in a transformation script.
    • For example, in JSON, introduce a section with the “if (action == "deleteByQuery")” command.
    • Leave this section empty if the “deleteByQuery” document should not be transformed.
  3. In the PAP class, implement the delete by query logic in the method processDeleteByQuery by interpreting the syntax of the “deleteByQuery” document.
    • When arriving in PAP.processDeleteByQuery(DeleteByQuery), the DeleteByQuery object (the object in which the original “deleteByQuery” document is wrapped) can be translated by the supported Visitor objects into a meaningful string representation.
    • The prepared visitor classes support the delete by query format created by the ArchiveExtractor utility (QueryForArchiveDefaultVisitorImpl).
    • For example, in the Elasticsearch publisher, you can create part of an Elasticsearch REST request to get all documents with the same “parentId” published previously, and in this way handle deletion of all the documents from the archive.
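
How a visitor can translate a query object into a repository-specific string is sketched below. The query model (Query, Equals, And) and ElasticsearchVisitor are hypothetical and much simpler than the real DeleteByQuery object and its visitor classes; only the term and bool/must shapes of the generated body follow the Elasticsearch Query DSL. Field names and values are not escaped in this sketch.

```java
class DeleteByQuerySketch {
    // Hypothetical query model; the real DeleteByQuery object differs.
    interface Query {
        <R> R accept(Visitor<R> visitor);
    }

    interface Visitor<R> {
        R visitEquals(String field, String value);
        R visitAnd(Query left, Query right);
    }

    static final class Equals implements Query {
        final String field;
        final String value;

        Equals(String field, String value) {
            this.field = field;
            this.value = value;
        }

        @Override
        public <R> R accept(Visitor<R> visitor) {
            return visitor.visitEquals(field, value);
        }
    }

    static final class And implements Query {
        final Query left;
        final Query right;

        And(Query left, Query right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public <R> R accept(Visitor<R> visitor) {
            return visitor.visitAnd(left, right);
        }
    }

    // Renders each node as an Elasticsearch-style query clause.
    static final class ElasticsearchVisitor implements Visitor<String> {
        @Override
        public String visitEquals(String field, String value) {
            return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
        }

        @Override
        public String visitAnd(Query left, Query right) {
            return "{\"bool\":{\"must\":[" + left.accept(this) + "," + right.accept(this) + "]}}";
        }
    }

    // Wraps the translated clause into a request body.
    static String toRequestBody(Query query) {
        return "{\"query\":" + query.accept(new ElasticsearchVisitor()) + "}";
    }
}
```

Supporting another target repository would then mean writing another visitor, while the query model and the code that builds it stay untouched.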



Simple File

The Simple File publisher is a "Hello world!" part of the Publisher Framework, intended for learning purposes only.

  • Simple File publishes all documents into a single file.
  • Simple File was developed to help developers who are new to the Publisher Framework and want to learn how to develop and deploy a specific publisher.
  • When learning the Publisher Framework, always build and deploy this publisher first. Then, after running the crawl, check the result in the publisher output file.
  • Resources/dxf/publisher.xml is an example of how to create a DXF file with publisher-specific parameters.
  • Resources/aspire.properties is an example of how to use parameters to merge and hide the general DXF coming from the Publisher Framework with the specific DXF provided by the publisher.
