scheme to publisher

The publisher developer provides an OSGi bundle that mainly contains an implementation of the PublisherAccessProvider (PAP) interface with the code specific to the target repository, e.g. Elasticsearch or Solr. The developer must implement PAP methods like processAddUpdate and processDelete for publishing documents coming from connectors. When loading this bundle, Aspire also loads another bundle called Publisher Framework with classes common to all publishers, e.g. the PublisherControllerImpl class. The developer's bundle also contains the DXF file for the publisher-specific configuration, while the Publisher Framework bundle contains the DXF file with general configuration that can be used by all publishers.
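As a rough illustration of the developer's side of this contract, the sketch below implements the two method names mentioned above against an in-memory map. The AccessProvider interface, its String parameters and the InMemoryProvider class are hypothetical stand-ins chosen for this example; the real PublisherAccessProvider signatures are not shown on this page and will differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: AccessProvider is a simplified stand-in, NOT the real
// PublisherAccessProvider interface from the Publisher Framework.
class PapSketch {

    /** Simplified stand-in exposing the two method names mentioned in the text. */
    interface AccessProvider {
        void processAddUpdate(String id, String content);
        void processDelete(String id);
    }

    /** Toy target "repository" kept in memory so the example runs without a server. */
    static class InMemoryProvider implements AccessProvider {
        private final Map<String, String> index = new LinkedHashMap<>();

        // A single provider instance may be called from several threads,
        // so access to the shared map is synchronized.
        @Override
        public synchronized void processAddUpdate(String id, String content) {
            index.put(id, content); // add a new document or overwrite an existing one
        }

        @Override
        public synchronized void processDelete(String id) {
            index.remove(id);
        }

        /** Returns a copy of the current content for inspection. */
        synchronized Map<String, String> snapshot() {
            return new LinkedHashMap<>(index);
        }
    }

    public static void main(String[] args) {
        InMemoryProvider provider = new InMemoryProvider();
        provider.processAddUpdate("doc-1", "first version");
        provider.processAddUpdate("doc-1", "second version");
        provider.processAddUpdate("doc-2", "other document");
        provider.processDelete("doc-2");
        System.out.println(provider.snapshot());
    }
}
```

A real implementation would write to the target repository through a connection object instead of a map.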

During loading, the publisher is initialized with the DXF properties. When loaded, the publisher contains one instance of the PublisherControllerImpl class and one instance of the PublisherInfo class, shared among all connector processing threads. PublisherControllerImpl holds one instance of the PAP class, also shared among all processing threads. PublisherControllerImpl holds one PublisherInfo object created by PAP when the specific publisher is loaded. PublisherInfo contains the values of all configuration parameters provided by the user in the DXF form. PublisherControllerImpl manages a connection pool of PublisherRepositoryConnection implementation objects.

When an Aspire crawl starts, the Publisher Framework PublisherControllerImpl object is the first point where all documents coming from connectors arrive. They are then propagated to other Publisher Framework objects, mainly PAP, to be published with the help of Publisher Framework connection objects.
Connection objects are created by the PublisherConnectionController implementation. PublisherInfo provides the PublisherConnectionController implementation object. Connection objects must be able to authenticate to the target repository with the provided credentials. PAP uses connection objects when writing data into target repositories. Some general connection classes, like HttpConnection for REST, are provided by the Publisher Framework. Others must be provided by the developer.

PublisherControllerImpl also handles batching. Batches are the means for grouping documents into larger units before sending them to the target repository. A batch normally requests a connection object from the pool and releases it after the batch close method is called. Several threads can participate in the same batch, so connections must be thread-safe. PublisherControllerImpl creates standard Aspire ComponentBatch objects based on information in incoming jobs. PublisherBatch objects are then created by ComponentBatch objects to be passed to PAP methods.


Batching

Batches are configured in the connector configuration, and the Publisher Framework respects these settings. If no batching is defined, the Publisher Framework creates a one-time batch with only one document included.

At the publisher level, the developer can choose among the batch types BUFFER, STREAM and NONE.

For the STREAM batch type, the Publisher Framework gets a connection from the pool on batch start and keeps passing this connection to PAP methods for the whole batch. The connection is released when the batch is closed.

For the BUFFER batch type, the connection is claimed from the pool at the beginning of batch close, passed to the PAP endBatch method and released afterwards. This means that the developer should buffer all documents during the batch. For this purpose, a batch data buffer is available in the PublisherBatch object.
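The difference between the STREAM and BUFFER connection lifecycles can be sketched with a toy pool that records events. Everything here (the Pool class, runStream, runBuffer, the event strings) is a hypothetical illustration of the ordering described above, not Publisher Framework code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of when a pooled connection is borrowed and released
// for the two batch types. None of these classes belong to the framework.
class BatchLifecycleSketch {

    /** Trivial pool that only records what happens to it. */
    static class Pool {
        final List<String> events = new ArrayList<>();

        String borrow() {
            events.add("borrow");
            return "conn";
        }

        void release(String connection) {
            events.add("release");
        }
    }

    /** STREAM: borrow at batch start, send each document, release on close. */
    static List<String> runStream(Pool pool, List<String> docs) {
        String connection = pool.borrow();
        try {
            for (String doc : docs) {
                pool.events.add("send:" + doc);
            }
        } finally {
            pool.release(connection);
        }
        return pool.events;
    }

    /** BUFFER: collect documents first, borrow only when the batch is closed. */
    static List<String> runBuffer(Pool pool, List<String> docs) {
        List<String> buffer = new ArrayList<>();
        for (String doc : docs) {
            buffer.add(doc);
            pool.events.add("buffer:" + doc);
        }
        String connection = pool.borrow();
        try {
            pool.events.add("flush:" + String.join(",", buffer));
        } finally {
            pool.release(connection);
        }
        return pool.events;
    }

    public static void main(String[] args) {
        System.out.println(runStream(new Pool(), Arrays.asList("a", "b")));
        System.out.println(runBuffer(new Pool(), Arrays.asList("a", "b")));
    }
}
```

With BUFFER the connection is held only for the final flush, which keeps pooled connections free while documents are still arriving; STREAM holds one connection for the whole batch but does not need to keep documents in memory.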

The Publisher Framework also supports multi-server batches. The batch factory creates this kind of batch when more than one URL is provided in the configuration. The purpose is to support publishing documents to several servers. Broadcasting and round robin are supported.
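To make the two distribution modes concrete, here is a minimal sketch of how targets could be chosen from a list of configured URLs. The MultiServerSketch class and its method names are assumptions made for this example and do not reflect the actual batch factory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of target selection for multi-server publishing.
class MultiServerSketch {
    private final List<String> urls;
    private final AtomicInteger next = new AtomicInteger();

    MultiServerSketch(List<String> urls) {
        if (urls.isEmpty()) {
            throw new IllegalArgumentException("at least one URL is required");
        }
        this.urls = new ArrayList<>(urls);
    }

    /** Broadcast: every batch is sent to every configured server. */
    List<String> broadcastTargets() {
        return new ArrayList<>(urls);
    }

    /** Round robin: each batch goes to the next server in turn (thread-safe). */
    String roundRobinTarget() {
        // floorMod keeps the index valid even after the counter overflows
        int index = Math.floorMod(next.getAndIncrement(), urls.size());
        return urls.get(index);
    }

    public static void main(String[] args) {
        MultiServerSketch servers =
                new MultiServerSketch(Arrays.asList("http://es1:9200", "http://es2:9200"));
        System.out.println(servers.broadcastTargets());
        System.out.println(servers.roundRobinTarget());
        System.out.println(servers.roundRobinTarget());
        System.out.println(servers.roundRobinTarget());
    }
}
```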

There is a BatchAdapter object available in PublisherBatch. This object can be used for reporting errors and other messages to the Aspire framework.


Transformers

Transformers are used for transforming AspireObjects coming in jobs into the String representation required by the target repository. For example, when publishing to Elasticsearch, you need to create a JSON structure of the Aspire document.

The Publisher Framework supports XML, JSON and simple String transformers. Transformers are configured by specifying a transform file: a Groovy script for the JSON transformer or an XSLT template for the XML transformer. Transform files are typically provided by the developer of the specific publisher. For example, the Elasticsearch publisher bundle is pre-packed with a transform.groovy script. At run time, users can configure the publisher with their own transform file. Transformer functionality can be used by calling the PublisherInfo.transform(AspireObject doc) method, which produces a string result of the transformation.

Note: For more low-level handling of the transformation process, use the PublisherInfo.getTransformerFactory method to create transformers and pass streams to them as parameters.


HttpClient

HttpClient is provided by the HttpConnection object. When developing a publisher for REST-based target repositories, consider using this class.

HttpClient was primarily developed for writing AspireObject documents. If required, HttpClient uses transformers to convert AspireObjects before writing. HttpClient supports REST-based APIs and can execute GET, PUT, POST and DELETE methods. HttpClient also supports streaming, which can be used in batching. For example, the Elasticsearch publisher first writes single documents to the HttpClient stream; then, on batch close, this stream is posted to Elasticsearch.

HttpClient can be configured by the HttpProperties object. The HttpClient configuration is flexible enough to accept changes even after the object is constructed. This makes it possible to reconfigure already created, and possibly already pooled, objects. For example, you may need to modify a URL parameter value in an already created configuration because different URLs are needed for bulk POST and for other actions such as index clean.

HttpClient supports retry logic configured by specific parameters. HttpClient can also be configured with an HttpErrorHandler. If this handler is provided, the developer can get information about connection errors or other HTTP-related errors and act accordingly, either by throwing an exception or by continuing with the retry logic.
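The interplay between retries and an error handler can be sketched as a small retry loop. The ErrorHandler interface, callWithRetry and the maxAttempts parameter below are hypothetical names for illustration only; they are not the HttpErrorHandler API or the actual retry parameters.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of retry logic driven by an error handler.
class RetrySketch {

    /** Return true to continue retrying, false to stop, or throw to abort at once. */
    interface ErrorHandler {
        boolean onError(int attempt, IOException error) throws IOException;
    }

    static <T> T callWithRetry(Callable<T> request, int maxAttempts, ErrorHandler handler)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e;
                // The handler sees every failure and decides how to react.
                if (!handler.onError(attempt, e)) {
                    break;
                }
            }
        }
        throw new IOException("request failed, giving up", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request that fails twice before succeeding.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection refused");
            }
            return "ok";
        }, 5, (attempt, error) -> true);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Only IOException is retried in this sketch; any other exception propagates to the caller immediately.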


Delete By Query

If a document with the action “deleteByQuery” arrives in a publisher, the Publisher Framework takes an appropriate action. The query document is first automatically transformed by the configured transformer if transformation is configured for the publisher. You must support the transformation in the transformation script. For example, in the script for the JSON transformer, introduce a section with the “if (action == "deleteByQuery")” command. Leave this section empty if the deleteByQuery document should not be transformed.

In the PAP class, implement the delete by query logic in the processDeleteByQuery method by interpreting the syntax of the “deleteByQuery” document. When it arrives in PAP.processDeleteByQuery(DeleteByQuery), the DeleteByQuery object (which wraps the original “deleteByQuery” document) can be translated by the supported Visitor objects into a meaningful string representation. The prepared visitor classes support the delete by query format created by the ArchiveExtractor utility (QueryForArchiveDefaultVisitorImpl). For example, in the Elasticsearch publisher, you can create part of an Elasticsearch REST request to get all previously published documents with the same “parentId”, and in this way handle deletion of all the documents from the archive.
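The visitor-based translation can be illustrated with a tiny query model that is rendered into an Elasticsearch-style query fragment. The Query and Visitor interfaces, the term and and helpers and the ElasticsearchVisitor class are all hypothetical; the real DeleteByQuery and QueryForArchiveDefaultVisitorImpl classes have their own structure, which is not shown on this page.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a query tree translated to a string by a visitor.
class DeleteByQuerySketch {

    interface Visitor<R> {
        R visitTerm(String field, String value);
        R visitAnd(List<Query> parts);
    }

    interface Query {
        <R> R accept(Visitor<R> visitor);
    }

    /** Leaf node: field must equal value. */
    static Query term(final String field, final String value) {
        return new Query() {
            @Override
            public <R> R accept(Visitor<R> visitor) {
                return visitor.visitTerm(field, value);
            }
        };
    }

    /** Inner node: all parts must match. */
    static Query and(final Query... parts) {
        return new Query() {
            @Override
            public <R> R accept(Visitor<R> visitor) {
                return visitor.visitAnd(Arrays.asList(parts));
            }
        };
    }

    /** Renders the tree as an Elasticsearch-style term/bool query fragment. */
    static class ElasticsearchVisitor implements Visitor<String> {
        @Override
        public String visitTerm(String field, String value) {
            return "{\"term\":{\"" + escape(field) + "\":\"" + escape(value) + "\"}}";
        }

        @Override
        public String visitAnd(List<Query> parts) {
            StringBuilder json = new StringBuilder("{\"bool\":{\"must\":[");
            for (int i = 0; i < parts.size(); i++) {
                if (i > 0) {
                    json.append(',');
                }
                json.append(parts.get(i).accept(this));
            }
            return json.append("]}}").toString();
        }

        // Minimal escaping of backslashes and quotes only; not a full JSON encoder.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    public static void main(String[] args) {
        Query query = term("parentId", "archive-1");
        System.out.println(query.accept(new ElasticsearchVisitor()));
    }
}
```

Keeping the translation in a visitor means another target repository can reuse the same query model with its own visitor.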


Simple File

The Simple File publisher is a "Hello world!" part of the Publisher Framework, intended for learning purposes only. Simple File publishes all documents into a single file. It was developed to help developers who are new to the Publisher Framework and want to learn how to develop and deploy a specific publisher. When learning the Publisher Framework, always build and deploy this publisher first; then, after running a crawl, check the result in the publisher output file.

Resources/dxf/publisher.xml is an example of how to create a DXF file with publisher-specific parameters.

Resources/aspire.properties is an example of how to use parameters to merge and hide the general DXF coming from the Publisher Framework with the specific DXF provided by the publisher.
