Documentation Under construction


 

SSL Configuration

Secure access can be enabled in the Staging Repository so that the REST APIs are reachable only over HTTPS connections, with client certificates used to authenticate and authorize each request.

To enable secure access, the application requires a valid server certificate and private key for the server hosting the Staging Repository (registered to the name of the server) and a certificate authority (CA) certificate.

In the configuration, specify the location of the certificate files and the passphrase using the properties keyLocation, certLocation, caLocation and passphrase.

Clients accessing the application REST APIs through HTTPS need a valid client certificate issued by the configured CA. To restrict access to specific client certificates from the CA, add the Common Name (CN) of each allowed client certificate to the list property authList in the configuration.

authList :[
    'Aspire',
    'testuser'
]

CRDP-65: Option to allow all valid client certificates to access by adding '*' to the authList.
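The authList check can be illustrated with a minimal sketch. The function name isClientAuthorized is hypothetical and not part of the Staging Repository; the sketch assumes the certificate object has the shape returned by Node's getPeerCertificate() and that '*' behaves as described in CRDP-65.

```javascript
// Hypothetical helper, not the Staging Repository implementation.
// peerCert is assumed to look like the object returned by
// tls.TLSSocket#getPeerCertificate(), e.g. { subject: { CN: 'testuser' } }.
function isClientAuthorized(peerCert, authList) {
    if (!peerCert || !peerCert.subject || !peerCert.subject.CN) {
        return false; // no certificate presented, or no Common Name
    }
    // Assumption: '*' admits every certificate already validated by the CA.
    if (authList.indexOf('*') !== -1) {
        return true;
    }
    return authList.indexOf(peerCert.subject.CN) !== -1;
}

const authList = ['Aspire', 'testuser'];
console.log(isClientAuthorized({ subject: { CN: 'testuser' } }, authList));
console.log(isClientAuthorized({ subject: { CN: 'intruder' } }, authList));
```

In a Node https server created with requestCert enabled, the certificate would typically be read from req.socket.getPeerCertificate() before calling a check like this one.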

Encryption and Key Managers

All content stored in the Staging Repository is encrypted with the NodeJS crypto library. Content is encrypted with the aes-256-cbc algorithm using an Initialization Vector (IV) and a Data Encryption Key (DEK) provided by a pluggable key manager.
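As an illustration of this scheme, the following is a minimal sketch of aes-256-cbc encryption with the NodeJS crypto library. The function names, the random DEK and the base64 storage format are assumptions made for the example; the actual storage layout of the Staging Repository is not shown here.

```javascript
const crypto = require('crypto');

const ALGORITHM = 'aes-256-cbc';

// Encrypts a string with a 32-byte DEK and a fresh random 16-byte IV.
function encryptContent(plainText, dek) {
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipheriv(ALGORITHM, dek, iv);
    const data = Buffer.concat([cipher.update(plainText, 'utf8'), cipher.final()]);
    // The IV is not secret, so it is kept next to the encrypted data.
    return { iv: iv.toString('base64'), data: data.toString('base64') };
}

// Reverses encryptContent using the stored IV and the same DEK.
function decryptContent(encrypted, dek) {
    const iv = Buffer.from(encrypted.iv, 'base64');
    const decipher = crypto.createDecipheriv(ALGORITHM, dek, iv);
    const plain = Buffer.concat([
        decipher.update(Buffer.from(encrypted.data, 'base64')),
        decipher.final()
    ]);
    return plain.toString('utf8');
}

// In the real system the DEK would come from the key manager.
const dek = crypto.randomBytes(32);
const stored = encryptContent(JSON.stringify({ title: 'test content' }), dek);
console.log(decryptContent(stored, dek));
```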

The Staging Repository provides 3 key manager implementations:

  • Basic: uses a single master DEK set as a configurable parameter.

Configuration:

keyManager:{
    type:'basic',
     basic:{
         masterKey:'MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMzI='
     }
 }
  • File Based Master key: uses a file containing a list of master keys that encrypt the DEKs used to encrypt content. There is a finite, configurable number of DEKs per Storage Unit, stored in a Mongo database (DEK). The DEK table stores the encrypted DEK, the version of the master key and the IV used to encrypt the DEK. The Master Key file location is set as a configurable parameter of this key manager.

File Example:

MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTE=   9
MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTI    5
MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTM    7

Configuration:

keyManager:{
    type:'filebased',
    keysNumber: 1000,
    filebased:{
        masterKeyLocation: 'config/MasterKey.txt'
    }
}
  • Hadoop KMS: uses the Hadoop Key Management Server (KMS) for DEK encryption. Starting from a master key held in KMS, the key manager generates new keys that are used to encrypt the DEKs. There is a finite, configurable number of DEKs per Storage Unit, stored in a Mongo database (DEK). The DEK table stores the encrypted DEK, the IV, the master key and the proxy key/IV pair from KMS that were used to encrypt the DEK.

Configuration:

keyManager:{
    type:'clouderakms',
    keysNumber: 1000,
    clouderakms:{
        masterKey:'master_key_1',
        server: 'server-name',
        port: '16000',
        user: 'hdfs',
        sslEnabled: true,
        sslOptions: {
            keyLocation: './config/sslcerts/kms/sr_client_key.pem',
            certLocation: './config/sslcerts/kms/sr_client_cert.crt',
            caLocation: './config/sslcerts/kms/cacert.pem',
            passphrase: 'sibiu7$',
            requestCert: true,
            rejectUnauthorized: true
        }
    }
}
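The file based and KMS key managers both encrypt DEKs with a master key before storing them. A minimal sketch of that idea follows. It assumes that each line of the master key file holds a base64 key followed by its version, and that DEKs are wrapped with aes-256-cbc; neither detail is confirmed above, and all function and field names are hypothetical.

```javascript
const crypto = require('crypto');

// Assumption: each line is "<base64 master key> <version>".
function parseMasterKeys(fileText) {
    const keys = {};
    fileText.split(/\r?\n/).forEach(function (line) {
        const parts = line.trim().split(/\s+/);
        if (parts.length === 2) {
            keys[parts[1]] = Buffer.from(parts[0], 'base64');
        }
    });
    return keys;
}

// Encrypts a DEK with a master key; the result mirrors a row of the DEK table.
function wrapDek(dek, masterKey, version) {
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipheriv('aes-256-cbc', masterKey, iv);
    const encrypted = Buffer.concat([cipher.update(dek), cipher.final()]);
    return {
        encryptedDek: encrypted.toString('base64'),
        iv: iv.toString('base64'),
        version: version
    };
}

// Recovers the DEK using the master key version recorded in the row.
function unwrapDek(row, masterKeys) {
    const decipher = crypto.createDecipheriv(
        'aes-256-cbc', masterKeys[row.version], Buffer.from(row.iv, 'base64'));
    return Buffer.concat([
        decipher.update(Buffer.from(row.encryptedDek, 'base64')),
        decipher.final()
    ]);
}

const masterKeys = parseMasterKeys('MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTE=   9');
const dek = crypto.randomBytes(32);
const row = wrapDek(dek, masterKeys['9'], '9');
console.log(unwrapDek(row, masterKeys).equals(dek));
```

Storing the master key version with each wrapped DEK is what would allow older DEKs to stay readable after a new master key is added to the file.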

To create a custom key manager see.

Content Compression

The Staging Repository provides the option, per Storage Unit, to compress the content that will be stored in the database. Compression uses the zlib library from NodeJS.

Content is compressed at the scope level and an isCompressed tag is added to each compressed content scope.

The option can be enabled/disabled per Storage Unit through the administration API.

GET admin/enableContentCompression/<storage-unit>/<true-false>
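A minimal sketch of scope-level compression with zlib is shown below. The choice of deflate, the base64 encoding and the shape of the stored object are assumptions for illustration; only the isCompressed tag comes from the description above.

```javascript
const zlib = require('zlib');

// Compresses the content of one scope and tags it as compressed.
function compressScope(scopeContent) {
    const json = JSON.stringify(scopeContent);
    const compressed = zlib.deflateSync(Buffer.from(json, 'utf8'));
    return { isCompressed: true, data: compressed.toString('base64') };
}

// Restores the scope content; uncompressed scopes are returned untouched.
function decompressScope(stored) {
    if (!stored.isCompressed) {
        return stored.data;
    }
    const json = zlib.inflateSync(Buffer.from(stored.data, 'base64')).toString('utf8');
    return JSON.parse(json);
}

const scope = { url: 'file:///server/myfolder/file1.txt', content: 'test content' };
const storedScope = compressScope(scope);
console.log(storedScope.isCompressed, decompressScope(storedScope));
```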

Content Processing

Each Storage Unit has a set of events that are triggered by the different actions performed on it. Content processing modules can be configured to run on these events. A content processing module is a JavaScript file that implements one or more events.

There are two different types of events: per document events and general events.

Per document events are triggered for each record content scope that is added, updated, deleted or fetched.

  • PreAdd: this event is triggered before the content scope is stored (added or updated) in the Storage Unit.
  • Process: this event is triggered after the content scope is stored (added or updated) in the Storage Unit. This event is also triggered when a record is reprocessed (see Reprocess API).
  • PreDelete: this event is triggered before the content scope is deleted from the Storage Unit.
  • PostDelete: this event is triggered after the content scope is deleted from the Storage Unit.
  • Fetch: this event is triggered after the content scope is fetched from the Storage Unit.
  • User Defined Document Events: Users can define other types of document events by invoking a transaction/execute or transaction/batch call with a custom action (see Transaction API) and a reference to the record key.

General events are triggered directly by the Storage Unit when specific operations occur.

  • BatchStart: this event is triggered when a new batch is created during content ingestion or content reprocessing. When a batch is created, a batch variable is added to the execution context, which can be accessed by the per document events of the records in that batch.
  • BatchEnd: this event is triggered when a batch is completed during content ingestion or content reprocessing.
  • User Defined General Events: Users can define other types of general events by invoking a transaction/execute or transaction/batch call with a custom action (see Transaction API) and no record key.

The content processing JavaScript modules are placed inside the processing_modules folder of the Staging Repository server. Each module can implement one or more of the event functions. The name of each function must be the name of the event it implements. A module can have functions for both general and per document events.

Per document events receive five parameters:

  • key: The id of the record.
  • content: A Javascript Object with the content of the scope of the record that is being processed.
  • context: A Javascript Object with configuration variables and references to utility functions.
  • settings: A Javascript Object with the available configuration properties for the module.
  • callback: A callback function that should be called when the per document event function completes its execution. Callback return parameters are: err, content, context.
exports.Process = function(key, content, context, settings, callback){
    if (context.isBatch === true) {
        // Batch mode: append the bulk action line and the document to the
        // batch array created by BatchStart.
        if (content) {
            context.batchArray.push({index: {_index: context.elasticSearch.index, _type: context.elasticSearch.type, _id: key}});
            context.batchArray.push(content);
        }
        callback(null, content, context);
    } else {
        // Single document mode: index the document immediately.
        // initialize is a module helper (not shown here) that creates the
        // client and resolves the index and type from the settings.
        initialize(settings, function(client, index, type) {
            client.index({
                index: index,
                type: type,
                id: key,
                body: content
            }, function (err) {
                client.close();
                callback(err, content, context);
            });
        });
    }
};

General events receive three parameters:

  • context: A Javascript Object with configuration variables and references to utility functions.
  • settings: A Javascript Object with the available configuration properties for the module.
  • callback: A callback function that should be called when the general event function completes its execution. Callback return parameters are: err, context.
exports.BatchStart = function(context, settings, callback){
    initialize(settings, function(client, index, type) {
        context.batchArray = [];
        context.elasticSearch = {};
        context.elasticSearch.client = client;
        context.elasticSearch.index = index;
        context.elasticSearch.type = type;
        callback(null, context);
    });
};

The admin/setContentProcessingModules API call configures content processing modules and module settings for a storage unit. Content processing modules are configured per scope. A default list of modules can be configured for scopes that are not explicitly defined. Each module can define its own list of settings. Events are executed module by module, in the order in which the modules appear in the configuration array.

Content processing modules configuration consists of lists of modules for each scope and general settings.

{
    "modules" : {
        "connector": [ 
            {
                "module" : "FieldMapping"
            },
            {
                "settings" : {
                    "elasticsearch-index" : "aspiredocs",
                    "elasticsearch-type" : "aspiredoc"
                },
                "module" : "ESPublisher"
            }
        ], 
        "index" : [ 
            {
                "module" : "FieldMapping"
            }, 
            {
                "settings" : {
                    "elasticsearch-index" : "researchdocs",
                    "elasticsearch-type" : "researchdoc"
                },
                "module" : "ESPublisher"
            }
        ],
        "research" : [ 
            {
                "module" : "NormalizeCategory"
            }
        ]
    },
    "settings" : {
        "elasticsearch-port" : 9200,
        "elasticsearch-server" : "localhost"
    }
}

In this example, the connector scope will execute events from FieldMapping and ESPublisher, in that order; the index scope will execute FieldMapping and ESPublisher; and the research scope will execute NormalizeCategory. For each event, the application looks for an implementation of that event in each content processing module of the scope. If the module implements the event, the application executes it and then moves on to the next module to look for the same event function, and so on.

General events are typically used to initialize common configuration for document events. This configuration can be set in the context variable, which is shared with all document events that belong to the same general event. For example, all documents that belong to a batch receive the context variable that BatchStart returns, and BatchEnd receives this same context variable, including any modifications that other events made to its data or configuration.
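The lookup and chaining behavior described above can be sketched as a small dispatcher. This is an illustration only: runDocumentEvent and the impl/settings entry shape are hypothetical, and the stand-in modules only record that they ran.

```javascript
// Runs one per document event through an ordered list of modules.
// Each entry is { impl: <module exports>, settings: <module settings> }.
function runDocumentEvent(eventName, modules, key, content, context, done) {
    let i = 0;
    function next(err, currentContent, currentContext) {
        if (err) {
            return done(err, currentContent, currentContext);
        }
        // Skip modules that do not implement this event.
        while (i < modules.length && typeof modules[i].impl[eventName] !== 'function') {
            i++;
        }
        if (i >= modules.length) {
            return done(null, currentContent, currentContext);
        }
        const entry = modules[i++];
        // The content and context returned by one module feed the next one.
        entry.impl[eventName](key, currentContent, currentContext, entry.settings || {}, next);
    }
    next(null, content, context);
}

const trace = [];
const modules = [
    { impl: { Process: function (key, content, context, settings, cb) {
        trace.push('FieldMapping');
        cb(null, content, context);
    } } },
    { impl: { Fetch: function (key, content, context, settings, cb) {
        trace.push('FetchOnly');
        cb(null, content, context);
    } } },
    { settings: { 'elasticsearch-index': 'aspiredocs' },
      impl: { Process: function (key, content, context, settings, cb) {
        trace.push('ESPublisher:' + settings['elasticsearch-index']);
        cb(null, content, context);
    } } }
];

runDocumentEvent('Process', modules, 'doc1', { a: 1 }, {}, function (err) {
    console.log(err, trace);
});
```

The second stand-in module implements only Fetch, so it is skipped when the Process event runs.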

Foreign Key Joins

The Staging Repository provides a content processing module, ForeignKeyJoin, for automatic merging of records from different storage units based on record keys. 1-to-N relations can be specified between storage unit records. If configured, ForeignKeyJoin will run on the Process and Fetch events of each document for the specified scope.

The record's content scope needs to define a field in the content JSON with the format:

"foreignKeys": {
    "FOREIGN_KEY_NAME_1": {
        "storageUnit": "FOREIGN_STORAGE_UNIT_NAME",
        "scope": "FOREIGN_SCOPE",
        "ids": [
            "RECORD_ID_1",
            "RECORD_ID_2",
            ...
            "RECORD_ID_N"
        ]
    },
    "FOREIGN_KEY_NAME_2": {
        "storageUnit": "FOREIGN_STORAGE_UNIT_NAME",
        "scope": "FOREIGN_SCOPE",
        "ids": [
            "RECORD_ID_1",
            "RECORD_ID_2",
            ...
            "RECORD_ID_N"
        ]
    },
    ...
    "FOREIGN_KEY_NAME_N": {
        "storageUnit": "FOREIGN_STORAGE_UNIT_NAME",
        "scope": "FOREIGN_SCOPE",
        "ids": [
            "RECORD_ID_1",
            "RECORD_ID_2",
            ...
            "RECORD_ID_N"
        ]
    }
}

Output:

"PRIMARY_RECORD_SCOPE":{
    ...
    ...
    ...
    FOREIGN_KEY_NAME_1:[
        {
            RECORD_ID_1_FOREIGN_SCOPE_DATA
        },
        {
            RECORD_ID_2_FOREIGN_SCOPE_DATA
        },
        ...
        {
            RECORD_ID_N_FOREIGN_SCOPE_DATA
        }
    ]
}

Example:

  • Record with foreign key references:
"connector":{
    "url": "file:///server/myfolder/file1.txt",
    "content":"test content",
    "foreignKeys": {
        "acls": {
            "storageUnit": "DocAcls",
            "scope": "acls",
            "ids": [
                "1",
                "3",
                "7"
            ]
        }
    }
}
  • Foreign Key Records:
{key:"1", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "user1","type": "user"}}},
{key:"2", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group2","type": "group"}}},
{key:"3", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group3","type": "group"}}},
{key:"4", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group4","type": "group"}}},
{key:"5", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group5","type": "group"}}},
{key:"6", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group6","type": "group"}}},
{key:"7", content:{acls:{"access": "allow","domain": "search","scope": "global","name": "group7","type": "group"}}}
  • Output:
"connector":{
    "url": "file:///server/myfolder/file1.txt",
    "content":"test content",
    "acls": [
        {
            "access": "allow",
            "domain": "search",
            "scope": "global",
            "name": "user1",
            "type": "user"
        },
        {
            "access": "allow",
            "domain": "search",
            "scope": "global",
            "name": "group3",
            "type": "group"
        },
        {
            "access": "allow",
            "domain": "search",
            "scope": "global",
            "name": "group7",
            "type": "group"
        }
    ]
}

To configure the ForeignKeyJoin module, add it to the scope's content processing configuration and make sure the scope content contains the foreignKeys field.

 POST admin/setContentProcessingModules/STORAGE_UNIT
{
    "modules": {
        "connector": [
            {
                "module": "ForeignKeyJoin"
            },
            ...
        ]
    }
}

 

With this configuration, foreign key merges will happen for the connector scope on any Process or Fetch event.
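The merge itself can be sketched as follows. This is not the ForeignKeyJoin source: joinForeignKeys and the lookup function are hypothetical, the lookup is synchronous for brevity (a real lookup against another storage unit would be asynchronous), and dropping the foreignKeys field from the result is inferred from the example output above.

```javascript
// fetchForeignScope(storageUnit, scope, id) is a hypothetical lookup that
// returns the scope content of a foreign record, or undefined if missing.
function joinForeignKeys(scopeContent, fetchForeignScope) {
    const result = Object.assign({}, scopeContent);
    const foreignKeys = result.foreignKeys || {};
    delete result.foreignKeys;
    Object.keys(foreignKeys).forEach(function (name) {
        const ref = foreignKeys[name];
        // Each foreign key name becomes an array of the referenced records.
        result[name] = ref.ids
            .map(function (id) { return fetchForeignScope(ref.storageUnit, ref.scope, id); })
            .filter(function (data) { return data !== undefined; });
    });
    return result;
}

// In-memory stand-in for the DocAcls storage unit from the example.
const docAcls = {
    '1': { access: 'allow', name: 'user1', type: 'user' },
    '3': { access: 'allow', name: 'group3', type: 'group' },
    '7': { access: 'allow', name: 'group7', type: 'group' }
};
function lookup(storageUnit, scope, id) {
    return storageUnit === 'DocAcls' && scope === 'acls' ? docAcls[id] : undefined;
}

const joined = joinForeignKeys({
    url: 'file:///server/myfolder/file1.txt',
    content: 'test content',
    foreignKeys: { acls: { storageUnit: 'DocAcls', scope: 'acls', ids: ['1', '3', '7'] } }
}, lookup);
console.log(JSON.stringify(joined, null, 2));
```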

Publishers

ELASTIC SEARCH PUBLISHER

Content Processing example

Reprocessing Queue and Automatic Updates

Remote Replication

Feature Under construction
