

Introduction


The Job Summarizer Executor processes the table data contained in an Aspire job and fetches the associated rows from an Elasticsearch index. Each extracted row is then processed by the summarizers attached to the job.

Job-based summarization

The Job Summarizer Executor summarizes data based on the table structure contained in a job.

Example of supported table structure:

{
  "container": {
    "repItemType": "aspire/folder",
    "seed": {
      "description": "s3",
      "id": "a8c0c88a-d3b4-42fb-b27d-57137ab85154",
      "type": "s3",
      "properties": {
        "tag1": "value1",
        "seed": "/qa-s3-storage/test-level1/split container/",
        "processSplitFiles": "true",
        "usePrefixesForSplitCheck": "true",
        "splitCheckPrefix": "part-"
      },
      "tags": [
        "darwin"
      ]
    },
    "isContainer": "TYPE-NOT-PROVIDED",
    "connectorSpecific": {
      "skippedRows": "0",
      "rowCount": "32622",
      "childId": [
        "/qa-s3-storage/test-level1/split container/part-00000-d91360fd-0995-4af2-9998-39454c778297-c000.parquet",
        "/qa-s3-storage/test-level1/split container/part-00002-d91360fd-0995-4af2-9998-39454c778297-c000.parquet",
        "/qa-s3-storage/test-level1/split container/part-00001-d91360fd-0995-4af2-9998-39454c778297-c000.parquet"
      ]
    },
    "title": "split container",
    "url": "/qa-s3-storage/test-level1/split container/",
    "samples": [{
        "Column1": "text",
        "Column2": null,
        "Column3": 5,
        "Column4": "text",
        "Column5": "745286400000000"
      }
    ],
    "displayurl": "/qa-s3-storage/test-level1/split container/",
    "crawlStart": "2022-06-07T19:58:20Z",
    "ingestionEnd": "2022-06-07T19:58:54Z",
    "submitTime": "2022-06-07T19:58:55+0000",
    "ingestionStart": "2022-06-07T19:58:50Z",
    "dataProfile": {
      "columns": [{
          "technical_tags": "OPTIONAL",
          "nullCount": "0",
          "column_type": "STRING",
          "columnName": "Column1",
          "uniqueCount": "50"
        }, {
          "technical_tags": "OPTIONAL",
          "nullCount": "8472",
          "column_type": "STRING",
          "columnName": "Column2",
          "uniqueCount": "154"
        }, {
          "technical_tags": "OPTIONAL",
          "minValue": "0.0",
          "maxValue": "33.0",
          "meanValue": "11.41498260725533",
          "nullCount": "8472",
          "column_type": "INT32",
          "stdDev": "3.785881246274845",
          "columnName": "Column3",
          "uniqueCount": "30"
        }, {
          "technical_tags": "OPTIONAL",
          "nullCount": "0",
          "column_type": "STRING",
          "columnName": "Column4",
          "uniqueCount": "3"
        }, {
          "technical_tags": [
            "OPTIONAL",
            "AdjustedToUTC",
            "MICROS"
          ],
          "column_type": "TIMESTAMP",
          "columnName": "Column5"
        }
      ]
    }
  },
  "name": "data-container"
}

The table structure must contain information about each column, such as its name and type.
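
As an illustration of this requirement, the column names and types can be read from the `dataProfile.columns` list of the example above. The following Python sketch is not part of Aspire: the helper name `column_types` is hypothetical, and it assumes the job document has already been loaded as a dictionary with the same layout as the example.

```python
import json


def column_types(job_doc):
    """Return {columnName: column_type} for a job's table structure.

    Hypothetical helper for illustration only; it assumes the layout of the
    example (container -> dataProfile -> columns).
    """
    columns = job_doc["container"]["dataProfile"]["columns"]
    return {col["columnName"]: col["column_type"] for col in columns}


# Trimmed-down copy of the example table structure.
job_doc = json.loads("""
{
  "container": {
    "dataProfile": {
      "columns": [
        {"columnName": "Column1", "column_type": "STRING"},
        {"columnName": "Column3", "column_type": "INT32"},
        {"columnName": "Column5", "column_type": "TIMESTAMP"}
      ]
    }
  },
  "name": "data-container"
}
""")

print(column_types(job_doc))
```

A structure without `columnName` or `column_type` entries would raise a `KeyError` in this sketch, which mirrors the requirement that both be present.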

Fetch rows from Elasticsearch

The table rows are extracted from an Elasticsearch index. Two row formats are supported:

Based on published unique values:

{
  "name": "column-value",
  "value": {
    "pctg": "0.04966430607927895",
    "seedId": "a8c0c88a-d3b4-42fb-b27d-57137ab85154",
    "count": "1620",
    "tableId": "/qa-s3-storage/test-level1/split container/",
    "value": "text",
    "columnName": "Column1"
  }
}

Single level key-value objects:

{
  "Column1": "text",
  "Column2": null,
  "Column3": 5,
  "Column4": "text",
  "Column5": "745286400000000"
}
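
To make the difference between the two formats concrete, the Python sketch below classifies a row document and reduces it to (column, value) pairs. It is an illustration under stated assumptions, not the executor's actual logic: the function names `row_format` and `to_cells` are hypothetical, and the detection rule (a `name` of `column-value` plus a nested `value` object carrying `columnName`) is inferred only from the two samples above.

```python
def row_format(doc):
    """Classify a row document as 'unique-value' or 'key-value'.

    Heuristic inferred from the samples; the real executor may differ.
    """
    value = doc.get("value")
    if (
        doc.get("name") == "column-value"
        and isinstance(value, dict)
        and "columnName" in value
    ):
        return "unique-value"
    return "key-value"


def to_cells(doc):
    """Return a list of (column, value) pairs for either row format."""
    if row_format(doc) == "unique-value":
        inner = doc["value"]
        return [(inner["columnName"], inner["value"])]
    # Single-level key-value object: every key is a column name.
    return list(doc.items())


unique_row = {
    "name": "column-value",
    "value": {
        "pctg": "0.04966430607927895",
        "count": "1620",
        "value": "text",
        "columnName": "Column1",
    },
}
flat_row = {"Column1": "text", "Column2": None, "Column3": 5}

print(to_cells(unique_row))
print(to_cells(flat_row))
```

A published unique value describes a single column, so it yields one pair, while a single-level object yields one pair per column.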

Row Filtering

The Job Summarizer Executor can be configured with a Groovy script that decides which rows are processed.

Example:

Row Filter
// This script must return a boolean.
// The references of the job, doc, component, row and table objects are available.
// Javadoc references 
// Row (row) - http://{manager}/javadocs/com/accenture/aspire/services/summarization/Row.html
// Table (table) - http://{manager}/javadocs/com/accenture/aspire/services/summarization/Table.html
row.getBoolean("sensitive") == true