The Job Summarizer Executor can process the table data contained in an Aspire job and fetch the associated rows from an Elasticsearch index. Each fetched row is processed by the summarizers attached to the job.
Temporary Files
The executor can download the content of the file into a local temporary file to reduce memory usage.
Rows Filtering
The Job Summarizer Executor summarizes data based on the table structure contained in a job.
Example of a supported table structure:
```json
{
  "container": {
    "repItemType": "aspire/folder",
    "seed": {
      "description": "s3",
      "id": "a8c0c88a-d3b4-42fb-b27d-57137ab85154",
      "type": "s3",
      "properties": {
        "tag1": "value1",
        "seed": "/qa-s3-storage/test-level1/split container/",
        "processSplitFiles": "true",
        "usePrefixesForSplitCheck": "true",
        "splitCheckPrefix": "part-"
      },
      "tags": [
        "darwin"
      ]
    },
    "isContainer": "TYPE-NOT-PROVIDED",
    "connectorSpecific": {
      "skippedRows": "0",
      "rowCount": "32622",
      "childId": [
        "/qa-s3-storage/test-level1/split container/part-00000-d91360fd-0995-4af2-9998-39454c778297-c000.parquet",
        "/qa-s3-storage/test-level1/split container/part-00002-d91360fd-0995-4af2-9998-39454c778297-c000.parquet",
        "/qa-s3-storage/test-level1/split container/part-00001-d91360fd-0995-4af2-9998-39454c778297-c000.parquet"
      ]
    },
    "title": "split container",
    "url": "/qa-s3-storage/test-level1/split container/",
    "samples": [
      {
        "Column1": "text",
        "Column2": null,
        "Column3": 5,
        "Column4": "text",
        "Column5": "745286400000000"
      }
    ],
    "displayurl": "/qa-s3-storage/test-level1/split container/",
    "crawlStart": "2022-06-07T19:58:20Z",
    "ingestionEnd": "2022-06-07T19:58:54Z",
    "submitTime": "2022-06-07T19:58:55+0000",
    "ingestionStart": "2022-06-07T19:58:50Z",
    "dataProfile": {
      "columns": [
        {
          "technical_tags": "OPTIONAL",
          "nullCount": "0",
          "column_type": "STRING",
          "columnName": "Column1",
          "uniqueCount": "50"
        },
        {
          "technical_tags": "OPTIONAL",
          "nullCount": "8472",
          "column_type": "STRING",
          "columnName": "Column2",
          "uniqueCount": "154"
        },
        {
          "technical_tags": "OPTIONAL",
          "minValue": "0.0",
          "maxValue": "33.0",
          "meanValue": "11.41498260725533",
          "nullCount": "8472",
          "column_type": "INT32",
          "stdDev": "3.785881246274845",
          "columnName": "Column3",
          "uniqueCount": "30"
        },
        {
          "technical_tags": "OPTIONAL",
          "nullCount": "0",
          "column_type": "STRING",
          "columnName": "Column4",
          "uniqueCount": "3"
        },
        {
          "technical_tags": [
            "OPTIONAL",
            "AdjustedToUTC",
            "MICROS"
          ],
          "column_type": "TIMESTAMP",
          "columnName": "Column5"
        }
      ]
    }
  },
  "name": "data-container"
}
```
The table structure must contain information about the columns, such as their type and name (column_type and columnName in the example above).
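To make the column requirement concrete, the following Python sketch reads the column metadata out of a trimmed copy of the example job. It is an illustration written outside Aspire, not the executor's code: the function names column_profiles and micros_to_utc are hypothetical, and the timestamp conversion assumes that a TIMESTAMP column tagged MICROS and AdjustedToUTC stores microseconds since the Unix epoch in UTC.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the example job; only the fields used below are kept.
job_json = """
{
  "container": {
    "samples": [{"Column1": "text", "Column5": "745286400000000"}],
    "dataProfile": {
      "columns": [
        {"technical_tags": "OPTIONAL", "column_type": "STRING",
         "columnName": "Column1"},
        {"technical_tags": ["OPTIONAL", "AdjustedToUTC", "MICROS"],
         "column_type": "TIMESTAMP", "columnName": "Column5"}
      ]
    }
  },
  "name": "data-container"
}
"""


def column_profiles(job: dict) -> dict:
    """Map each column name to its type and tags (hypothetical helper)."""
    profiles = {}
    for col in job["container"]["dataProfile"]["columns"]:
        name = col.get("columnName")
        col_type = col.get("column_type")
        if not name or not col_type:
            raise ValueError(f"column entry lacks a name or type: {col}")
        # technical_tags is a single string for some columns and a list for others.
        tags = col.get("technical_tags", [])
        if isinstance(tags, str):
            tags = [tags]
        profiles[name] = {"type": col_type, "tags": tags}
    return profiles


def micros_to_utc(value: str) -> datetime:
    """Interpret a value as microseconds since the Unix epoch (assumption)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=int(value))


job = json.loads(job_json)
profiles = column_profiles(job)
sample = job["container"]["samples"][0]

for name, profile in profiles.items():
    value = sample.get(name)
    if profile["type"] == "TIMESTAMP" and "MICROS" in profile["tags"]:
        value = micros_to_utc(value).isoformat()
    print(name, profile["type"], value)
```

Normalizing technical_tags to a list up front keeps later checks such as the MICROS test uniform, whichever form a column uses.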
The table rows are fetched from an Elasticsearch index. Two row formats are supported:
Based on published unique values:
```json
{
  "name": "column-value",
  "value": {
    "pctg": "0.04966430607927895",
    "seedId": "a8c0c88a-d3b4-42fb-b27d-57137ab85154",
    "count": "1620",
    "tableId": "/qa-s3-storage/test-level1/split container/",
    "value": "text",
    "columnName": "Column1"
  }
}
```
Single level key-value objects:
```json
{
  "Column1": "text",
  "Column2": null,
  "Column3": 5,
  "Column4": "text",
  "Column5": "745286400000000"
}
```
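The difference between the two row formats can be illustrated with a short Python sketch that reduces either one to (column name, value) pairs. It is an illustration only: the helper name to_column_values is hypothetical, and treating "name": "column-value" together with a nested "value" object as the marker of a published unique value is an assumption drawn from the examples above, not a documented rule.

```python
def to_column_values(doc: dict) -> list:
    """Return (column name, value) pairs for either supported row format."""
    inner = doc.get("value")
    if doc.get("name") == "column-value" and isinstance(inner, dict):
        # Published unique value: the document describes one value of one
        # column, with its statistics (count, pctg) alongside.
        return [(inner["columnName"], inner["value"])]
    # Single-level key-value object: every key is a column name.
    return list(doc.items())


unique_value_doc = {
    "name": "column-value",
    "value": {
        "pctg": "0.04966430607927895",
        "seedId": "a8c0c88a-d3b4-42fb-b27d-57137ab85154",
        "count": "1620",
        "tableId": "/qa-s3-storage/test-level1/split container/",
        "value": "text",
        "columnName": "Column1",
    },
}

flat_doc = {
    "Column1": "text",
    "Column2": None,
    "Column3": 5,
    "Column4": "text",
    "Column5": "745286400000000",
}

print(to_column_values(unique_value_doc))
print(to_column_values(flat_doc))
```

Because a single-level object cannot hold a nested object, the isinstance check keeps a flat row that happens to have columns called name and value from being mistaken for the first format.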
The Job Summarizer Executor can be configured with a Groovy script that filters which rows are processed.
Example:
```groovy
// This script must return a boolean.
// The references of the job, doc, component, row and table objects are available.
// Javadoc references
// Row (row) - http://{manager}/javadocs/com/accenture/aspire/services/summarization/Row.html
// Table (table) - http://{manager}/javadocs/com/accenture/aspire/services/summarization/Table.html
row.getBoolean("sensitive") == true
```