The Summarization Framework is a set of workflow components introduced in Aspire 5.0.3. The framework processes and profiles tabular data such as RDB tables or Parquet files. For each tabular document received, the framework extracts and processes each row, and for each row it processes each column value to generate a profile report of the whole table. At the end of the process, the generated data is added to the table document as additional metadata.

Components


The framework is split into two kinds of components: executors and summarizers.

Summarizers


The summarizers are the components that process each of the rows and columns, creating the data profile that is added to the table document at the end of the process. Each summarizer is specialized to gather a different kind of information, such as samples of the processed data or statistics like the minimum or maximum value of a numerical column.
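
To make the idea concrete, here is a minimal sketch of two summarizers, one collecting statistics and one collecting samples. It is not Aspire's actual API: the class names, the `process_row` and `profile` methods, and the dictionary-based row are all assumptions made for illustration.

```python
from typing import Any, Dict, List


class MinMaxSummarizer:
    """Hypothetical summarizer: tracks min and max for each numeric column."""

    def __init__(self) -> None:
        self.stats: Dict[str, Dict[str, float]] = {}

    def process_row(self, row: Dict[str, Any]) -> None:
        for column, value in row.items():
            # Skip non-numeric values; bool is excluded because it subclasses int.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            entry = self.stats.setdefault(column, {"min": value, "max": value})
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)

    def profile(self) -> Dict[str, Dict[str, float]]:
        return self.stats


class SampleSummarizer:
    """Hypothetical summarizer: keeps the first `limit` values of each column."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.samples: Dict[str, List[Any]] = {}

    def process_row(self, row: Dict[str, Any]) -> None:
        for column, value in row.items():
            collected = self.samples.setdefault(column, [])
            if len(collected) < self.limit:
                collected.append(value)

    def profile(self) -> Dict[str, List[Any]]:
        return self.samples


# Example rows, invented for this sketch.
rows = [
    {"id": 1, "price": 9.5, "name": "a"},
    {"id": 2, "price": 3.0, "name": "b"},
    {"id": 3, "price": 12.25, "name": "c"},
]

min_max = MinMaxSummarizer()
sampler = SampleSummarizer(limit=2)
for row in rows:
    min_max.process_row(row)
    sampler.process_row(row)
```

Each summarizer keeps its own state and sees one row at a time, so several of them can be applied to the same table in a single pass.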

Executors


The executors are the components that know how to extract the rows, and the schema describing how the table is structured, from the table document, depending on the document type (RDB, Parquet, SAS, etc.). For each extracted row, the executor calls each of the configured summarizers.
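
As a rough illustration of the executor role, the sketch below reads a relational table from an in-memory SQLite database, derives the column names as the schema, and passes every row to each summarizer. SQLite stands in for an RDB source here; the class names, the `run` method, and the `products` table are hypothetical and do not come from Aspire.

```python
import sqlite3
from typing import Any, Dict, List


class RowCountSummarizer:
    """Hypothetical summarizer: counts the rows it receives."""

    def __init__(self) -> None:
        self.count = 0

    def process_row(self, row: Dict[str, Any]) -> None:
        self.count += 1

    def profile(self) -> Dict[str, int]:
        return {"rowCount": self.count}


class SqliteTableExecutor:
    """Hypothetical executor for one relational table."""

    def __init__(self, connection: sqlite3.Connection, table: str) -> None:
        self.connection = connection
        self.table = table

    def schema(self) -> List[str]:
        # cursor.description lists the column names even when no rows are returned.
        cursor = self.connection.execute(f'SELECT * FROM "{self.table}" LIMIT 0')
        return [column[0] for column in cursor.description]

    def run(self, summarizers: List[Any]) -> None:
        columns = self.schema()
        for values in self.connection.execute(f'SELECT * FROM "{self.table}"'):
            row = dict(zip(columns, values))
            # Every configured summarizer is called once per extracted row.
            for summarizer in summarizers:
                summarizer.process_row(row)


# Example table, invented for this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE products (id INTEGER, price REAL)")
connection.executemany("INSERT INTO products VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

counter = RowCountSummarizer()
executor = SqliteTableExecutor(connection, "products")
executor.run([counter])
```

An executor for another document type, such as Parquet, would differ only in how it obtains the schema and the rows; the loop over summarizers would stay the same.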

How they work


For the summarizers and executors to work, they must be configured in a specific order in the workflow: every summarizer to be used must be added before the executor component.


Framework workflow steps


1. Each summarizer in the workflow attaches itself to each document received, creating a chain of attached summarizers.
2. The executor fetches the table rows and the schema.
3. For each row of the table, the executor calls the attached summarizers.
4. The summarizers process each row received, gathering information for the table profile.
5. When all rows are processed, the summarizers return their profile to the executor.
6. The executor merges the results from all summarizers and adds them to the table document. 
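
The six steps above can be sketched end to end as follows. This is a simplified model, not the framework's implementation: the document is a plain dictionary that already holds its schema and rows, and the names `attach`, `execute`, `_summarizers`, and `metadata` are assumptions chosen for the example.

```python
from typing import Any, Dict


class RowCountSummarizer:
    """Hypothetical summarizer: counts rows."""

    name = "rowCount"

    def __init__(self) -> None:
        self.count = 0

    def process_row(self, row: Dict[str, Any]) -> None:
        self.count += 1

    def profile(self) -> int:
        return self.count


class NullCountSummarizer:
    """Hypothetical summarizer: counts missing values per column."""

    name = "nullCounts"

    def __init__(self) -> None:
        self.nulls: Dict[str, int] = {}

    def process_row(self, row: Dict[str, Any]) -> None:
        for column, value in row.items():
            self.nulls[column] = self.nulls.get(column, 0) + (1 if value is None else 0)

    def profile(self) -> Dict[str, int]:
        return self.nulls


def attach(document: Dict[str, Any], summarizer: Any) -> None:
    # Step 1: the summarizer joins the chain carried by the document.
    document.setdefault("_summarizers", []).append(summarizer)


def execute(document: Dict[str, Any]) -> Dict[str, Any]:
    summarizers = document.pop("_summarizers", [])
    schema = document["schema"]                # Step 2: fetch schema and rows.
    for values in document["rows"]:
        row = dict(zip(schema, values))
        for summarizer in summarizers:         # Step 3: call each attached summarizer.
            summarizer.process_row(row)        # Step 4: the summarizer gathers information.
    # Steps 5 and 6: collect each profile and merge them into the document.
    document["metadata"] = {s.name: s.profile() for s in summarizers}
    return document


# Example document, invented for this sketch.
document = {
    "schema": ["id", "email"],
    "rows": [(1, "a@x.org"), (2, None), (3, None)],
}
attach(document, RowCountSummarizer())
attach(document, NullCountSummarizer())
result = execute(document)
```

Because the summarizers are attached before `execute` runs, the sketch also shows why the summarizers must precede the executor in the workflow: an executor that runs first would find no chain to call.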
