Saturday 20 June 2015

Incremental Aggregation

Incremental aggregation is the process of reading new and updated records from the source and adding the pre-aggregated values to the current aggregated values and update the target. After executing the ETL job, we will be getting the final aggregated values in the target.

Let us see an example which explains incremental aggregation. In this I am going to explain incremental aggregation with only new transactions.

Say we have transaction header as – 
Txn_Hdr_Id
Cust_Id
Txn_Date_Time
Total_Sales
Txn_Hdr_1
C1
1-Feb-13 10:50:33
4590
Txn_Hdr_2
C2
1-Feb-13 11:29:05
3750
Txn_Hdr_3
C4
1-Feb-13 16:34:22
6790
Txn_Hdr_4
C5
1-Feb-13 17:20:45
90000

Transaction Details -
Txn_Det_Id
Txn_id
Prod_Id
Qty
Sales_Price
Sale_Amount
Txn_Det_1
Txn_Hdr_1
P2
10
250
2500
Txn_Det_2
Txn_Hdr_1
P1
11
190
2090
Txn_Det_3
Txn_Hdr_2
P2
15
250
3750
Txn_Det_4
Txn_Hdr_3
P3
5
1358
6790
Txn_Det_5
Txn_Hdr_4
P4
 100
900
90000

Say you have to populate the target table which should hold the total quantities sold everyday by each product. After the first load, the data in the target will be - 
Id
Prod_Id
Txn_date
Total_Sold_Quantities
1
P1
1-Feb-13
11
2
P2
1-Feb-13
25
3
P3
1-Feb-13
5
4
P4
1-Feb-13
100

Say we have two more transactions in the next run as follows – 
Txn_Hdr_Id
Cust_Id
Txn_Date_Time
Total_Sales
Txn_Hdr_1
C1
1-Feb-13 10:50:33
4590
Txn_Hdr_2
C2
1-Feb-13 11:29:05
3750
Txn_Hdr_3
C4
1-Feb-13 16:34:22
6790
Txn_Hdr_4
C5
1-Feb-13 17:20:45
90000
Txn_Hdr_5
C3
1-Feb-13 18:30:37
9000
Txn_Hdr_6
C2
2-Feb-13 09:11:22
15000

Details for those transactions are -
Txn_Det_Id
Txn_id
Prod_Id
Qty
Sales_Price
Sale_Amount
Txn_Det_1
Txn_Hdr_1
P2
10
250
2500
Txn_Det_2
Txn_Hdr_1
P1
11
190
2090
Txn_Det_3
Txn_Hdr_2
P2
15
250
3750
Txn_Det_4
Txn_Hdr_3
P3
5
1358
6790
Txn_Det_5
Txn_Hdr_4
P4
 100
900
90000
Txn_Det_6
Txn_Hdr_5
P2
25
250
6250
Txn_Det_7
Txn_Hdr_5
P5
5
550
2750
Txn_Det_8
Txn_Hdr_6
P4
100
150
15000

If we observe the data, we have new transactions in the transaction header table (which are highlighted in blue color and 4 old transactions of previous run (which are highlighted in red color). There are 25 more quantities sold for the product P2 after the first run on 1st Feb2013 and there are new transactions for products P5 and P4.

The logic in performing incremental extraction when we are running the ETL job in the next run should be -
·         We have to ignore old and unchanged records.
·         Read new transactions.
·         Perform aggregations on the extracted transactions.
·         Get pre-aggregated values from the target.
·         If summarized data is available for the product in the target, we have to add the current aggregated values to pre-aggregated values and update the target. Otherwise, we have to insert those records into the Summary table. 

After the current execution of ETL, we have to add current run’s P2 quantities (25) with old value (25) and insert P4 and P5. Summary table will have the following data - 
Id
Prod_Id
Txn_date
Total_Sold_Quantities
1
P1
1-Feb-13
11
2
P2
1-Feb-13
50
3
P3
1-Feb-13
5
4
P4
1-Feb-13
100
5
P5
1-Feb-13
5
6
P4
2-Feb-13
100

Please note that we have to perform incremental extraction to perform incremental aggregation. 
I hope I have given an overall picture of incremental aggregation. Your suggestions and Comments are highly appreciated.

Happy Learning.


1 comment: