Incremental aggregation
is the process of reading only new and updated records from the source, aggregating
them, adding the results to the pre-aggregated values already in the target, and
updating the target. After each execution of the ETL job, the target holds the
final aggregated values.
Let us see an
example which explains incremental aggregation. Here I am going to explain
incremental aggregation with new transactions only.
Say we have a
transaction header table as follows –
Txn_Hdr_Id | Cust_Id | Txn_Date_Time     | Total_Sales
Txn_Hdr_1  | C1      | 1-Feb-13 10:50:33 | 4590
Txn_Hdr_2  | C2      | 1-Feb-13 11:29:05 | 3750
Txn_Hdr_3  | C4      | 1-Feb-13 16:34:22 | 6790
Txn_Hdr_4  | C5      | 1-Feb-13 17:20:45 | 90000
Transaction Details
-
Txn_Det_Id | Txn_id    | Prod_Id | Qty | Sales_Price | Sale_Amount
Txn_Det_1  | Txn_Hdr_1 | P2      | 10  | 250         | 2500
Txn_Det_2  | Txn_Hdr_1 | P1      | 11  | 190         | 2090
Txn_Det_3  | Txn_Hdr_2 | P2      | 15  | 250         | 3750
Txn_Det_4  | Txn_Hdr_3 | P3      | 5   | 1358        | 6790
Txn_Det_5  | Txn_Hdr_4 | P4      | 100 | 900         | 90000
Say you have to
populate a target table which should hold the total quantities sold every day
by each product. After the first load, the data in the target will be -
Id | Prod_Id | Txn_date | Total_Sold_Quantities
1  | P1      | 1-Feb-13 | 11
2  | P2      | 1-Feb-13 | 25
3  | P3      | 1-Feb-13 | 5
4  | P4      | 1-Feb-13 | 100
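The first-load aggregation above can be sketched in a few lines of Python. This is a minimal illustration only, not any specific ETL tool's API: plain tuples stand in for the source tables, and a dictionary keyed by (product, date) stands in for the target table.

```python
# Sketch of the first full load: join each detail row to its header to get
# the transaction date, then sum quantities per (product, date).
from collections import defaultdict

# (Txn_Det_Id, Txn_id, Prod_Id, Qty, Sales_Price, Sale_Amount)
details = [
    ("Txn_Det_1", "Txn_Hdr_1", "P2", 10, 250, 2500),
    ("Txn_Det_2", "Txn_Hdr_1", "P1", 11, 190, 2090),
    ("Txn_Det_3", "Txn_Hdr_2", "P2", 15, 250, 3750),
    ("Txn_Det_4", "Txn_Hdr_3", "P3", 5, 1358, 6790),
    ("Txn_Det_5", "Txn_Hdr_4", "P4", 100, 900, 90000),
]

# The header table supplies the transaction date for each Txn_Hdr_Id.
txn_date = {
    "Txn_Hdr_1": "1-Feb-13", "Txn_Hdr_2": "1-Feb-13",
    "Txn_Hdr_3": "1-Feb-13", "Txn_Hdr_4": "1-Feb-13",
}

summary = defaultdict(int)  # (Prod_Id, Txn_date) -> total sold quantities
for _, txn_id, prod, qty, _, _ in details:
    summary[(prod, txn_date[txn_id])] += qty

print(dict(summary))
# {('P2', '1-Feb-13'): 25, ('P1', '1-Feb-13'): 11,
#  ('P3', '1-Feb-13'): 5, ('P4', '1-Feb-13'): 100}
```

Note how P2 ends up with 25 (10 + 15), matching the target table above.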
Say we have two
more transactions in the next run as follows –
Txn_Hdr_Id | Cust_Id | Txn_Date_Time     | Total_Sales
Txn_Hdr_1  | C1      | 1-Feb-13 10:50:33 | 4590
Txn_Hdr_2  | C2      | 1-Feb-13 11:29:05 | 3750
Txn_Hdr_3  | C4      | 1-Feb-13 16:34:22 | 6790
Txn_Hdr_4  | C5      | 1-Feb-13 17:20:45 | 90000
Txn_Hdr_5  | C3      | 1-Feb-13 18:30:37 | 9000
Txn_Hdr_6  | C2      | 2-Feb-13 09:11:22 | 15000
Details for those
transactions are -
Txn_Det_Id | Txn_id    | Prod_Id | Qty | Sales_Price | Sale_Amount
Txn_Det_1  | Txn_Hdr_1 | P2      | 10  | 250         | 2500
Txn_Det_2  | Txn_Hdr_1 | P1      | 11  | 190         | 2090
Txn_Det_3  | Txn_Hdr_2 | P2      | 15  | 250         | 3750
Txn_Det_4  | Txn_Hdr_3 | P3      | 5   | 1358        | 6790
Txn_Det_5  | Txn_Hdr_4 | P4      | 100 | 900         | 90000
Txn_Det_6  | Txn_Hdr_5 | P2      | 25  | 250         | 6250
Txn_Det_7  | Txn_Hdr_5 | P5      | 5   | 550         | 2750
Txn_Det_8  | Txn_Hdr_6 | P4      | 100 | 150         | 15000
If we observe the
data, we have two new transactions in the transaction header table (Txn_Hdr_5
and Txn_Hdr_6) and the four old transactions from the previous run. 25 more
quantities of product P2 were sold on 1-Feb-2013 after the first run, and there
are new transactions for products P5 and P4.
The logic for
performing incremental aggregation when we run the ETL job in the next
run should be -
· Ignore old and unchanged records.
· Read the new transactions.
· Perform aggregations on the extracted transactions.
· Get the pre-aggregated values from the target.
· If summarized data is already available for the product in the target, add the current aggregated values to the pre-aggregated values and update the target. Otherwise, insert those records into the summary table.
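The steps above can be sketched as a simple upsert. This is a minimal illustration under the assumption that plain dictionaries stand in for the target table and for the current run's aggregates; a real ETL job would do the same with a lookup plus update/insert against the database.

```python
# Summary table after the first load: (Prod_Id, Txn_date) -> total quantities.
target = {("P1", "1-Feb-13"): 11, ("P2", "1-Feb-13"): 25,
          ("P3", "1-Feb-13"): 5,  ("P4", "1-Feb-13"): 100}

# Aggregates computed from the NEW transactions only (Txn_Hdr_5 and Txn_Hdr_6).
current_run = {("P2", "1-Feb-13"): 25, ("P5", "1-Feb-13"): 5,
               ("P4", "2-Feb-13"): 100}

for key, qty in current_run.items():
    if key in target:
        target[key] += qty   # pre-aggregated value exists: add and update
    else:
        target[key] = qty    # no summary row yet: insert

print(target[("P2", "1-Feb-13")])  # 50
```

P2's row is updated (25 + 25 = 50), while P5 on 1-Feb-13 and P4 on 2-Feb-13 are inserted as new rows.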
After the current
execution of the ETL job, we have to add the current run’s P2 quantities (25)
to the old value (25), giving 50, and insert rows for P5 and P4. The summary
table will then have the following data -
Id | Prod_Id | Txn_date | Total_Sold_Quantities
1  | P1      | 1-Feb-13 | 11
2  | P2      | 1-Feb-13 | 50
3  | P3      | 1-Feb-13 | 5
4  | P4      | 1-Feb-13 | 100
5  | P5      | 1-Feb-13 | 5
6  | P4      | 2-Feb-13 | 100
Please note that
incremental extraction is a prerequisite for incremental aggregation.
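One common way to implement that extraction is a watermark: store the highest Txn_Date_Time processed so far, and on the next run read only headers newer than it. The sketch below assumes ISO-formatted timestamp strings (which compare correctly as text); it is an illustration, not a complete extraction framework.

```python
# (Txn_Hdr_Id, Cust_Id, Txn_Date_Time) - same data as above, ISO timestamps.
headers = [
    ("Txn_Hdr_1", "C1", "2013-02-01 10:50:33"),
    ("Txn_Hdr_2", "C2", "2013-02-01 11:29:05"),
    ("Txn_Hdr_3", "C4", "2013-02-01 16:34:22"),
    ("Txn_Hdr_4", "C5", "2013-02-01 17:20:45"),
    ("Txn_Hdr_5", "C3", "2013-02-01 18:30:37"),
    ("Txn_Hdr_6", "C2", "2013-02-02 09:11:22"),
]

# Watermark persisted after the first run: the latest timestamp processed.
last_run_watermark = "2013-02-01 17:20:45"

# Incremental extraction: keep only headers newer than the watermark.
new_headers = [h for h in headers if h[2] > last_run_watermark]
print([h[0] for h in new_headers])  # ['Txn_Hdr_5', 'Txn_Hdr_6']
```

Only the two new transactions flow into the next aggregation run; the four old ones are ignored.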
I hope this has given
an overall picture of incremental aggregation. Your suggestions and comments
are highly appreciated.
Happy Learning.