DW-BIAnalytics: June 2015

Saturday, 20 June 2015

Incremental Aggregation

Incremental aggregation is the process of reading new and updated records from the source and adding the pre-aggregated values to the current aggregated values and update the target. After executing the ETL job, we will be getting the final aggregated values in the target.

Let us see an example which explains incremental aggregation. In this I am going to explain incremental aggregation with only new transactions.

Say we have transaction header as –

Txn_Hdr_Id	Cust_Id	Txn_Date_Time	Total_Sales
Txn_Hdr_1	C1	1-Feb-13 10:50:33	4590
Txn_Hdr_2	C2	1-Feb-13 11:29:05	3750
Txn_Hdr_3	C4	1-Feb-13 16:34:22	6790
Txn_Hdr_4	C5	1-Feb-13 17:20:45	90000

Transaction Details -

Txn_Det_Id	Txn_id	Prod_Id	Qty	Sales_Price	Sale_Amount
Txn_Det_1	Txn_Hdr_1	P2	10	250	2500
Txn_Det_2	Txn_Hdr_1	P1	11	190	2090
Txn_Det_3	Txn_Hdr_2	P2	15	250	3750
Txn_Det_4	Txn_Hdr_3	P3	5	1358	6790
Txn_Det_5	Txn_Hdr_4	P4	100	900	90000

Say you have to populate the target table which should hold the total quantities sold everyday by each product. After the first load, the data in the target will be -

Id	Prod_Id	Txn_date	Total_Sold_Quantities
1	P1	1-Feb-13	11
2	P2	1-Feb-13	25
3	P3	1-Feb-13	5
4	P4	1-Feb-13	100

Say we have two more transactions in the next run as follows –

Txn_Hdr_Id	Cust_Id	Txn_Date_Time	Total_Sales
Txn_Hdr_1	C1	1-Feb-13 10:50:33	4590
Txn_Hdr_2	C2	1-Feb-13 11:29:05	3750
Txn_Hdr_3	C4	1-Feb-13 16:34:22	6790
Txn_Hdr_4	C5	1-Feb-13 17:20:45	90000
Txn_Hdr_5	C3	1-Feb-13 18:30:37	9000
Txn_Hdr_6	C2	2-Feb-13 09:11:22	15000

Details for those transactions are -

Txn_Det_Id	Txn_id	Prod_Id	Qty	Sales_Price	Sale_Amount
Txn_Det_1	Txn_Hdr_1	P2	10	250	2500
Txn_Det_2	Txn_Hdr_1	P1	11	190	2090
Txn_Det_3	Txn_Hdr_2	P2	15	250	3750
Txn_Det_4	Txn_Hdr_3	P3	5	1358	6790
Txn_Det_5	Txn_Hdr_4	P4	100	900	90000
Txn_Det_6	Txn_Hdr_5	P2	25	250	6250
Txn_Det_7	Txn_Hdr_5	P5	5	550	2750
Txn_Det_8	Txn_Hdr_6	P4	100	150	15000

If we observe the data, we have new transactions in the transaction header table (which are highlighted in blue color and 4 old transactions of previous run (which are highlighted in red color). There are 25 more quantities sold for the product P2 after the first run on 1^st Feb2013 and there are new transactions for products P5 and P4.

The logic in performing incremental extraction when we are running the ETL job in the next run should be -

· We have to ignore old and unchanged records.

· Read new transactions.

· Perform aggregations on the extracted transactions.

· Get pre-aggregated values from the target.

· If summarized data is available for the product in the target, we have to add the current aggregated values to pre-aggregated values and update the target. Otherwise, we have to insert those records into the Summary table.

After the current execution of ETL, we have to add current run’s P2 quantities (25) with old value (25) and insert P4 and P5. Summary table will have the following data -

Id	Prod_Id	Txn_date	Total_Sold_Quantities
1	P1	1-Feb-13	11
2	P2	1-Feb-13	50
3	P3	1-Feb-13	5
4	P4	1-Feb-13	100
5	P5	1-Feb-13	5
6	P4	2-Feb-13	100

Please note that we have to perform incremental extraction to perform incremental aggregation.

I hope I have given an overall picture of incremental aggregation. Your suggestions and Comments are highly appreciated.

Happy Learning.

Monday, 15 June 2015

SEQUENCE in SQLServer

Sequence is a user-defined schema bound object that generates numerical values at regular intervals. It is an auto incremented object introduced in SQL Server 2012. These numerical values can be generated in ascending or descending order a sequence of numeric values according to the specification with which the sequence was created. It can be configured to restart generating sequence numbers again when exhausted (reaches the maximum value). The range of values depends on the data type we are declaring the sequence.

Sequence is similar to identity column. But, there are many differences between them. You can find the comparisons in the following table –

Sno	Identity	Sequence
1	Identity is a table property.	User-defined Object.
2	It is associated to a column of a table.	Can be used in multiple tables. In other words, a single series of numbers can be shared between multiple tables and also can be shared between multiple columns of the same table.
3	It is generated when rows are inserted.	Next sequence number is generated before inserting the row by calling the NEXT VALUE FOR function. The sequence number is allocated when NEXT VALUE FOR is called even if the number is never inserted into a table.
4		The NEXT VALUE FOR function can be used as the default value for a column in a table.
5	Cannot be restarted with the range of values.	It can be restarted generating sequence numbers when it reaches the maximum value.

Syntax for creating SEQUENCE in SQL Server:

$Text Box: CREATE SEQUENCE [schema_name . ] sequence_name [ AS [ built_in_integer_type | user-defined_integer_type ] ] [ START WITH <constant> ] [ INCREMENT BY <constant> ] [ { MINVALUE [ <constant> ] } | { NOMINVALUE } ] [ { MAXVALUE [ <constant> ] } | { NOMAXVALUE } ] [ CYCLE | { NOCYCLE } ] [ { CACHE [ <constant> ] } | { NO CACHE } ] [ ; ]$

Arguments:

sequence_name : User-Defined Name for the sequence

[ built_in_integer_type | user-defined_integer_type ] :

A Sequence can be defined as any integer type. Following are the types can be used to define a sequence –

tinyint : Range 0 to 255

smallint : Range -32,768 to 32,767

int : Range -2,147,483,648 to 2,147,483,647

bigint : Range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

decimal and numeric with a scale of 0.

Any user-defined data type (alias type) that is based on one of the allowed types.

If the data type is not defined while creating the sequnece, Since this is optional, SQL server will create it with the default data type as bigint.

[ START WITH <constant> ] :

This start value should be greater than or equal to minimum value and less than or equal to maximum value. Sequence number will be starting from this number. Default start value is the minimum value for ascending sequence objects and maximum value for the descending values.

[ INCREMENT BY <constant> ] :

This is used to increment or decrement value of the sequence object. This value is passed to NEXT VALUE FOR function for each call. If we want to decrement the values we have to give the negative value here. It cannot be 0(zero). This is optional and it will take 1 as default value.

[ { MINVALUE [ <constant> ] } | { NOMINVALUE } ] :

This specifies the boundary value for the sequence object. This is an optional property. If we are not specifying this, it will take the minimum value of the data type for the sequence object.

[ { MAXVALUE [ <constant> ] } | { NOMAXVALUE } ] :

This specified the other boundary value for the sequence object. This is an optional property. If we are not specifying this, it will take the maximum value of the data type of the object.

[ CYCLE | { NOCYCLE } ] :

This property specifies whether the sequence should restart from the minimum value or not(in case of ascending Sequence). If we specify CYCLE, once it reaches the maximum value, it will be restarted from the minimum value again. By default it is NOCYCLE. In this case, after reaching the maximum value, it throws an exception.

[ { CACHE [ <constant> ] } | { NO CACHE } ] :

This property is used to increase the performance of SEQUENCE object. This is helpful when we are inserting bulk records in a single shot. In this case, for every record NEXT VALUE FOR function is fired, generates a number and that value will be returned back to SQL. This degrades the performance of SQL. Instead of this multiple switches, we can improve the performance by generating set of values at the first invocation of the NEXT VALUE FOR function and keep it in memory and give instructions to SQL Server to use them. If there is any improper shutdown of the server, we will lose the unused numbers. This is disadvantage of this object. By default, it is NO CACHE.

I better leave this here and discuss in my next post.

Information schema view for querying the sequences is – sys.sequences.

Let us see how to use this sequence with examples.

Creating a sequence with Default Values:

SQL Server will take the default values for the properties, if we are not giving the options of sequence while creating it.

The statement below creates the sequence in ‘test’ schema.

CREATE SEQUENCE test.Test_Def_Seq

You can issue below query to know about the sequence.

SELECT * FROM [master].[sys].[sequences] where name='Test_Def_Seq'

You can see all the default values for the properties of Sequence.

Creating a sequence with user defined Properties:

Let us create a sequence which starts with 1 till 100 with the interval as 1 with cache of 10 numbers and cycle. Here is the query -

CREATE SEQUENCE test.Test_Seq

AS INT

START WITH 1

INCREMENT BY 1

MINVALUE 1

MAXVALUE 100

CYCLE

CACHE 10;

In this case sequence will start from 1 till 100 with the interval of 1. Once it reaches the maxvalue which is 100, it again start from minvalue which is 1. And it will generate 10 numbers and keep it in the memory.

Here is the query to get the next value from the sequence object.

SELECT NEXT VALUE FOR Test.Test_SEQ

Output is : 1

When you issue the above SQL statement, everytime it will go to information schema’s sequences tables to get the current_Value and increment value for Test_SEQ sequence, adds both of them and the result will be returned from the NEXT VALUE FOR function.

To create the sequence object, user should have the CREATE SEQUENCE privilege and DROP SEQUENCE to drop the sequence.

Here is the SQL to Drop the Sequence –

DROP SEQUENCE <Sequence_Name>

Query to drop Test_Seq sequence is –

DROP SEQUENCE Test.Test_SEQ;

Let us discuss more about the caches and the usage of Sequences in SQL statements in next post.

All the best and Happy Learning.