Non-Clustered Column Store Index in SQL Server 2012 by Srikanth Manda

4213
row store

Hi Friends,

I would like to add a pretty good article on “Non-Clustered Column Store Index in SQL Server 2012” from my friend “Srikanth Manda”. Hope you will enjoy this.

Everyone agree the fact that hardware speed and capacity has increased past two or three decades, but disk I/O (Input/Output) or disk access or data transfer rate has not grown up to the expected level and is still the slow. One key point to remember, as time is moving forward the size of the database become larger and larger. Data present now would increase by almost 10 times in next 2 to 3 years from no. We have provide a technology in SQL Server which can be addressed this kind of data growth with Data Warehouses. Secondly, the size becomes bigger the query performance is also very critical. Customers would like to have a response like a inter active, they want to have large amount of data, they want to process the data and get the results in the query like attractive fashion. Thirdly, that we are seeing is Data Warehouse has become more like a commodity and provide Data Warehouse technology to masses. Finally, the amount of data in data warehouse (DWH) is growing tremendously day by day. When you want to retrieve (Query) data from Data Warehouse, it takes quite huge amount of time. This would degrade the performance of the Data Warehouse. All these issues can be addressed by Non-Clustered Column Store Index.
In the Article, We will learn about this new feature, how can we build this, how it is in SQL Server, how exactly the data is stored, what happens underneath the engine, how this improves performance of Data Warehousing Queries.
In any traditional relational DBMS, the data is stored as rows (B-Tree format). Like, Microsoft SQL Server stores rows in a page of size of 8 K. If you have a row of 10 columns, you store Row 1 , Row 2 and when page becomes full the page 8 K, then Row goes to second page and so on. This is how the data is stored, successfully formats and successfully for OLTP Workloads. For example consider the image below the data for ten columns for each row gets stored together contiguously on the same page and once the data is full and the row goes to second page.

What has changed is, instead of storing data in the row format other way to look out is can I store data in the Column Store format. For example, I have a table with C1 to C10 columns, instead of storing as rows will store as columns. Then we have storage as Column C1, C2… C10. When we store data in the column store format, we get very good compression. The reason is data from same column is of same type and domain, so it compresses very well. For example, A company is operating globally throughout the world. All the employees from India, there mention the Country as India. Similar, employee from US would mention as ‘US’ as Country. Here, Column with Country would be compressed because it is a repetitive pattern. This kind of opportunity is available in Column Store Format rather than Row Store Format.
In the Row Store Format, data stored for all ten columns C1, C2, C3, …., C10. If we want to retrieve only columns like C1,C3,C5. What happens in the Row Store Format is we need read/fetch data for the entire row of 10 columns then predicate is applied for the specified columns. But, in case of Column Store Format, we can fetch only the required columns i.e.; Columns C1,C2,C3 etc. In this case, it reduces I/O and data fits in memory with which you get much improved performance. You can improve how the query is processed using Column Store technology that gives much better response time.
If we create Non-Clustered Column Store Index, the data is stored in column format.

If we store data in column format, suppose we store 10 million rows, we cannot store all 10 million rows of column C1 as storage unit. What we do is we break those rows into smaller chunks, which we call have as row group.

We have grouped the rows of 1 million; call it as Row Group Chunk. In each Row Group which has 10 columns here and each column is stored in its segment. It would be 10 segments. The benefit of storing each column in segment, when I want to rows of columns C1, C2, then I just get segment for column C1, segment for column C2.

Note: Blue color box are nothing but segments.
Important Points to remember:
1) Row group
• set of rows (typically 1 million)
2) Column Segment
• Contains values from one column from row group
3) Segments are individually compressed
4) Each segment stored separately as LOB’s as Binary Format
5) Segment is unit of transfer between disk and memory

New Batch Processing Mode
1) Some of the more expensive operators(Hash Match for joins and aggregations) utilize a new execution mode called Batch Mode
2) Batch mode takes advantage of advanced hardware architectures, processor cache and RAM improves parallelism
3) Packets of about 1000 rows are passed between operators, with column data represented as a vector
4) Reduces CPU usage by factor of 10(sometimes up to a factor of 40)
5) Much faster than row-mode processing
6) Other execution plan operators that use batch processing mode are bitmap filter, filter, compute scalar
7) Include all columns in a ColumnStore Index

Batch Mode restrictions:
1) Queries using OUTER Join directly against ColumnStore data, NOT IN (Sub query), UNION ALL won’t leverage batch mode, will revert to row processing mode
Examples:
1) In this Demo, Creating two tables i.e.; one with regular index and other with Non-Clustered ColumnStore Index. Below is the scrip to create two tables
Table with Regular Index
CREATE TABLE [dbo].[FactInternetSalesWithRegularIndex](
[DummyIdentity] [int] IDENTITY(1,1) NOT NULL,
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[OrderQuantity] [smallint] NULL,
[SalesAmount] [money] NULL
CONSTRAINT [PK_FactInternetSalesWithRegularIndex_ProductKey_OrderDateKey]
PRIMARY KEY CLUSTERED
(
[DummyIdentity] ASC,
[ProductKey] ASC
)) ON [PRIMARY]

Table with Non-Clustered ColumnStore Index

CREATE TABLE [dbo].[FactInternetSalesWithColumnStoreIDX](
[DummyIdentity] [int] IDENTITY(1,1) NOT NULL,
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[OrderQuantity] [smallint] NULL,
[SalesAmount] [money] NULL
CONSTRAINT [PK_FactInternetSalesWithColumnStoreIDX_ProductKey_OrderDateKey]
PRIMARY KEY CLUSTERED
(
[DummyIdentity] ASC,
[ProductKey] ASC
)) ON [PRIMARY]

GO

2) Insert data into both tables. Here is the insert script
Insert Script for FactInternetSalesWithRegularIndex Table
INSERT INTO FactInternetSalesWithRegularIndex
(
ProductKey, OrderDateKey,
OrderQuantity,SalesAmount
)
SELECT
ProductKey,OrderDateKey,
OrderQuantity,SalesAmount
FROM [AdventureWorksDW2012].dbo.[FactInternetSales]

GO 50

Insert Script for FactInternetSalesWithColumnStoreIDX Table
INSERT INTO FactInternetSalesWithColumnStoreIDX
(
ProductKey, OrderDateKey,
OrderQuantity,SalesAmount
)
SELECT
ProductKey,OrderDateKey,
OrderQuantity,SalesAmount
FROM [AdventureWorksDW2012].dbo.[FactInternetSales]

GO 50

3) And finally I want to create a regular non-cluster index (on ProductKey and Salesamount columns) on the first table, and column store index on the second table, which will include ProductKey and Salesamount columns.

CREATE NONCLUSTERED INDEX [NC_FactInternetSalesWithRegularIndex_ProductKey_Salesamount]
ON FactInternetSalesWithRegularIndex
(ProductKey,Salesamount)
GO

CREATE NONCLUSTERED COLUMNSTORE INDEX [CS_FactInternetSalesWithColumnStoreIDX_ProductKey_Salesamount]
ON FactInternetSalesWithColumnStoreIDX
(ProductKey,Salesamount)

GO

4) Execution of Queries

When I ran the query with STATISTICS IO ON, I found stunning results (with significant performance) of using column store index vs regular index, as you can see below:

SET STATISTICS IO ON

Select ProductKey,sum(Salesamount)
from FactInternetSalesWithRegularIndex
GROUP BY ProductKey
ORDER BY ProductKey

Select ProductKey,sum(Salesamount)
from FactInternetSalesWithColumnStoreIDX
GROUP BY ProductKey
ORDER BY ProductKey

SET STATISTICS IO OFF

Result:

(158 row(s) affected)
Table ‘FactInternetSalesWithRegularIndex’. Scan count 5, logical reads 4339, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1342 ms, elapsed time = 504 ms.

(158 row(s) affected)
Table ‘FactInternetSalesWithColumnStoreIDX’. Scan count 4, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 27 ms.

Even the time required to run these two queries greatly varied, the queries with regular index took 1342 ms for CPU cycle and 504 ms as elapsed time vs just 47 ms for CPU cycle and 27 ms as elapsed time for the second query, which uses column store index

The relative cost of the second query (which uses column store index) is just 11% as opposed to the relative cost of first query (which uses regular index) which is 89%.

For column store index exclusively, SQL Server 2012 introduces a new execution mode called Batch Mode, which processes batches of rows (as opposed to the row by row processing in case of regular index) that is optimized for multicore CPUs and increased memory throughput of modern hardware architecture. It also introduced a new operator for column store index processing as shown below:

Restrictions:
1) Cannot be clustered
2) Cannot act as PK or FK
3) Does not include sparse columns
4) Can’t be used with tables that are part of Change Data Capture or FileStream data
5) Cannot be used with certain data types, such as binary, text/image, row version /timestamp, CLR data types (hierarchyID/spatial), nor with data types Created with Max keyword eg: varchar(max)
6) Cannot be modified with an Alter – must be dropped and recreated
7) Can’t participate in replication
8) It’s a read-only index
a. Cannot insert rows and expect column store index be maintained

What’s New in SQL Server 2014
Columnstore index has been designed to substantially increase performance of data warehouse queries, which require aggregation and filtering of large amounts of data or joining multiple tables (primarily performs bulk loads and read-only queries).
There were several limitations in SQL Server 2012, SQL Server 2014 overcomes them:
1) We can create only one non-clustered column store index which can include all or few columns of table in a single index on a table.
2) SQL Server 2014 has come up with an enhancement of creating Clustered Column Store Index.
3) SQL Server 2012, when we create a Non Clustered Column Store index then it makes table read only.
4) With SQL Server 2014, you can create a Clustered Column Store Index without any impact on the insertion on table. You can issue some INSERT, UPDATE, DELETE statements with a table with clustered column store index. No more workaround is required for writing data to a table with Non Clustered Column Store Index like drop the existing one and re-create the index.

Hope you enjoyed the post ..
Thanks,
Srikanth Manda