One of the great benefits of clustered Columnstore indexes in SQL Server 2014 is they are writeable, which means you can insert, update and delete data. This is all well and good, but I think there really should be an asterisk beside update* and delete* with a disclaimer that says something like “Deletes aren’t really deletes and updates aren’t really updates”.
While on the surface, SQL Server appears to be processing the deletes and updates normally, under the hood it’s really just modifying the existing rows and/or adding new ones.
First, let’s go over a quick summary of the anatomy of a clustered Columnstore index. The data in a clustered Columnstore index is stored in two parts. They are comprised of columnstore section and a rowstore section, which is referred to as the Deltastore.
What SQL Server does under the covers is create a Columnstore index for the data that already resides in the table and marks it as read-only. The Deltastore is used to handle all of the inserts and updates that are made to the clustered Columnstore index. The Delete Bitmap is used to mark rows as deleted.
So if you’re keeping score at home, this is how it works:
For INSERTS:
- Inserts are initially added to the Deltastore section.
- A background process (tuple mover) eventually compresses and adds those rows to the Columnstore section as new row group.
For DELETES:
- For deleted rows in the Columnstore section, SQL Server only marks them as “deleted” but doesn’t actually remove the row. Those deleted rows will remain around until the clustered Columnstore index is rebuilt.
- Deleted rows that are in the Deltastore are actually deleted.
For UPDATES:
- Updates are treated as a delete + insert. Which means the original version of the updated row is marked as deleted in the Delete Bitmap, and then the new row with the updated values is inserted into the Deltastore.
- Updated rows that are in the Deltastore are actually updated.
To fully understand how that Deleted Bitmap can affect your performance, let’s look at an example.
I am using the dbo.FactInternetSales table in the AdventureWorksDW2012 database. I used a script by Kalen Delaney to blow up the size of the table to just over 247 million rows and takes up about 2.6GB in disk space.
Using the script below, I created a clustered Columnstore index on the table.
CREATE CLUSTERED COLUMNSTORE INDEX CluCSidx_FactInternetSales ON dbo.FactInternetSales; GO
You can use the system catalog, sys.column_store_row_groups to monitor the number of “deleted” rows in a clustered Columnstore index. In our example, you can run the following query to see the number of total row and the number of deleted rows.
SELECT SUM(total_rows) AS total_rows ,SUM(deleted_rows) AS deleted_rows FROM sys.column_store_row_groups WHERE object_id = OBJECT_ID('FactInternetSales'); GO
Let’s now “delete” a bunch of rows. I used a script by Pinal Dave to randomly delete several million rows from the table. What’s left over is a table that hasn’t changed at all. Even though I just deleted about 65 million rows, the physical size is still the same; 2.6GB.
If we look at sys.column_store_row_groups, we can see that we’re now carrying those 65 million rows around as extra baggage, and this explains why the table size never changed.
SELECT SUM(total_rows) AS total_rows ,SUM(deleted_rows) AS deleted_rows FROM sys.column_store_row_groups WHERE object_id = OBJECT_ID('FactInternetSales'); GO
Let’s run a couple of sample queries and see what our performance is like.
SELECT COUNT(*) FROM dbo.FactInternetSales; GO
The results are not too bad. The count hit all rows in the table and still return in 12 seconds and only needed 1709 logical reads. Let’s run another query.
SELECT d1.SalesTerritoryRegion ,COUNT(*) AS InternetOrderQuantity ,ROUND(SUM(f1.SalesAmount),2) AS InternetSalesAmount FROM [dbo].FactInternetSales f1 INNER JOIN dbo.DimSalesTerritory d1 ON f1.SalesTerritoryKey=d1.SalesTerritoryKey GROUP BY d1.SalesTerritoryRegion ORDER BY InternetSalesAmount DESC; GO
Here we see similar results. This aggregation query took 2.3 seconds and had 4604 logical reads against FactInternetSales.
Although the performance of those queries was good, they are far from perfect. All of those deleted rows are carried around with the table until the clustered Columnstore index is rebuilt. During that process, all of the “deleted” rows are deleted for real, any rows in the Deltastore are moved into the Columnstore, and the row groups within the Columnstore are recompressed to be as small as possible.
Let’s see what happens to our index after rebuilding it.
ALTER INDEX CluCSidx_FactInternetSales ON dbo.FactInternetSales REBUILD; GO
Looking at sys.column_store_row_groups confirms that we have completely removed all of the delete rows.
We can check the size of the table and see the storage space needed is considerably smaller; about 682MB smaller.
Now let’s rerun our two queries and see if have any better performance.
SELECT COUNT(*) FROM dbo.FactInternetSales; GO
Wow! With only a few hundred fewer logical IOs, SQL Server was able to return the result in only 424ms. Compared to our earlier run of over 12 seconds, that’s a huge improvement.
SELECT d1.SalesTerritoryRegion ,COUNT(*) AS InternetOrderQuantity ,ROUND(SUM(f1.SalesAmount),2) AS InternetSalesAmount FROM dbo.FactInternetSales f1 INNER JOIN dbo.DimSalesTerritory d1 ON f1.SalesTerritoryKey=d1.SalesTerritoryKey GROUP BY d1.SalesTerritoryRegion ORDER BY InternetSalesAmount DESC; GO
Here we see similar performance gains. We have a few hundred fewer logical reads, and our runtime is down to just 905ms.
Clustered Columnstore indexes for SQL Server 2014 are huge win for any application that make use of them, but as you can see it does come at a cost if you routinely delete lots of rows and fail to monitor the deleted_rows value in sys.column_store_row_groups.
Check out Books Online for further reading about clustered Columnstore indexes.