Understanding SQL Server Cardinality Estimations by Charting Histogram to Actual Values


Summary: Tony O'Grady, a Microsoft Premier Field Engineer based in the UK, helps us understand whether or not a SQL dataset is going to produce useful query statistics by looking for patterns in histogram values in comparison to actual values. He introduces SQL Server histograms and explains how SQL Server divides the histogram in ranges to statistically represent values from the underlying table column data. He also presents a simple scenario using a one column select statement, whereby he charts the actual values to their statistical representation, and shows how when these diverge it can lead to inaccurate cardinality estimates. This is the first in his series of 5 advanced posts on charting statistical data. Enjoy!


SQL Server LogoThe purpose of this post is to give an overview of how patterns in underlying data effect statistical histograms in SQL Server. By charting examples of histogram values against actual values we can look for patterns with the goal of making it clearer for an observer to determine if the dataset is going to produce useful query statistics.

This is the first in a series of 5 advanced posts on statistics that will look at charting statistical data under different scenarios. These posts are not intended to be an introduction to statistics and assume a base of knowledge of SQL Server statistics administration.

Why do we require statistics?

When you submit a query the SQL Server optimiser decides how to execute it. Even for a simple query there are a number of potential plans for the optimiser to choose from. SQL Server uses a cost based model to determine the operators to use in a query plan. The cost is based in statistical information

Where is statistical information stored?

Statistical information is stored in sysindexes and other internal tables accessible through sys.stats and sys.indexes. Histogram information is stored in a blob (statblob) and is accessible through DBCC SHOW_STATISTICS.

Sample Dataset:

Throughout this post I will use the example of SalesOderDetail_Big table. This table is derived from the SalesOrderDetail table in the AdventureWorks 2012 sample database. I increased the table to 10 times its original size:

 SELECT * FROM [AdventureWorks2012].[dbo].SalesOrderDetail_Big

SalesOderDetail_Big table

Figure 1

In Figure 1 above, running a select query against SalesOrderDetail_Big enables us to view the data for each column.

In this our first scenario we are going to look at a couple of simple SELECT statements against the SalesOrderDetailID column.

 SELECT [SalesOrderDetailID] FROM [dbo].[SalesOrderDetail_Big] WHERE [SalesOrderDetailID] = <value>

What are data patterns?

For this scenario, when referring to data patterns what we are primarily looking at are the number of distinct values and the number of rows for each of these values. This is for both single column and combinations of columns and is fundamental in determining cardinality when deriving query plans.

For example, if we look at the SalesOrderDetailID column we can use TSQL statements to determine the pattern in the data.

 SELECT [SalesOrderDetailID] 
 FROM [dbo].[SalesOrderDetail_Big] 

The SalesOrderDetailID column has 1213170 rows with values ranging from ranging from 1 to 121317

 SELECT COUNT(DISTINCT 
 [SalesOrderDetailID]) FROM [dbo].[SalesOrderDetail_Big]

The SalesOrderDetailID column has 121317 distinct values

         
 SELECT [SalesOrderDetailID], COUNT([SalesOrderDetailID])FROM [dbo].[SalesOrderDetail_Big]GROUP BY [SalesOrderDetailID]        

In our example each value in the SalesOrderDetailID column occurs 10 times (Range Rows)

If we chart the number of rows for each column value in Figure 2 we can see that for each column value there are 10 rows.

Chart of the number of rows for each column

Figure 2

Why do we need a histogram?

SQL Server makes use of a histogram because it would be so expensive to compute the required values each time a query is run that it would make the optimisation redundant. The histogram presents a statistical representation of column information to the optimiser. The histogram summarises values into ranges. Each range will contain specific information for the value of the range called the RANGE_HI_KEY as well as summary data for the values within each range.

What do the values in a histogram represent?

We can retrieve an example of histogram by running the following query:

 DBCC SHOW_STATISTICS ([dbo.SalesOrderDetail_Big],[uststs_SalesOrderDetailID] ) WITH HISTOGRAM

statistical representation of the data in the SalesOrderDetailID column

Figure 3

This histogram in Figure 3 is the statistical representation of the data in the SalesOrderDetailID column for our sample table. To explain what the relevant columns in the histogram mean we are going to look at our example queries. The first query selects the SalesOrderDetailID column where it equals the histogram RANGE_HI_KEY value of 121317

 SELECT [SalesOrderDetailID] FROM [dbo].[SalesOrderDetail_Big] WHERE [SalesOrderDetailID] =121317

When the sample query WHERE clause value is equal to a histogram RANGE_HI_KEY value, SQL Server will use the EQ_ROWS column in the histogram to determine the number of rows that are equal to 121317;  for our example above in figure 3 this is equal to 10 rows.

What about when a value do not equal as a RANGE_HI_KEY?

 SELECT [SalesOrderDetailID] FROM [dbo].[SalesOrderDetail_Big] WHERE [SalesOrderDetailID] =121316

To estimate the number of rows when our WHERE clause value is 121316 which is not represented by a RANGE_HI_KEY, SQL Server will use the AVG_RANAGE_ROWS column in the histogram, which in this case is 10.  This is because each step in a histogram represents a range of rows. This range covers the values equal or less than the RANGE_HI_KEY value and greater than the range RANGE_HI_KEY value of the previous step in the histogram. Taking our example, values from [2 – 121317] are represented by the histogram row where the RANGE_HIGH_KEY is 121317 and its AVG_RANAGE_ROWS value is 10.

Graphing the histogram values against actual number of rows:

In Figure 3 we take the graph in Figure 2 and expand it by plotting the relevant information from the histogram. We can see that 121317 values are represented by just 4 data points for this example.

Expanded Histogram

Figure 4

Because there is no divergence between histogram values and actual values, our queries do not suffer from cardinality estimation errors. These can be observed by looking at a query plan’s Actual vs. Estimated number of rows as shown in figure 5 below.

looking at a query plan’s Actual vs. Estimated number of rows

Figure 5

What happens when the histogram does not accurately represent the actual values?

The histogram can be inaccurate for a number of reasons, an example of which we will look at over the series of posts to follow. For the purposes of this post, to achieve an inaccurate histogram for my sample dataset I am going to update the statistics with a default sample:

 UPDATE STATISTICS [SalesOrderDetail_Big]

Once this completes we can check the number of rows sampled using

 DBCC SHOW_STATISTICS ([dbo.SalesOrderDetail_Big],[uststs_SalesOrderDetailID] ) WITH STAT_HEADER

Results of DBCC SHOW_STATISTICS ([dbo.SalesOrderDetail_Big],[uststs_SalesOrderDetailID] ) WITH STAT_HEADER

Figure 6

Looking at Figure 6, SQL Server has sampled 107010 rows which is roughly just over 10 percent of the total rows for the column

In the case of our example, looking at the histogram generated in Figure 7 we can see that it differs significantly from the histogram generated from a FULLSCAN in Figure 3.

 DBCC SHOW_STATISTICS ([dbo.SalesOrderDetail_Big],[uststs_SalesOrderDetailID] ) WITH HISTOGRAM

Figure 7

Figure 7

For editorial purposes, Figure 7 does not display all the rows of the histogram.

We can go ahead and plot all the rows for the histogram against the actual table column values. As shown in Figure 8 we now have a divergence between the actual values and the histogram representation of these values. This gives us a big picture view. We can use this to zone in to specific range values in the histogram

plot all the rows for the histogram against the actual table column values

Figure 8

For example, looking at Figure 8 and isolating one of the values 4754 we can quickly observe that the histogram values are not the same as the actual values.

Figure 9

Figure 9

How does this translate to a query?

Looking at value 4754 in Figure 9 the EQ_ROWS column for RANGE_HI_KEY 4754 has a value of 1. Running a query with a WHERE clause of 4754 we can see from the query plan in figure 10 that the actual and estimated number of rows are different and now we understand where the values are derived from.

 SELECT [SalesOrderDetailID] from [dbo].[SalesOrderDetail_Big] where [SalesOrderDetailID] =4754

Figure 10

Figure 10

Taking another value 4753 from the same range (RANGE_HI_KEY covers a range of values from 3766 – 4754 which if we look at Figure 7 and Figure 9 can be determined) we know that SQL Server will use the AVG_RANGE_ROWS value 20.04935 to estimate the number of rows.

From the query plan in Figure 11 we can see that the Actual number of rows is 10 and the estimate is 20.0494

 SELECT [SalesOrderDetailID] from [dbo].[SalesOrderDetail_Big] where [SalesOrderDetailID] =4753

Figure 11

Figure 11

Summary

In this first post we introduced the SQL Server histogram and explained how SQL Server divides the histogram in ranges to statistically represent values from the underlying table column data. We went on to look a simple scenario of a one column select statement charting the actual values to their statistical representation and showed how when these diverge they can lead to inaccurate cardinality estimates which we observe in query plans.

In part 2 of this series we will take a look at violate data and the effects this has on the accuracy of the histogram.


Written by Tony O’Grady; Posted by Frank Battiston, MSPFE Editor