Here are some notes on "SQL Server 2008 Statistics" I took while attending an advanced class on SQL Server taught by Kimberly Tripp (http://sqlskills.com/AboutKimberlyLTripp.asp).
Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without the right pre-requisite knowledge and the right context. Reading the post from top to bottom will help.
Understanding the job of the cost estimator
- Table Scan (always an option)
- Useful Indexes (look at WHERE, JOIN, GROUP BY, ORDER BY, covering)
- Other options?
- Statistics = density information about a table kept by SQL
- Help SQL "estimate" how many rows a certain query will return
- Stored as a BLOB, but you can query the data using "DBCC SHOW_STATISTICS"
Not always right
- Statistics cannot be right for every case, or else it would be actual data
- By definition, it is not all the data, it's a sample.
- How frequently do you update? Depends on how the data is changing.
- Statistically, certain types of data are consistent over time.
- Unless that are some atypical events related to your data.
Could be statistically correct after update
- Look at the city column for an employee table for a company based in Seattle
- Think of the frequency for Seattle, Redmond (suburb) and Spokane (hours away)
- This would probably not change much (statistically) quickly
- High number of rows with low selectivity - this is harder.
Looking at stats
- Things to look in your statistics (DBCC SHOW_STATISTICS):
- How old are your statistics?
- How many rows sampled in the statistics (out of the total rows)?
- How many steps in the statistics (up to 201 steps in SQL Server 2008)?
- Steps = number of rows in your histogram.
- What is the density of that key?
- Density * # of Rows = average rows returned for a given value
- You also get the average length of the column.
- For every step: actual value (high for step), total rows, count of unique rows.
- Data could be sampled (not based on every row, but just a subset).
- Generating statistics is usually not so expensive, it's done with sampling
- The query optimizer will actually create statistics on the fly, as queries come in.
- Future queries will benefit from having created that.
- If you don't have the stats, SQL will not be sure of the usefulness of an index.
- AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are ON by default.
Stats with no index?
- Statistics do make sense in cases when you don't have indexes on a column
- For instance, for helping with deciding how to do sorts
- For instance, in a JOIN, to figure out which side has a smaller set of rows.
Let SQL do it
- Let SQL Server do its thing with stats
- You may start with a query taking a long time, than it improves "by itself"
- This could be the results of statistic being autogenerated and helping you
- Careful: On Read-only database, SQL cannot create statistics
- You can also pre-create the statistics
- You could pre-create for every column of every table, but you probably shouldn't
- It probably helps generate them for every column of your non-clustered indexes
- You can do that using "sp_createstats 'indexonly', 'fullscan'"
Leave it on
- Leave autocreate/autoupdate on. It's fine in most cases
- For most cases, you better off leaving them on
- If you know what you're doing, you can manage it manually
- You can still leave it on, since it will do only what's missing
- In that case, use STATISTICS_NORECOMPUTE
- Having one specific "problem child" database with scheduled updates with full scans.
- Details at http://www.sqlskills.com/BLOGS/KIMBERLY/post/Auto-update-statistics-and-auto-create-statistics-should-you-leave-them-on-andor-turn-them-on.aspx
Stats get old
- After they are created, statistics will get old over time, as data gets updated
- You can use sp_updatestats to query
- You can also query sys.stats and sys.indexes
- You can use DBCC SHOW_STATISTICS (with HISTOGRAM?)
- Statistics are automatically updated if AUTO_UPDATE_STATISTICS are ON (default)
- They get updated when about 20% of the data changes
- In SQL 7, tracks row mod, immediately updates
- In SQL 2000, tracks row mod, updates when needed
- In SQL 2005, tracks column mod, updates when needed
- In SQL 2008, tracks column mod, updates when needed. New: Filtered stats.
- Even if stats need updating, you can use old stats if new ones are not ready
- To do that, use ALTER DATABASE dbname AUTO_UPDATE_STATISTICS_ASYNC ON
- Use with caution
- Sampling is usually OK, but could create issues if data is not evenly distributed
- You might miss something significant if your sampling does not give you enough
- You can disable autoupdate and update statistics manually using a full scan
- To do that "UPDATE STATISTICS... WITH FULL SCAN".
Very large sets
- For very large data sets, even a full scan (no sampling) might not be enough.
- Think a few billion sales records, looking at statistics for customer_number.
- In that case, you have 1000 large customers and 1 million small ones.
- With only 201 steps, you just can't get good statistics for both large and small.
- For those, consider partitioning into multiple tables to get better stats.
- In SQL 2008, consider filtered statistics (for instance, WHERE customer_type=Large).
- Automatically created for filtered indexes. Can be created manually.
- Statistics will only help
- However, it is always better to have the right indexes
- As usual, you can't optimize for every single query
- Pick your battles.
For details, see "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005" at: