Incremental statistics … How to Update Statistics on 100TB Database


I really like the idea of Incremental Statistics and what it can offer customers, specifically around reduced maintenance times and I’ve been trying to find a real life example of how a customer has used this feature to help them manage super huge databases.

I was fortunate enough that one of our Brazilian colleagues has had such experience and blogged it on their own Microsoft blog.  Here is the original post:  https://blogs.msdn.microsoft.com/pfebrasilsql/2016/03/18/estatisticas-incrementais-ou-como-atualizar-estatisticas-em-base-de-100tb-vldb/.

As it’s in Portuguese I thought I'd take the opportunity to have it translated to English and posted up on our blog to highlight just how powerful incremental statistics can be in helping customers manage their huge databases!

Thanks to Sérgio Fonseca (PFE in Brazil) for this excellent post and also thanks to Alexandre Mendeiros (PFE in Portugal) for helping with the translation!

 

I am here to comment a case where the statistics were not updated frequently in a 100TB database on a risk analysis I made ​​in a client server. So it was suggested the use of incremental statistics (SQL 2014 or higher).

According to the documentation itself ( https://msdn.microsoft.com/en-us/library/ms190397(v=sql.120).aspx )

When new partitions are added to a large table, statistics should be updated to include the new partitions. However the time required to scan the entire table (FULLSCAN or SAMPLE option) might be quite long. Also, scanning the entire table isn't necessary because only the statistics on the new partitions might be needed. The incremental option creates and stores statistics on a per partition basis, and when updated, only refreshes statistics on those partitions that need new statistics.

So, I do not need to update using FULLSCAN in a giant table every day, you can, for example, update only the last two partitions.

We will see below how to apply this in your environment.

The first thing to do is to change the object definitions, so that new statistics are created as incremental.

ALTER DATABASE SANDBOX
SET AUTO_CREATE_STATISTICS ON ( INCREMENTAL = ON )

After that, for all partitioned tables where you want to make this change, you need to update the complete statistics in the object statistics.

Although this step is too heavy it will pay off later.

UPDATE STATISTICS [PartitionTable] ( [PK__Partitio__357D0D3ED0CC792E] ) WITH FULLSCAN , INCREMENTAL = ON

To ease a bit, I created some scripts to help in this process.

One of them just to apply in a table, another one to pick up all partitioned tables in a database.

——————————————————————————————————————————

–GENERATE INCREMENTAL UPDATE STATISTICS FOR 1 TABLE

SELECT 'UPDATE STATISTICS ' + QUOTENAME(object_name(object_id)) + ' (' + QUOTENAME(name) + ') WITH FULLSCAN, INCREMENTAL = ON'
FROM sys.stats
WHERE object_id = object_id('[dbo].[PartitionTable]')
AND is_incremental = 0

——————————————————————————————————————————

–GENERATE INCREMENTAL UPDATE STATISTICS FOR ALL PARTITIONED TABLES

SELECT 'UPDATE STATISTICS ' + QUOTENAME(object_name(S.object_id)) + ' (' + QUOTENAME(s.name) + ') WITH FULLSCAN, INCREMENTAL = ON'
FROM sys.tables T
INNER JOIN sys.indexes I
ON T.object_id = I.object_id
INNER JOIN sys.data_spaces DS
ON I.data_space_id = DS.data_space_id
INNER JOIN sys.stats S
ON I.object_id = S.object_id
WHERE DS.type = 'PS'
AND S.is_incremental = 0

After this first slower step, you just need to plan how your partitions will be updated.

In my scenario partitioning was by year-month and older partitions are rarely updated.

Then, we will update in a daily basis only the two last partitions.

This script will generate 3 types of update:

  1. UPDATE for non-partitioned tables.
  2. UPDATE for partitioned tables, but no incremental statistics
  3. UPDATE for partitioned tables with incremental statistics (last two partitions)

DECLARE @RESAMPLE BIT = 1
DECLARE @PERCENT_SAMPLE INT = 100 -- IF @RESAMPLE = 0 SET @PERCENT_SAMPLE
DECLARE @PROCESS_LAST_X_NONEMPTY_PARTITIONS INT = 2

---------------------------------------------------------

SELECT
SQL_COMMAND = 'UPDATE STATISTICS ' + QUOTENAME(T.Name) + ' (' + QUOTENAME(S.name) + ') '
+ IIF (@RESAMPLE = 1,
'WITH RESAMPLE',
'WITH SAMPLE ' + CONVERT(VARCHAR(3), @PERCENT_SAMPLE) + ' PERCENT')
+ CHAR(10)
,INFO = ' — NON PARTITIONED TABLES'
FROM sys.tables T
INNER JOIN sys.indexes I
ON T.object_id = I.object_id
INNER JOIN sys.data_spaces DS
ON I.data_space_id = DS.data_space_id
INNER JOIN sys.stats S
ON I.object_id = S.object_id
WHERE DS.type = 'FG' -- ROWS_FILEGROUP — NON PARTITION TABLE
AND S.stats_id IS NOT NULL -- FILTER TABLES WITH NO STATISTICS

---------------------------------------------------------

SELECT
SQL_COMMAND = 'UPDATE STATISTICS ' + QUOTENAME(T.Name) + ' (' + QUOTENAME(S.name) + ') '
+ IIF (@RESAMPLE = 1,
'WITH RESAMPLE',
'WITH SAMPLE ' + CONVERT(VARCHAR(3), @PERCENT_SAMPLE) + ' PERCENT')
+ CHAR(10)
,INFO = ' — PARTITIONED TABLES NON INCREMENTAL'
FROM sys.tables T
INNER JOIN sys.indexes I
ON T.object_id = I.object_id
INNER JOIN sys.data_spaces DS
ON I.data_space_id = DS.data_space_id
INNER JOIN sys.stats S
ON I.object_id = S.object_id
WHERE DS.type = 'PS' -- PARTITION_SCHEME — PARTITION TABLE
AND S.is_incremental = 0

---------------------------------------------------------

IF (OBJECT_ID ('tempdb..#TEMP_LAST2PARTITIONS') IS NOT NULL)
DROP TABLE #TEMP_LAST2PARTITIONS
;WITH AUX_LAST2PARTITIONS AS
(
SELECT
T.object_id
,TableName = T.Name
,I.index_id
,IX_Name = COALESCE(I.Name,'[HEAP]')
,P.partition_number
,P.rows
,i.data_space_id
,ROW_NUMBER = ROW_NUMBER() OVER ( PARTITION BY T.object_id, I.index_id ORDER BY P.partition_number DESC)
FROM sys.tables T
INNER JOIN sys.indexes I
ON T.object_id = I.object_id
INNER JOIN sys.data_spaces DS
ON I.data_space_id = DS.data_space_id
INNER JOIN sys.partitions P
ON I.object_id = P.object_id
AND I.index_id = P.index_id
WHERE DS.type = 'PS' -- PARTITION_SCHEME — PARTITION TABLE
AND P.rows > 0 -- filter empty partitions
)
SELECT
* INTO #TEMP_LAST2PARTITIONS
FROM AUX_LAST2PARTITIONS
WHERE [ROW_NUMBER] <= @PROCESS_LAST_X_NONEMPTY_PARTITIONS

CREATE CLUSTERED INDEX #IX_TEMP_LAST2PARTITIONS ON #TEMP_LAST2PARTITIONS (object_id, index_id, partition_number)

;
WITH AUX AS
(
SELECT
AUX.object_id
,AUX.TableName
,AUX.index_id
,AUX.IX_Name
,StatsName = S.name
,AUX.partition_number
,AUX.rows
,PARTITION_VALUE = ISNULL(CAST(left_prv.value AS VARCHAR(MAX))+ CASE WHEN pf.boundary_value_on_right = 0 THEN ' > '
ELSE ' >= '
END , '-INF > ')
+ 'X' + ISNULL(CASE WHEN pf.boundary_value_on_right = 0 THEN ' >= '
ELSE ' > '
END + CAST(right_prv.value AS NVARCHAR(MAX)), ' > INF')
FROM #TEMP_LAST2PARTITIONS AUX
INNER JOIN sys.stats S
ON aux.object_id = S.object_id
LEFT JOIN sys.partition_schemes ps
ON aux.data_space_id = ps.data_space_id
LEFT JOIN sys.partition_functions pf
ON ps.function_id = pf.function_id
LEFT JOIN sys.partition_range_values left_prv
ON left_prv.function_id = ps.function_id
AND left_prv.boundary_id + 1 = aux.partition_number
LEFT JOIN sys.partition_range_values right_prv
ON right_prv.function_id = ps.function_id
AND right_prv.boundary_id = aux.partition_number
WHERE S.is_incremental = 1
)
SELECT SQL_COMMAND = 'UPDATE STATISTICS ' + QUOTENAME(TableName) + ' (' + QUOTENAME(StatsName) + ') WITH RESAMPLE ON PARTITIONS (' + CONVERT(VARCHAR(20), partition_number)
+ ')' + CHAR(10)
,INFO = ' — PARTITIONED TABLES INCREMENTAL LAST (' + CONVERT(VARCHAR, @PROCESS_LAST_X_NONEMPTY_PARTITIONS) + ') PARTITIONS ON (' + PARTITION_VALUE + ')'
FROM AUX
ORDER BY TableName, IX_Name, StatsName, partition_number desc

 

Below is an example of the commands:

clip_image001

Conclusion

Remember that this was an example where I just wanted to update the last 2 partitions, but each case is a case.

Just pick up these commands into a variable and schedule it to run with [ sp_executesql ]

In my scenario for one of the tables with 9 TB data + 18TB indexes (117 partitions) there was a reduction of time from ~ 3 days (Update statistics FULLSCAN applied to all statistics of the table) to ~ 1 hour (Update statistics with FULLSCAN for all statistics in the 2 last partitions of the table).

clip_image002

Comments (5)

  1. Thanks for translating the article, it is very interesting !

  2. Moiz Abdullah says:

    Thanks for the detailed post.
    Just wanted to point out that there is a major bug associated with incremental statistics that readers should be aware of.
    If an index is being rebuild for even a single partition WITH (ONLINE = ON ), they would lose the statistics for the entire index.

    https://connect.microsoft.com/SQLServer/feedback/details/1691231/statistics-disappear-after-online-index-rebuild-with-incremental-statistics

    1. Chris Lound says:

      Great callout there Moiz – thanks!

      If customers are experiencing this issue we would recommend contacting the support group to work on the relevant fix.

  3. Hello,

    Good job.
    Updating your 9To table data et 18 To indexes should use huge space in TempDb ?
    And what about memory granted, depending on number of rows in each partition ?

    In fact, we have a real problem with incremental statistics => we can’t fix the SAMPLING.
    Details are here :
    https://connect.microsoft.com/SQLServer/feedback/details/3081264

    We have month partition table.
    When we start with FULLSCAN no problem => not so much data.

    But… when the partition contains more than 10 billions rows you have to deal with huge memory granted and TempDB is growing abnormally.

    As Sql server engine fix the sampling (even if you have WITH SAMPLE XX PERCENT) and it’s impossible to FIX sampling rate BY partition (we can with compression) it’s difficult to deal with a global strategy when your data table start growing.

    As you said in your article, updating statistics is really fast when FULLSCAN.
    FULLSCAN is the only mode that applied MAXDOP to your statistics update with the level of your maxdop defined on your server.
    In all other case, MAXDOP=1 is applied which is quite slow…

    So, in order to improve incremental statistics to real world scenario we have :

    https://connect.microsoft.com/SQLServer/feedback/details/3081264
    and
    https://connect.microsoft.com/SQLServer/feedback/details/628971/add-maxdop-parameter-to-update-stats

    Thx Chris for your article.

  4. Neil P says:

    This is brilliant. Do you know when this will be supported in Azure SQL DWH? This would be a huge benefit to large databases > 1Tb.

Skip to main content