Keeping Track of Database Capacity -- Monitoring and Planning

As a DBA at MSCOM, I'm often asked how long our servers have before they run out of disk space. This is also one of the toughest questions we get. Any estimate depends on a good forecast of user traffic and data volumes, and even on database internals like when index trees will need to "branch out". Oh, and you have to do this for each table! I don't know about you, but I barely know what I'm having for dinner tonight, let alone how much web traffic MSCOM will have next month or next quarter.

Yet it's critically important to know this, because it impacts every application's availability and scalability. Plus there are generally long lead times for buying and deploying new hardware. So if we suddenly realize we are running out of space on any one of hundreds of servers, it can be a real problem.

Before I describe our solution, let me also mention capacity monitoring. Those of you who were around in the "prehistoric" days of SQL Server 6.5 will remember that all the data (including indexes) for a database was stored in one logical "data device". It was important to have a "database capacity" alert to monitor the fullness of this device, because if it hit 100 percent you got an 1105 error and everything pretty much stopped. That and a "log capacity" alert were all that was needed. You just added a new file, or expanded an existing one, right up until the drives were full.

With SQL Server 7.0, 2000, and 2005, we now have to deal with filegroups. That 1105 error now gets thrown when any filegroup fills up. Of course, we want to use autogrowth whenever possible. But my point is, isn't it filegroup capacity that matters now, and not database capacity?

At MSCOM, like any other large data center, we have thousands of databases and tens of thousands of filegroups -- any one of which can cause an application outage if it runs out of disk space. Each filegroup stores a different set of tables and indexes, all growing or shrinking at different rates. That's a lot of moving parts. You can imagine keeping track of even a relatively small number of them being a full-time job.

Well, none of the DBAs here wanted a full-time job like that. So every day we schedule an insert of all of our filegroup sizes into a centralized data mart. The following script is an example of the code you might use to gather that information:

-- Per-file allocation and usage for the current database, plus a
-- grand-total row (prefixed 'z. All' so ORDER BY 1 sorts it last).
SELECT
    RTRIM(fg.groupname) AS FileGroupName,
    RTRIM(f.name) AS 'FileName',
    RTRIM(f.filename) AS PhysicalFile,
    CONVERT(numeric(19,5), f.size*8/1024) AS 'Allocated MB',
    CONVERT(numeric(19,5),
        CONVERT(numeric(19,5), FILEPROPERTY(f.name, 'spaceused'))*8/1024) AS 'Used MB',
    100*CONVERT(numeric(19,5),
        (f.size - FILEPROPERTY(f.name, 'spaceused'))
            / CONVERT(numeric(19,5), f.size)) AS 'Percent Free'
FROM sysfiles f, sysfilegroups fg
WHERE f.groupid = fg.groupid
UNION
SELECT
    'z. All', 'All', 'All',
    SUM(CONVERT(numeric(19,5), f.size*8/1024)) AS AllocatedMB,
    SUM(CONVERT(numeric(19,5),
        CONVERT(numeric(19,5), FILEPROPERTY(f.name, 'spaceused'))*8/1024)) AS UsedMB,
    100*(1 - SUM(CONVERT(numeric(19,5),
            CONVERT(numeric(19,5), FILEPROPERTY(f.name, 'spaceused'))*8/1024))
        / SUM(CONVERT(numeric(19,5), f.size*8/1024)))
FROM sysfiles f, sysfilegroups fg
WHERE f.groupid = fg.groupid
ORDER BY 1
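
To get those sizes into the mart, each server runs a small scheduled job step along these lines. What follows is a minimal sketch, not our production schema; the table name (dbo.FilegroupSize) and its columns are assumptions made up for the example:

-- A minimal sketch of the mart table. The name dbo.FilegroupSize and
-- its columns are illustrative only.
CREATE TABLE dbo.FilegroupSize (
    ServerName    sysname       NOT NULL,
    DatabaseName  sysname       NOT NULL,
    FileGroupName sysname       NOT NULL,
    AllocatedMB   numeric(19,5) NOT NULL,
    UsedMB        numeric(19,5) NOT NULL,
    CaptureDate   smalldatetime NOT NULL
)

-- Daily capture, run in each database. Storing a date-only value means
-- snapshots taken on different days join cleanly later.
INSERT dbo.FilegroupSize
SELECT
    @@SERVERNAME,
    DB_NAME(),
    RTRIM(fg.groupname),
    SUM(CONVERT(numeric(19,5), f.size*8/1024)),
    SUM(CONVERT(numeric(19,5), FILEPROPERTY(f.name, 'spaceused')*8/1024)),
    CONVERT(char(8), GETDATE(), 112)
FROM sysfiles f, sysfilegroups fg
WHERE f.groupid = fg.groupid
GROUP BY fg.groupname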

Having the data stored in a table allows us to do some interesting things. We can easily find the fullest filegroups, just by sorting on the "used" space column. But the most useful data comes from comparing today's filegroup sizes with those measured 5 days ago. When we divide that difference by 5, we get an average daily growth rate. Divide the filegroup's current remaining free space by that rate, and we get the average number of days before the filegroup runs out of space.

Operationally, all we have to do is sort on this "days remaining" column, and the filegroups running out of space soonest jump to the top of the list -- meaning we can proactively grow them manually or ensure that they can autogrow. Instead of agonizing over thousands of dynamically growing pieces, we focus only on the few becoming critical, or the few with suspiciously high growth rates.
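
Expressed against the mart table sketched above (names still hypothetical), the arithmetic looks something like this: join today's snapshot to the one from 5 days ago, compute the average daily growth, and divide the free space by it.

-- Average daily growth over the last 5 days, and the estimated days
-- until each filegroup fills its currently allocated space. NULLIF
-- avoids divide-by-zero for filegroups that aren't growing; shrinking
-- filegroups show up with negative values.
SELECT
    t.ServerName,
    t.DatabaseName,
    t.FileGroupName,
    (t.UsedMB - p.UsedMB) / 5.0 AS AvgDailyGrowthMB,
    (t.AllocatedMB - t.UsedMB)
        / NULLIF((t.UsedMB - p.UsedMB) / 5.0, 0) AS DaysRemaining
FROM dbo.FilegroupSize t
JOIN dbo.FilegroupSize p
    ON  p.ServerName    = t.ServerName
    AND p.DatabaseName  = t.DatabaseName
    AND p.FileGroupName = t.FileGroupName
    AND p.CaptureDate   = DATEADD(day, -5, t.CaptureDate)
WHERE t.CaptureDate = CONVERT(char(8), GETDATE(), 112)
ORDER BY DaysRemaining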

How does this help us with our original question of how to predict when a server will run out of space? We compute a server-level "days remaining" metric by adding up that server's filegroup growth rates and dividing the server's remaining free drive space (which we gather by running xp_fixeddrives each day and inserting into the mart) by that total. So we can answer WHEN we will hit bottom. Let's say "Server-X" has 90 "days remaining". Knowing that gives me plenty of time to order new hardware, or to get developers thinking seriously about purge processes. Or let's say someone asks why our data is outgrowing our brand new, expensive server so fast. We can look at the filegroup growth rates and drill down to the tables and indexes causing the behavior in question.
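
Here is a sketch of that server-level roll-up. It assumes the per-filegroup growth rates from the previous query have been materialized into a table (called dbo.FilegroupGrowth here) and that the xp_fixeddrives output lands in a dbo.DriveSpace table; all of these names are again hypothetical:

-- xp_fixeddrives returns one row per fixed drive: the drive letter and
-- its free space in MB. Capture it with INSERT...EXEC, then stamp it
-- with the server name and date.
CREATE TABLE #drives (Drive char(1), FreeMB int)
INSERT #drives EXEC master.dbo.xp_fixeddrives

INSERT dbo.DriveSpace (ServerName, Drive, FreeMB, CaptureDate)
SELECT @@SERVERNAME, Drive, FreeMB, CONVERT(char(8), GETDATE(), 112)
FROM #drives

-- Server-level "days remaining": total free MB on the server's drives
-- divided by the sum of its filegroup daily growth rates.
SELECT
    d.ServerName,
    d.TotalFreeMB / NULLIF(g.TotalGrowthMB, 0) AS ServerDaysRemaining
FROM
    (SELECT ServerName, SUM(FreeMB) AS TotalFreeMB
     FROM dbo.DriveSpace
     WHERE CaptureDate = CONVERT(char(8), GETDATE(), 112)
     GROUP BY ServerName) d
JOIN
    (SELECT ServerName, SUM(AvgDailyGrowthMB) AS TotalGrowthMB
     FROM dbo.FilegroupGrowth
     GROUP BY ServerName) g
    ON g.ServerName = d.ServerName
ORDER BY ServerDaysRemaining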

I hope I've shown you how useful it can be to gather and analyze file growth data. Depending on your business needs, it can be simple or complex. I'd like to end by asking your opinion: does monitoring overall database capacity matter anymore, or is filegroup and server capacity all we really need to care about?