Throughout my years working with MOM and Operations Manager 2007, I have periodically heard complaints that Operations Manager is not alerting on low disk space conditions, or that administrators are receiving false alerts. Just about every time I've been called upon for this type of issue, the cause turned out to be thresholds that were not adjusted properly, not that Operations Manager didn’t do its job correctly.
Before I get into this deeply, I want to stress the importance of having a good disk free space monitoring definition in place. I have seen so many companies struggle with disk free space monitoring, when they really don’t need to. The problem almost always starts with not having a good discussion around your free space requirements, defining the thresholds for server roles and types, and then executing on the design.
This is a basic requirement for monitoring operational health of every server role in your infrastructure. Whether we’re talking about file servers, database servers, web servers or application servers, it is a mistake to put this on the back-burner and not define your requirements as soon as possible for each server role.
Two types of monitoring
My standpoint from a disk space monitoring perspective is simple, and it is aligned with the intent and purpose of Operations Manager. It’s two-fold.
Reactive and Proactive
Although it may seem elementary, let me explain the difference between reactive and proactive monitoring, and how it relates to the Logical Disk Free Space Monitor.
There are two scenarios when it comes to state changes in monitors, and each of these can be paired up with either reactive or proactive type monitoring.
Two-State Monitor = Reactive Only
This monitor has only two states. Healthy is required for one of the states. The other state can be warning or critical. In my opinion, a two-state monitor almost always defines some type of reactive monitoring scenario. In other words, a component being monitored by a two-state monitor is either healthy, or an administrator needs to take immediate action in order to correct the problem. This is analogous to ON and OFF. There is no period of time in which the component is degraded but still functioning, and in which an administrator could take remediation actions to correct the issue before it worsens.
Three-State Monitor = Reactive and Proactive
This monitor has three states: Healthy, Warning and Critical. The rules are similar to the Two-State monitor, as far as Healthy and Critical states are concerned. However, there is an additional state that connotes a degraded condition. In a degraded condition, the service or component is still functioning, but there are problems on the horizon if the administrator doesn’t plan to take remediation actions at the earliest convenience.
With this additional Warning (or degraded) state, we add another type of monitoring to our operational monitoring: Proactive. Although this borders on both Reactive and Proactive, this is still very much proactive, in my opinion, because the administrator is informed of a degraded condition before it turns critical.
How does this relate to the Logical Disk Free Space monitor? Well, this is a Three-State monitor. Hence, we are provided with the best of both worlds from an operational standpoint. Both Proactive and Reactive.
Another part of Proactive monitoring is provided by the reporting feature in Operations Manager. This goes above and beyond the capabilities of having a monitor warn your staff of a degraded state. This arms you with the capability to perform trend analysis of your applications and hardware, allowing your company to use this information for planning and provisioning resources in your infrastructure.
I have been in my share of arguments around monitoring disk space, usually relating to general recommendations for the threshold types used in this monitor. One of the most heated arguments I’ve heard around these thresholds is that you should only use one type of threshold: either the MB threshold or the Percentage threshold. My argument has always been to use both these threshold types, and not to generalize an entire IT infrastructure based on a single threshold type.
I don’t see how anyone could account for the array of disk sizes and different types of server roles in the environment with a disk free space monitoring solution built on only one threshold type. In my opinion, doing so glosses over all the unique attributes that make up the infrastructure as a whole. All I ask is that you read this article before making a decision as to how you’re going to use this monitor.
I’ve done my time going through the ranks of systems administration. And this includes carrying a pager, and reacting to alerts from that pager, 24/7. This being the case, I know one thing for sure. And that is…
I do not want to be stirred out of a deep sleep, pulled away from my family or have my golf game interrupted, in order to check on an alert that was triggered, only to find there was plenty of free space on the server I was alerted on.
Sound familiar? I bet it does.
If you answer yes to any of the below questions, your reactive thresholds are not adjusted correctly.
1. At the earliest convenience, do you adjust the threshold for that instance? Or, just disable monitoring for that drive and be done with it (I have seen this done).
2. Do you have a routine down, and you know exactly when that alert will trigger, so you auto-respond to that alert without actually checking it? Or have you started ignoring alerts altogether?
3. Do you end up just checking on that server every day when you come in and when you leave, and see that it’s grown by 100MB each day, just waiting to bring it up in a meeting to allocate more drive space?
Whatever the case may be, you know that this drive is not in a critical state and there is no need to be alarmed yet. Growth of that particular disk has always averaged around 100MB a day, and you know the SAN group will not allocate more space until it’s down to 10GB free.
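To put a number on "no need to be alarmed yet", here is a small sketch of the arithmetic. The 100MB a day growth and the 10GB cutoff come from the scenario above; the 25GB starting point and the function name are my own assumptions for illustration only.

```python
def days_until(free_mb, threshold_mb, growth_mb_per_day):
    """Full days that free space stays above the threshold.

    Assumes constant daily growth, which is a simplification.
    """
    if growth_mb_per_day <= 0:
        raise ValueError("growth_mb_per_day must be positive")
    if free_mb <= threshold_mb:
        return 0
    return (free_mb - threshold_mb) // growth_mb_per_day


# Hypothetical starting point: 25GB free today (not a figure from the article).
free_today_mb = 25 * 1024
san_cutoff_mb = 10 * 1024  # the SAN group acts at 10GB free
print(days_until(free_today_mb, san_cutoff_mb, 100))
```

If the answer is measured in months, that drive belongs in a daily view, not on a pager.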
Make your case
To the on-call admin wearing the pager, listen up. I’m offering this argument to you, so you can then present your ideas to the operations monitoring group.
First thing you’ll want to do is download the Logical Disk Free Space Monitor Calculator (attached to bottom of article). Also grab this query, to help map out what your current disk sizes look like. A method I often use is to plug the largest disk size, the smallest disk size, and the average disk size into the calculator. Then start playing with the thresholds in the calculator to determine your unique threshold requirements for both System and Non-System drives.
First things first. How does the Logical Disk Free Space monitor work, when using both the MB and % threshold types? Here’s how.
The moment BOTH thresholds are exceeded, the state of that monitor will change.
Some basics of the monitor. This monitor is targeted to each type of Windows Server (2000, 2003 and 2008). Just keep that in mind when adjusting thresholds.
This is a double-threshold, three-state monitor. However, because there are two types of thresholds (MB and %), there are actually four thresholds that need to be set for this monitor.
Go ahead and open up the monitor properties and take a peek at the thresholds. To do this, go to the Authoring space.
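The rule described above can be sketched in a few lines of Python. This is only my reading of the behavior, not the management pack's actual script: the function name is made up, and I am assuming "exceeded" means free space has dropped strictly below the threshold value.

```python
def disk_state(size_mb, free_mb, warn_mb, warn_pct, crit_mb, crit_pct):
    """Return 'Healthy', 'Warning' or 'Critical' for one logical disk."""
    free_pct = 100.0 * free_mb / size_mb
    # A state is reached only when BOTH of its thresholds are exceeded.
    if free_mb < crit_mb and free_pct < crit_pct:
        return "Critical"
    if free_mb < warn_mb and free_pct < warn_pct:
        return "Warning"
    return "Healthy"


# A 40GB (40960MB) system drive, using the system drive values this
# article mentions: warning 200MB / 10%, critical 100MB / 5%.
for free in (3000, 150, 50):
    print(free, disk_state(40960, free, 200, 10, 100, 5))
```

Note that at 3000MB free the drive is already under 10%, yet it stays Healthy because the MB threshold has not been exceeded.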
Click on Monitors, then click Scope.
Type Logical Disk in the Look for input box, and check all three targets (one for each type). Then click OK.
If you expand each of the types, as shown in the image below for 2003 type, you’ll find the monitor. Do not confuse the Free Space monitor with the Availability monitor.
Open the properties of the monitor.
As you’ll see, these thresholds are also split into two types of drives: System and non-System. This may sound confusing, but it’s really quite simple and there is good reason for it. As you might expect, System type drives host the operating system. Non-System type drives are all other drives.
And here are the tabs showing the properties of the monitor.
The reason for the two types of drives is that drives hosting the operating system are usually well-defined, with specific volume sizes. These drives usually do not fluctuate in free space. And if they do, we monitor that. But the monitoring is generally much stricter, and will match as closely as possible a true warning or critical state for the operating system to function properly.
In other words, a System type drive with 500MB of free space is okay. This drive doesn’t need to generate an alert unless it drops below, for example, 200MB. That’s when we would actually do something to free up some space. That’s when we need to be paged. That truly warrants an alarm.
Out of the box, the System type drive thresholds are as follows: warning at 200MB and 10% free, critical at 100MB and 5% free.
Also by default, this monitor generates an alert only when it changes to critical. What this means to you is that when the drive hosting the operating system drops below 200MB, you’ll see a state change in the Operations Console, but no alert. This state will persist, allowing you to catch the warning state in the console before it reaches a critical state, or until someone moves some files off and creates more free space.
There is a state view specifically for monitoring Logical Disk free space in the Microsoft Windows Server node of the monitoring pane in the Operations Console. You can also create a view in My Workspace to spot check a specific set of servers for drives in a Warning state once each day. This is part of the proactive monitoring I mentioned.
So, when the drive hosting the operating system drops below 100MB, you’ll get a page and an alert in the Operations Console. Again, this is when action must be taken with urgency. Hence, critical or reactive.
Out of the box, the non-System type drive thresholds are as follows.
As for non-System type drives, these are usually the tricky thresholds that need to be discussed with your operations team. This is when you can put my disk space calculator to use.
I’m not going to get into semantics about all the different server roles and make recommendations for types of server roles. I’ll just note that the type of server is an important factor in determining disk space monitoring requirements. For instance, database servers will usually have different disk space monitoring thresholds than file servers.
I will, however, be using a file share server role in an example. This is only to get you thinking in the right direction, and is not intended to be a recommendation.
The company has 40 Windows Server 2003 File Share Servers. The majority of these servers have a 40GB system drive, hosting the operating system, with the exception of a handful of servers that were installed in 2003. At the time, the standard build was a 20GB system drive.
For the file shares, most later model servers have one 800GB volume. There are quite a few servers with two 300GB volumes. Then there are a few older model servers, which have two or four 80GB volumes.
The questions that need to be answered are:
What is a warning state?
This is the state in which your administrators need to be informed of a degraded situation. At this state of the monitor, there is time to take action to resolve the issue before it turns into a critical state. In other words, this is the proactive threshold.
What is a critical state?
This is the state in which your administrators need to be alerted of a critical situation. In this state, an alert will be raised in the Operations Console and a page will be sent to your on-call administrator. This state connotes an urgent issue, and action must be taken at once. In other words, this is the reactive threshold.
These questions need to be answered for both types of drives.
In your meeting with the operations monitoring team, these thresholds and states were discussed, and everyone agreed upon the following. Regardless of the size of the system drive, 20GB or 40GB, and considering that the operating system drive usually doesn’t fluctuate and that nobody should be storing data on those drives anyway, a warning should be raised when free space drops to 500MB.
This should give administrators adequate elbow room to proactively monitor for warning conditions and take remediation actions at the soonest opportunity.
Everyone also agreed that we only need an on-call admin to be paged if a drive hosting the operating system drops below 100MB. This is considered critical, as this will affect operating system performance and render it unresponsive soon, and we want someone paged to move files off that drive immediately.
Using the calculator, you determine that the thresholds for the system drive should be adjusted as follows.
Note that only a single threshold needed to be adjusted. The critical MB threshold, by default, meets our requirements. And both the warning and critical % thresholds, by default, meet our requirements. We need to create an override, for the file share servers, only for the warning MB threshold.
Here’s what it looks like in the calculator.
Remember, our decision was based on MB thresholds only. We did not even care about % free space.
The default % thresholds of 10% and 5%, for warning and critical, work out to far more free space than our defined 500MB and 100MB on drives of these sizes, so we don’t need to play with them. Technically, these % thresholds will be exceeded on our 40GB drives at 4GB and 2GB, for warning and critical, long before the MB thresholds come into play.
Remember that both MB and % need to be exceeded, in order for a state change to occur. So, again, we only need to create an override for the warning MB threshold. And that override setting is 500MB.
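If you want to double-check that the % thresholds really can be ignored for these system drives, here is a quick sketch. The function name is mine, and I am using round figures (1GB = 1000MB) to match the numbers above. The idea is that the MB threshold decides the state change on any drive larger than the MB value divided by the percentage.

```python
def min_size_for_mb_to_decide(threshold_mb, threshold_pct):
    """Drive size (MB) above which the MB threshold is the last one to be exceeded."""
    return threshold_mb * 100.0 / threshold_pct


warning_cutoff = min_size_for_mb_to_decide(500, 10)  # 500MB override, default 10%
critical_cutoff = min_size_for_mb_to_decide(100, 5)  # default 100MB, default 5%

for size_gb in (20, 40):
    size_mb = size_gb * 1000  # round figures, as in the 4GB / 2GB numbers above
    mb_decides = size_mb > warning_cutoff and size_mb > critical_cutoff
    print(f"{size_gb}GB system drive: MB thresholds decide = {mb_decides}")
```

Both the 20GB and 40GB builds are well past those cutoffs, which is why only the MB values mattered in our decision.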
Remember, most later model servers have one 800GB volume. There are a few with two 300GB volumes. Then there are a few older model servers, which have two or four 80GB volumes.
As I mentioned earlier, it is usually a bit trickier to find a good balance for these non-system drives. This is because there is a vast difference in volume sizes, and we’re trying to wrap our heads around a happy medium.
In the meeting with the operations monitoring team, we discussed only using the % threshold, and setting it at 10% and 5% for warning and critical, respectively. This didn’t go over very well because, again, we don’t want to wake our on-call admin up in the middle of the night because there was only 40GB left on a file share. That’s not exactly an urgent issue. Plus, we already know about that server and we’re expecting additional drive space to be allocated on Wednesday. We knew this because we saw the state change in the Operations Console when that volume dropped to 80GB two weeks ago.
We discussed only using the MB thresholds, adjusting them to 20GB and 4GB, for warning and critical, respectively. This didn’t go over well, because we really don’t want to wake the on-call admin again when one of the smaller 80GB drives drops to 4GB free space. These are not high volume drives, and when they are out of space we plan to move that data off to a larger volume anyway.
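To see both objections side by side, here is a short sketch that works out where each single-threshold proposal would change state on our three volume sizes. The volume sizes and threshold values are the ones from the discussion above; the helper name is just for illustration.

```python
VOLUMES_GB = (800, 300, 80)


def pct_to_gb(size_gb, pct):
    """Convert a % free threshold into GB free for a given volume size."""
    return size_gb * pct / 100.0


print("Percent only (10% warning, 5% critical):")
for size in VOLUMES_GB:
    warn_gb = pct_to_gb(size, 10)
    crit_gb = pct_to_gb(size, 5)
    print(f"  {size}GB volume: warning below {warn_gb:g}GB, critical below {crit_gb:g}GB")

print("MB only (20GB warning, 4GB critical):")
for size in VOLUMES_GB:
    # A fixed MB value ignores volume size, so show it as a share of the volume.
    warn_pct = 100 * 20 / size
    crit_pct = 100 * 4 / size
    print(f"  {size}GB volume: warning at {warn_pct:.1f}% free, critical at {crit_pct:.1f}% free")
```

Neither proposal treats the 800GB and 80GB volumes sensibly at the same time, which is the whole argument for using both threshold types.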
Rather than fumbling with these numbers, you break out the calculator, plug in the volume sizes (800, 300 and 80GB), and start plugging in some threshold values. After a few iterations, everyone liked the following thresholds.
Notice in the middle columns in the calculator, that the 800GB drive changes state for both warning and critical on only the MB threshold value. The 80GB drive changes state for both warning and critical on only the % threshold. The 300GB actually will use the % threshold value for the warning state change, and the MB threshold value for the critical state change.
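If you don't have the calculator handy, the same check can be sketched in Python. Please note that the threshold values below are HYPOTHETICAL. I picked them only because they reproduce the pattern just described, and they are not the values agreed on in this example. The function name is also made up.

```python
def deciding_threshold(size_gb, threshold_gb, threshold_pct):
    """Return (free space in GB where the state changes, threshold type that decides)."""
    pct_as_gb = size_gb * threshold_pct / 100.0
    # Both thresholds must be exceeded, so the one reached last
    # (the lower free-space value) decides the state change.
    if threshold_gb <= pct_as_gb:
        return threshold_gb, "MB"
    return pct_as_gb, "%"


# HYPOTHETICAL thresholds for illustration, not the article's values.
WARNING = (40, 10)  # 40GB and 10%
CRITICAL = (5, 2)   # 5GB and 2%

for size in (800, 300, 80):
    w_gb, w_by = deciding_threshold(size, *WARNING)
    c_gb, c_by = deciding_threshold(size, *CRITICAL)
    print(f"{size}GB: warning below {w_gb:g}GB ({w_by}), critical below {c_gb:g}GB ({c_by})")
```

With these made-up numbers, the 800GB volume is decided by MB for both states, the 80GB volume by % for both, and the 300GB volume by % for warning and MB for critical.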
This is a great balance for these file share servers. Each size volume has an adequate warning threshold, to allow plenty of time to proactively monitor these warning states and take action at the earliest convenience.
This also generates a critical state, subsequently generating an alert in the Operations Console and paging the on-call admin. These are all truly critical states, that require immediate action.
This meets all our requirements to escalate warning and critical states appropriately. And, most importantly, your on-call admin will appreciate that we have a good definition around monitoring disk space. Now he’s taking these pages seriously, and isn’t bothered for non-critical conditions.
Using Views for Proactive Monitoring
With well defined thresholds around disk free space monitoring, allowing for ample time to take action without urgency, we can use the Logical Disk state view in the Operations Console to proactively monitor free disk space. Checking this state view once per day will be a part of the daily routine.
You can find this state view here.
What we’re looking for here are servers in a warning state. If you have hundreds or thousands of servers, you can make this easier to look at by sorting on the State column header.
If you want a more targeted view, containing only file share servers in a warning state, you can create a new state view in My Workspace. Here’s an example of such a view.
So, not only are we monitoring for reactive conditions, we are also proactively monitoring disk space by means of establishing well defined thresholds for the Logical Disk Free Space monitor.
Again, as I mentioned earlier, another important piece of proactive monitoring is the report feature in Operations Manager. We can take proactive measures much further by using the reporting component. This will give us even richer information, like trend analysis for future planning and provisioning of resources.
I hope now you have a good understanding of how this monitor works. Along with the given example, and the free space calculator, you should now be armed and ready to tackle these disk free space alerts that have been so troubling for so many…especially for those on-call administrators.