How to measure Operations Manager database performance

In order to make any kind of performance tuning recommendation, it is first necessary to capture a baseline of normal operating conditions.  What I’d like to demonstrate in this post is how to measure the performance of your Operations Manager database disks.  The disks hosting the Operations Manager database must be able to comfortably sustain average read/write transactions, as well as handle the large bursts that occur frequently.  Like any OLTP database, the Operations Manager database sees a high volume of small, random read/write transactions.

The question that has come up with many customers is:

Is this configuration, in fact, enough to support my Management Group? Or, perhaps, is this over-built?

Before I spill these beans, I want to throw a disclaimer out there stating that I am not a hardware expert, and my expertise is not in SAN or RAID configuration.  I am by no means an authoritative subject matter expert in this area.  I just happen to know a little math and have experience collecting performance data and crunching the numbers.  Plus, I have some very smart colleagues (Michael Pearson from the Performance Test Team).  I also have to give credit to my wife for helping me with the RAID 5 calculations. :-)

By the way, I’m not doing rocket science here.  Anyone can come up with these formulas and run through a similar process.  I’m sharing this now because I haven’t found a single good resource on the subject.  I hope this post serves the community well.

The Beans

Here’s what I use to gauge the performance of an Operations Manager database server role.

Key

i = Disk IOPS
n = Number of disks in the array
%R = % Read IOPS
%W = % Write IOPS

Assumptions

10k Disks = 125 IOPS/disk
15k Disks = 175 IOPS/disk

IOPS Capacity for RAID 1 and RAID 10

(n/2)*i.  This is the maximum achievable IOPS for the array.

IOPS Capacity for RAID 5

(n*i)/(%R+4*%W).  The RAID 5 calculation factors in the measured read/write mix, so this is the maximum IOPS theoretically achievable for the array, based on sampled data.

Measured IOPS

(Avg. Disk Reads/sec) + (Avg. Disk Writes/sec)

% Read IOPS Calculation

(Avg. Disk Reads/sec) / (Measured IOPS)

% Write IOPS Calculation

(Avg. Disk Writes/sec) / (Measured IOPS)
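To make the formulas easier to play with, here is a minimal Python sketch of the same math.  The function and variable names are mine (not part of any Operations Manager tooling), and the per-disk ratings are simply the assumptions listed above.

```python
# Rough per-disk IOPS assumptions from the table above.
DISK_IOPS = {"10k": 125, "15k": 175}

def raid10_capacity(n, i):
    """Maximum achievable IOPS for a RAID 1 / RAID 10 array: (n/2) * i."""
    return (n / 2) * i

def raid5_capacity(n, i, pct_read, pct_write):
    """Maximum achievable IOPS for a RAID 5 array: (n * i) / (%R + 4 * %W).
    pct_read and pct_write are fractions of total IO that sum to 1.0."""
    return (n * i) / (pct_read + 4 * pct_write)

def measured_iops(avg_reads_per_sec, avg_writes_per_sec):
    """Measured IOPS: (Avg. Disk Reads/sec) + (Avg. Disk Writes/sec)."""
    return avg_reads_per_sec + avg_writes_per_sec

def read_write_pct(reads_per_sec, writes_per_sec):
    """Return (%Read, %Write) as fractions of the measured IOPS."""
    total = measured_iops(reads_per_sec, writes_per_sec)
    return reads_per_sec / total, writes_per_sec / total
```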

And here’s an example of how to use these formulas.  In this example, I am using the database server role hosting the OperationsManager DB.  The performance samples in this example are representative of what I might see in an average customer environment.

Scenario Configuration

Disk Configuration

E 6-disk RAID 10, 15k
F 8-disk RAID 10, 15k
G 2-disk RAID 1, 10k

SQL Configuration

E TempDB – Data and Log files
F OperationsManager – Data file, System Databases – data and log files
G OperationsManager – Log file

Expected IOPS

Given the disk configuration above, we can calculate the IOPS we expect each array to deliver.  Looking at the “assumptions” in the table above, we assume the disks comprising E and F are rated at 175 IOPS per disk, and the disks comprising G are rated at 125 IOPS per disk.  So, for each array, the math looks like this.

E (6/2)*175 = 525 IOPS
F (8/2)*175 = 700 IOPS
G (2/2)*125 = 125 IOPS
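Continuing the sketch above, the same expected-IOPS arithmetic for this scenario looks like this (the drive letters are just labels):

```python
expected = {
    "E": raid10_capacity(6, DISK_IOPS["15k"]),  # 525.0
    "F": raid10_capacity(8, DISK_IOPS["15k"]),  # 700.0
    "G": raid10_capacity(2, DISK_IOPS["10k"]),  # 125.0 (RAID 1 uses the same formula)
}
```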

Prelude to Measuring IOPS

When gauging performance, it’s always a good idea to calculate both average and maximum samples.  The reason is that intensive read/write applications need a disk subsystem sufficient to handle significant bursts of both Read and Write IO, and Operations Manager is no exception.  If your disk subsystem is not built to handle these bursts, it can become the origin of a bottleneck in overall Management Group performance.

A quick note about capacity planning and how it relates to estimating Read or Write IO bursts: these bursts are difficult to estimate without first knowing which type of IO a particular Management Group will produce more of.  Cumulative experience making these kinds of estimates during the capacity planning and design stage can be applied, and a fair recommendation can be made.  But, again, I’m not going into capacity planning in this post, so I won’t wander off too far.

However, the only point I want to make is that until the Management Group is built out, all agents and Management Packs are installed, and the Operations Console has been delivered to your operations staff, you cannot arrive at precise numbers for each type of IO (Read or Write) or a recommendation for the Read/Write cache ratio on your RAID controller.

This is why it is important to gauge the performance of your Management Group after everything is in place, to verify the measured IO meets the expected IO.  If things don’t match up, further configuration “tweaks” can be made.

Measuring IOPS

If we sample Disk Reads/sec and Disk Writes/sec for a period of time, we can calculate average IOPS as follows.  Remember that we’ll also be factoring Maximum into our review.

I recommend sampling these counters for each disk instance over a 24-hour period, with at most a 1-minute interval between samples.  Of course, the more samples you collect, the more accurate the calculations will be.  However, I don’t think it’s necessary to sample more often than once every 15 seconds.
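If you log these counters with Performance Monitor and export (or relog) them to CSV, a short script can crunch the averages and maximums for you.  This is only a sketch under that assumption; the file name and counter column headers below are hypothetical and will depend on your server name and how the log was exported.

```python
import csv

def summarize(csv_path, reads_col, writes_col):
    """Return (avg reads/sec, avg writes/sec, max reads/sec, max writes/sec)
    from a Performance Monitor CSV export."""
    reads, writes = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                reads.append(float(row[reads_col]))
                writes.append(float(row[writes_col]))
            except (KeyError, ValueError):
                continue  # skip blank or non-numeric samples
    return (sum(reads) / len(reads), sum(writes) / len(writes),
            max(reads), max(writes))

# Hypothetical usage for the E drive:
# avg_r, avg_w, max_r, max_w = summarize(
#     "opsmgr_db_disks.csv",
#     r"\\SQL01\LogicalDisk(E:)\Disk Reads/sec",
#     r"\\SQL01\LogicalDisk(E:)\Disk Writes/sec")
# print(avg_r + avg_w, max_r + max_w)  # average IOPS and "maximum" IOPS
```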

Average

This is the average IOPS you are currently getting out of your configuration.  Using the formulas in the table above, the math looks like this.

E 37.5+555.42 = 592.92 Average IOPS
F 3.31+15.14 = 18.45 Average IOPS
G 1.05+59.24 = 60.29 Average IOPS

Maximum

E 8.2+1126.95 = 1135.15 Maximum IOPS
F 207.77+120.27 = 328.04 Maximum IOPS
G 152.49+129.42 = 281.91 Maximum IOPS

NOTE: Maximum is calculated using the highest sample of each IO type during the sampling period, so the peak Reads and peak Writes may not have occurred at the same moment.  Treat this as the theoretical maximum IOPS demand observed on the array.

Looking at the average and maximum IOPS and comparing them to the expected IOPS in the previous section, we can see that we are well covered for the array hosting the OperationsManager data file.  In fact, we aren’t seeing IO bursts that even come close to the expected IO capabilities for the array.

The array hosting the OperationsManager log file appears to be busier, as far as a consistent average goes.  But it’s still well within reason when compared to the expected IO for the array.  However, there is reason for concern with the Write bursts occurring on that array.

A major concern here is the array hosting the TempDB.  The total measured IOPS on that array exceeds the expected IO for the array.  Given this, and the fact that the Write IO bursts are more than double the expected IO capacity for the array, the array hosting the TempDB needs some serious attention.  This is likely causing quite a bottleneck on the Operations Manager database server role.
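To automate that comparison, here is a small check that lines up each array’s measured numbers from this scenario against the expected capacities computed earlier (it reuses the expected dictionary from the sketch above):

```python
measured = {  # (average IOPS, maximum IOPS) from the sampling period
    "E": (592.92, 1135.15),
    "F": (18.45, 328.04),
    "G": (60.29, 281.91),
}

for drive, (avg, peak) in measured.items():
    cap = expected[drive]
    if avg > cap:
        status = "average demand exceeds capacity"
    elif peak > cap:
        status = "bursts exceed capacity"
    else:
        status = "OK"
    print(f"{drive}: avg {avg:.0f} / max {peak:.0f} vs. expected {cap:.0f} IOPS: {status}")
```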

I would certainly add additional spindles to the array hosting the TempDB.  If, for some reason, that isn’t possible due to budget constraints, there is another option available.

Since the array hosting the OperationsManager database is humming along without coming close to consuming the available IO afforded by that array, we could borrow 2 disks from that array and add those to the array hosting the TempDB.

With this configuration, the F drive will still have plenty of room to perform all of its IO, including the bursts we’ve observed.  And it will lend more performance to the E drive, where it is needed most.  Giving the E drive the additional 2 disks will allow TempDB to perform its average IO comfortably at ~592 IOPS.  The high Write bursts will still exceed the maximum achievable IO for the array, but this is much better than the current configuration.
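As a quick sanity check on the proposed shuffle, here is the same capacity math with the disks moved, again using the assumed per-disk ratings from earlier:

```python
# Proposed: move 2 disks from F (OperationsManager data) to E (TempDB).
new_e = raid10_capacity(8, DISK_IOPS["15k"])  # 700 IOPS vs. ~593 average / ~1135 burst
new_f = raid10_capacity(6, DISK_IOPS["15k"])  # 525 IOPS vs. ~18 average / ~328 burst
```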

Measuring Types of IO

From the numbers in our measured IOPS calculations, we can determine which type of IO transaction is happening more often.  This can help in determining what the Read/Write cache ratio on the controller should be adjusted to.  We find this by dividing the Read or Write IO by the total IO.  We’ll do this for both Average and Maximum.

Average

E 6% Read, 94% Write
F 18% Read, 82% Write
G 2% Read, 98% Write
Maximum

E 1% Read, 99% Write
F 63% Read, 37% Write
G 54% Read, 46% Write
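These percentages fall straight out of the read/write split of the measured samples.  For example, checking the F drive’s maximum row with the helper sketched earlier:

```python
# Maximum samples for F: 207.77 reads/sec and 120.27 writes/sec.
pct_read, pct_write = read_write_pct(207.77, 120.27)
print(round(pct_read * 100), round(pct_write * 100))  # 63 37
```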

Taking these percentages of IO types, we can determine a feasible Read/Write cache ratio for the controller.  As you can see, most of the IO is Write.  The only exception is the Read IO bursts on the array hosting the OperationsManager database, which are likely due to a high number of Operations Console sessions navigating to many different views.

Because console sessions submit nothing but Read IO to the OperationsManager database, I would surmise that this particular Management Group uses the Operations Console quite heavily.