The performance team had a thread discussing IOPS requirements with a customer who had some really good questions, so I thought I would share the thread with you all. For sizing the Operations Manager Database and Data Warehouse, I would highly recommend trying out the SCCP planning tool, which will give you a lot of guidance for sizing your databases. Below are some questions we answered for the customer on IOPS requirements, which I think may be useful to you. Also, one of our MVPs wrote a great article on IOPS requirements for OpsMgr based on research he had done, which can be found here: http://wchomak.spaces.live.com/blog/cns!F56EFE25599555EC!610.entry
1) Does Ops Mgr 2007 have large sprocs and / or queries that are sensitive to timely completion? Do these sprocs and / or queries run on very large data sets (if so, what’s the potential range of sizes), and do they branch a lot or do other things that might generate a lot of random disk I/O?
The DB is an integral part of OpsMgr and part of the end-to-end operations; in that sense, there is no performance impact against OpsMgr due to sprocs and queries. For larger environments like this, one of the largest database queries will happen during a configuration change (importing an MP, making an override on a common MP, etc). The amount of IO needed to perform this query will depend on the number of instances you are monitoring. That being said, this is not a very frequent operation, unlike, say, pulling the Active Alerts View in the Operations Console.
2) Do these sprocs and / or queries generally tailor better to small-block random disk I/Os (4 – 8KB), medium block random disk I/Os (16 – 32KB), or large random disk I/Os (64+KB)? Do any of the sprocs tailor better to sequential disk I/O?
SQL Server’s most basic IO happens in 8K pages. You should see 8K random and sequential IOs, depending on the level of fragmentation of your tables and indexes. OpsMgr has daily tasks to defrag and reindex tables and indexes as needed. For more information, take a look at this document from the SQL Server 2005 Books Online: http://msdn.microsoft.com/en-us/library/ms190969.aspx One advantage of OpsMgr is that we have separated the Operational and Reporting databases. This allows the “long / expensive” queries to run outside of the Operational data, so that running reports doesn’t impact data coming in (unless the databases are on the same SQL server and physical disks).
3) Can you provide additional details about why Ops Mgr 2007 scalability is limited to 6,000 agents? We need to understand what limits Ops Mgr to 6,000 agents, whether it is OS, database, networking, threading, memory, or other limitation?
We recommend no more than 50 UI Consoles and 6,000 Agents in a single MG, because going beyond that point may cause a system bottleneck in the RMS/DB.
4) What is the system bottleneck in the RMS/DB that you refer to? Is it based upon the DB software, lock contention, a disk I/O shortage, something else?
The bottlenecks tend to be Memory and CPU on the RMS for the three OpsMgr services running there (mainly ConfigService), and Database IO on both the OpsDB and OpsDW. Depending on the number of Agents, Consoles, and MPs installed, these bottlenecks may be a bit more or less severe, but they are the main bottlenecks for overall system scalability.
5) What I/O size (4K, 8K, 16K, etc?) must be sustained at 125 IO ops?
Since the sheer number of IOs has a much bigger impact than the size of the IOs, the size is less important. This number (125 random IOs/sec) is a rule of thumb we use; faster disks with smaller write sizes may exceed it, but only marginally.
6) How was this requirement for 125 IO ops derived (please be as specific as possible)?
This is based on a typical 10K-15K RPM SCSI disk, with completely random reading/writing. (Sequential IOs and even random/sequential IO mixes will be faster.) If it is not clear already, this number gets doubled (250 IOps) when you have a 4-disk RAID 10 array since you get two disks (and their respective mirrors) working in parallel.
7) Does this rule hold for every type of RAID (1, 5, 1+0, etc)?
For RAID 0, it gets multiplied by the number of disks you have in your array. For RAID 10, you’d multiply it by half the number of disks you have in your array (since half of your disks are used for mirroring). For RAID 1, you’d get no improvement in performance (RAID 1 only provides added redundancy, with no performance benefit), so the same rule would hold true here. RAID 5 (which we don’t highly recommend, by the way) also gets some benefit, but mostly for disk reads and less so for writes.
8) When considering a 14 drive 1+0 array, does it assume 14×125 IO ops, or 7×125 IO ops, or does it assume 14×125 for reads and 7×125 for writes?
Since RAID 1+0 (RAID 10) only uses half of the disks for performance gain (the other half are for redundancy), you would see a 7×125 IOps factor for both random reads and random writes. Trayce J informed me that for reads, some controllers and firmware allow the reads to take place across all the disks in the RAID 1+0 array and thus give you potentially 14×125. Other controllers only give you the 7×125.
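The arithmetic in answers 6 through 8 can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a sizing tool: the function name `estimate_random_iops` and the `reads_use_mirrors` flag are made up for this example, the 125 IOs/sec per-disk figure is the rule of thumb quoted above, and RAID 5 is left out because no multiplier was given for it.

```python
# Rule-of-thumb figure from the answers above: ~125 random IOs/sec per disk.
IOPS_PER_DISK = 125


def estimate_random_iops(raid_level, disks, reads_use_mirrors=False):
    """Return an estimated (read_iops, write_iops) pair for an array.

    raid_level        -- "0", "1" or "10" (RAID 5 is not modelled here)
    disks             -- total number of physical disks in the array
    reads_use_mirrors -- hypothetical flag: True if the controller can
                         serve RAID 10 reads from all disks, mirrors included
    """
    if raid_level == "0":
        # Striping only: every disk contributes.
        read_factor = write_factor = disks
    elif raid_level == "1":
        # Mirroring only: treated as no gain over a single disk.
        read_factor = write_factor = 1
    elif raid_level == "10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Half of the disks are mirrors, so only half count for writes.
        write_factor = disks // 2
        read_factor = disks if reads_use_mirrors else disks // 2
    else:
        raise ValueError("RAID level not covered by this sketch: " + raid_level)
    return read_factor * IOPS_PER_DISK, write_factor * IOPS_PER_DISK


print(estimate_random_iops("10", 4))                            # (250, 250)
print(estimate_random_iops("10", 14))                           # (875, 875)
print(estimate_random_iops("10", 14, reads_use_mirrors=True))   # (1750, 875)
```

The 4-disk case reproduces the 250 IOps figure from answer 6, and the two 14-disk cases correspond to the 7×125 and 14×125 read scenarios described in answer 8.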
9) Please provide detail on how the estimate of typical support for 2,000 agents/server was derived. If the volume of operational data were to remain the same, would the number of management packs influence the number of agents/server? What’s the relationship between operational data volume and use of mgmt server resources?
An increased number of management packs puts a greater load on the RMS, but not the MS’s. So, this would not change the number of agents/MS, but may affect the number of agents that can be monitored in the deployment, due to potential bottlenecks on the RMS.
Higher operational data volume requires more management server resources.
Number of Management Packs, and more importantly the number of discovered instances impacts the Memory and CPU usage on the RMS.