I will be progressively moving these posts over from my previous blog from when I worked for Microsoft IT. The information is somewhat dated, but I didn’t want to see the content disappear completely :). You can think of this as a “snapshot in time” supplement to the Implementing System Center Operations Manager 2007 at Microsoft white paper.
If you’re reading this series of posts to help make procurement decisions and want even more detail, I suggest checking out the following resources:
Following is the average scale of one of Microsoft IT’s OpsMgr 2007 management groups, and should be used as a point of reference for any performance figures provided in this series of posts:
· Agents: 3500
· Alerts per day: 3500
· Events per day: 453,000
· Perf Samples per day: 16.3 million
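To put these daily totals in perspective, it can help to convert them to per-second and per-agent rates. The following is a rough Python sketch using only the figures listed above. The names are illustrative, and it assumes the load is spread evenly across the day, which real monitoring traffic rarely is.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily averages quoted above for one management group.
AGENTS = 3_500
DAILY_VOLUME = {
    "alerts": 3_500,
    "events": 453_000,
    "perf_samples": 16_300_000,
}


def per_second(count_per_day):
    """Average rate per second, assuming an even spread over 24 hours."""
    return count_per_day / SECONDS_PER_DAY


def per_agent_per_day(count_per_day, agents=AGENTS):
    """Average daily volume contributed by a single agent."""
    return count_per_day / agents


for name, count in DAILY_VOLUME.items():
    print(f"{name}: {per_second(count):.2f}/sec, "
          f"{per_agent_per_day(count):.1f} per agent per day")
```

Under that even-spread assumption, the performance samples work out to roughly 189 inserts per second, which is the figure to keep in mind when reading the disk numbers later in this post.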
Before Microsoft IT purchased any hardware for OpsMgr 2007, they took a week-long performance baseline of their MOM 2005 deployments. Based on numbers from the MOM 2005 SP1 database systems, they found that CPU utilization was fairly low (19% on average), as was memory pressure (an average of 65 pages per second). The area where resource utilization was relatively high was the disk level. Following were the average data transfer rates:
· Drive holding DB data file: 0.8 MB/sec with an average of 30 transfers/sec
· Drive holding DB log file: 0.6 MB/sec with an average of 162 transfers/sec
· Drive holding TempDB files: 0.9 MB/sec with an average of 20 transfers/sec
So when it came time to make a purchasing decision for OpsMgr 2007 hardware, MSIT stuck fairly close to the same platform configuration used for their MOM 2005 OpsDB systems, working under the assumption that the newer hardware in the current SKUs of the day would proportionately handle whatever increased load OpsMgr 2007 would bring. Following is what they ultimately used; where changes were made, they are noted:
· Server Model: HP ProLiant DL385 G1
· Processors: 2 x dual core (4 procs in the OS’s eyes) 2.2 GHz AMD Opteron Processors
· RAM: 8 GB – The MOM 2005 servers only had 4GB of RAM, but with the 64-bit OS supporting RAM above 4GB, the increase was merited.
· Drives: 3 SAN drives for hosting the SQL data, log and TempDB files respectively.
o SQL Data drive: 130GB RAID 0+1 – In MOM 2005 this was a RAID5 30GB drive.
o SQL Log and TempDB drives: 20GB RAID 1 – In MOM 2005 these were RAID5 8GB drives.
· OS: Windows Server 2003 Enterprise x64 Edition with SP1
With the platform listed above, CPU utilization has been averaging 25% and memory paging has averaged 1.4 pages/sec. Throughput on the SQL data drive has a bit more than doubled, while throughput on the SQL log drive has been cut in half. Utilization of the TempDB drive has remained fairly flat.
o Drive holding DB data file: 1.98 MB/sec with an average of 151 transfers/sec
o Drive holding DB log file: 0.3 MB/sec with an average of 72 transfers/sec
o Drive holding TempDB files: 1.07 MB/sec with an average of 20 transfers/sec
Moving beyond the hardware itself, two of the most significant changes made in IT’s deployment designs for the OpsDB were implementing Clustering and SQL Log Shipping.
Over the lifespan of Microsoft IT’s MOM 2005 deployment, they found that in their environment, achieving 99.9% availability for the entire infrastructure for a single month was quite difficult. One of the major contributing factors was that the OpsDB was a single point of failure, so every minute it spent offline counted against the availability of the overall infrastructure. Given that experience, and the fact that customer requirements for availability of monitoring were only getting more stringent, the IT monitoring team decided to implement clustering for high availability of the OpsDB. Now work such as patching servers or repairing or upgrading hardware can be performed on a single node of the cluster at a time while maintaining the availability of the DB itself. Following are a couple of configuration side-notes that relate to how MSIT configured clustering:
· The cluster model used is “single quorum device server clusters” and the quorum resource is stored on a 2GB shared drive.
· The MSDTC resource that is required for the clustered installation of SQL server is located in the same resource group as the quorum drive.
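To see why a 99.9% monthly target is hard to hit with a single-server OpsDB, it helps to work out how much downtime that target actually allows. The following Python sketch does the arithmetic. It assumes availability is measured against every calendar minute of the month, which is an assumption on my part since this post does not describe how MSIT measured it.

```python
def downtime_budget_minutes(availability_pct, days_in_month=30):
    """Minutes of downtime allowed in a month at the given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes per 30-day month")
```

At 99.9% the budget is about 43 minutes for a 30-day month, which a single patch-and-reboot cycle on a standalone database server can consume by itself.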
So while clustering provides high availability, it does not necessarily solve the problem of redundancy during a disaster. With MOM 2005, MSIT had implemented a derivative of the Service Continuity solution accelerator. While this solution worked well, it had the significant drawback of needing to “ship” a complete copy of the OpsDB at least once a day. Even with compressed backups, this resulted in moving 8GB per management group every 6 hours. So when designing the OpsMgr 2007 deployment, IT needed something better for geo-redundancy of the DB. The solution was SQL log shipping. This configuration has allowed for geo-redundancy, but it is worth noting that it comes with the additional considerations of running the DB in the Full recovery model and needing to maintain DB and transaction log backups. Following are some relevant configuration side-notes about the setup of log shipping:
· The SQL cluster and the failover SQL server have a 30GB shared drive dedicated to storing SQL data file and log backups.
· A full DB backup is performed for the OpsDB data file every 24 hours.
· Log Shipping is configured to back up and ship the log files every 15 minutes.
· Log file backups are retained for 2 days on the source DB server and for 3 days on the destination server.
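The schedule above implies some simple numbers that are useful when sizing the backup drive and comparing against the old full-copy approach. The Python sketch below derives them from the figures in this post. The names are illustrative, and it only counts files and daily volume; the size of each log backup is not given here, so it is left out.

```python
from datetime import timedelta

# Schedule and retention values quoted in this post.
LOG_BACKUP_INTERVAL = timedelta(minutes=15)
SOURCE_RETENTION = timedelta(days=2)
DESTINATION_RETENTION = timedelta(days=3)


def retained_log_backups(retention, interval=LOG_BACKUP_INTERVAL):
    """Number of log backup files on disk once the retention window is full."""
    return retention // interval  # timedelta // timedelta yields an int


def full_copy_gb_per_day(gb_per_copy, hours_between_copies):
    """Daily volume moved when shipping complete copies of the database."""
    return gb_per_copy * 24 / hours_between_copies


print("log backups kept on source:", retained_log_backups(SOURCE_RETENTION))
print("log backups kept on destination:", retained_log_backups(DESTINATION_RETENTION))
print("MOM 2005 full-copy volume (GB/day):", full_copy_gb_per_day(8, 6))
```

With a 15-minute interval, the copy on the destination server can lag the source by roughly one interval plus the time needed to copy and restore each log backup.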
Additional SQL Customizations
As was mentioned above, Microsoft IT made some additional customizations, specifically around implementing the Full recovery model and Log Shipping for the OpsDB. Following are some additional customizations worth mentioning.
Trace Flag 1118
This has been found to be rather unique to IT’s deployment, but based on their experiences with beta/RC versions of OpsMgr 2007, they did achieve some performance improvements for the OpsDB by running SQL with trace flag 1118 enabled. Trace flag 1118 makes SQL Server allocate full (uniform) extents instead of mixed extents, which reduces contention on tempdb allocation pages; it is typically combined with multiple equally sized tempdb data files, as in the steps below. The following steps were taken to configure these optimizations:
Figure out how many logical processors the SQL server has and keep this number.
Open the properties of the TempDB and go to “Files”
Expand the existing data file to 1.5 GB and disable the autogrow settings
Add more data files to match the number of logical processors in the system (if there are 4 procs, then there need to be 4 TempDB data files). Make sure that all of the data files are the same size (1.5 GB) with autogrow disabled, and located in the same place as the TempDB’s default data file.
Apply the changes
Open “SQL Server Configuration Manager”
Click on “SQL Server 2005 Services” and in the right hand pane right-click on the SQL server instance for which you just added the TempDB files and select “Properties” from the context menu.
Switch to the “Advanced” tab
Scroll down to “startup parameters” and at the beginning of the value box (it’s very important to not put it at the end) add the following text: “-T1118;”
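The steps above are performed through the SQL Server management tools, but the same changes can be expressed as T-SQL and as a startup-parameter string. The Python sketch below only generates that text for review; it does not connect to or change a server. It assumes the default tempdb logical file name `tempdev`, and the directory, the added file names (`tempdev2`, `tempdev3`, ...) and the sample startup string are hypothetical placeholders rather than MSIT’s actual values.

```python
TEMPDB_FILE_SIZE_MB = 1536  # 1.5 GB, as in the steps above


def tempdb_statements(logical_procs, data_dir, primary_name="tempdev"):
    """Build T-SQL that gives tempdb one fixed-size data file per logical processor."""
    # Resize the existing data file and disable autogrow (FILEGROWTH = 0).
    statements = [
        f"ALTER DATABASE tempdb MODIFY FILE "
        f"(NAME = {primary_name}, SIZE = {TEMPDB_FILE_SIZE_MB}MB, FILEGROWTH = 0);"
    ]
    # Add the remaining files so the total matches the processor count.
    for n in range(2, logical_procs + 1):
        name = f"{primary_name}{n}"  # hypothetical naming scheme
        statements.append(
            f"ALTER DATABASE tempdb ADD FILE "
            f"(NAME = {name}, FILENAME = '{data_dir}\\{name}.ndf', "
            f"SIZE = {TEMPDB_FILE_SIZE_MB}MB, FILEGROWTH = 0);"
        )
    return statements


def add_trace_flag(startup_parameters, flag="-T1118"):
    """Put the trace flag at the beginning of the startup parameters, once."""
    if flag in startup_parameters.split(";"):
        return startup_parameters
    return f"{flag};{startup_parameters}"


if __name__ == "__main__":
    # Hypothetical tempdb location; use wherever the default data file lives.
    for statement in tempdb_statements(4, r"T:\tempdb"):
        print(statement)
    # Hypothetical existing value of the "startup parameters" box.
    print(add_trace_flag(r"-dC:\data\master.mdf;-eC:\log\ERRORLOG;-lC:\data\mastlog.ldf"))
```

Generating the statements first makes it easy to confirm that every file has the same size and that autogrow is off before anything is applied. As with the manual steps, the trace flag takes effect only after the SQL Server service is restarted.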