***Note – The products and recommendations have changed over the years, so what applied to previous versions does not really apply today. Make sure you read the entire article!
The SQL instances and databases deployed to support SCOM, generally fall into one of two categories:
1. The SQL server is managed by a DBA team within the company, and that teams standard will be applied.
2. The SCOM team fully owns and supports the SQL servers.
Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support. This often includes, but is not limited to:
- CHECKDB (to look for DB errors and report on them)
- UPDATE STATISTICS (to boost query performance)
- REINDEX (to rebuild the table indexes to boost performance)
SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.
On the other side of the coin.... in some companies, the SCOM team installs and owns the SQL server.... and they don't do ANY default maintenance to SQL. Because of this all too common scenario - a focus in SCOM was to have the Ops DB and Datawarehouse DB to be somewhat self-maintaining.... providing a good level of SQL performance whether or not any default maintenance was being done.
Daily jobs that run for the OpsDB:
- 12:00 AM – Partitioning and Grooming
- 2:00 AM – Discovery Data Grooming
- 2:30 AM – Optimize Indexes
- 4:00 AM – Alert auto-resolution
Reindexing is already taking place against the OperationsManager database for some of the tables (but not all, and this is important to understand!). This is built into the product. What we need to ensure - is that any default DBA maintenance tasks are not conflicting with our built-in maintenance, and our built-in schedules:
There is a rule in OpsMgr that is targeted at the All Management Servers Resource Pool:
The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM.
This rule cannot be changed or modified. Therefore - we need to ensure there is not other SQL maintenance (including backups) running at 2:30AM, or performance could be impacted.
If you want to view the built-in UPDATE STATISTICS and REINDEX jobs history - just run the following queries:
FROM DomainTable dt
inner join DomainTableIndexOptimizationHistory dti
on dt.domaintablerowID = dti.domaintableindexrowID
ORDER BY OptimizationStartDateTime DESC
FROM DomainTable dt
inner join DomainTableStatisticsUpdateHistory dti
on dt.domaintablerowID = dti.domaintablerowID
ORDER BY UpdateStartDateTime DESC
Take note of the update/optimization duration seconds column. This will show you how long your maintenance is typically running. In a healthy environment these should not take very long.
In general - we would like the "Scan density" to be high (Above 80%), and the "Logical Scan Fragmentation" to be low (below 30%). What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables. Especially tables like the raw perf, event, and localizedtext tables.
This brings us to the new perspectives in SCOM 2016, especially when used with SQL 2016.
In SQL 2016, some changes were made to optimize performance, especially when using new storage subsystems that leverage new disks like SSD. The net effect of these changes, on SCOM databases, is that they will consume much more space in the database, than when using SQL 2014 and previous. The reason for this is deeply technical, and I will cover this later. But what you need to understand as a SCOM owner, is that the sizing guidance will not match up with previous versions of SQL, compared to SQL 2016. This isn't a bad thing, you just need to make some minor changes to counteract this.
SCOM inserts performance and event data into the SCOM database via something called BULK INSERT. When we bulk insert the data, SCOM is designed to use a fairly small batch size by default. In SQL 2016, this creates lots of unused reserved space in the database, that does not get reused. If you review a large table query – you will observe this as “unused” space.
Note in the above graphic – the unused space is over 5 TIMES the space used by actual data!
If you want to read more about this – my colleague Dirk Brinkmann worked on discovering the root cause of this issue, and has a great deep dive on this:
The SQL server team also recently added a blog post describing the issue in depth:
Do not despair. In order to clean up the unused space, a simple Index Rebuild or at a minimum Index Reorganize for each table is all that is needed. HOWEVER – these perf tables are NOT indexed by default! This was likely done when SCOM was designed, because these are not static tables, they contain transient data in the OpsDB, that is only held for a short amount of time. The long term data is moved into the Data Warehouse DB, where it is aggregated into hourly and daily tables – and those are indexed via built in maintenance.
To resolve this, and likely improve performance of SCOM – I recommend that each SCOM customer set up SQL agent jobs, that handles Index maintenance for the entire OpsDB, once a day. I’d say given the other schedules, a start time between 3:00 AM and 6:00 AM would likely be a good time for this maintenance. That lets the built in maintenance run, and wont conflict with too much. You should try and avoid having anything running at 4:00 AM because of the Alert Auto Resolution. We don’t want any blocking going on for that activity.
There are other performance benefits to reindexing the entire database, as many new visualization tables have been added over time, and these don’t all get hit by our built in maintenance.
A great set of maintenance TSQL scripts for Agent Jobs plan can be found at https://ola.hallengren.com/
Specifically the index maintenance plan at https://ola.hallengren.com/sql-server-index-and-statistics-maintenance.html
This is a well thought out maintenance plan, that analyzes the tables, and chooses to Reindex or Reorganize based on fragmentation thresholds, skipping tables if not needed at all. The first time you index the entire DB, it may take a long time. Once you set this up to run daily, it will only be optimizing the daily perf and event tables for the most part, which will be a single table containing one days worth.
After a reindex – I have freed up a TON of space. Here is the same DB:
Notice the huge decrease in “unused space”. Additionally, the total space reserved in my perf tables is now consuming less than one fifth the amount of space in the database it was consuming previously. This leaves you with a smaller footprint, and better performance. I strongly recommend you set this up or check with your DBA’s to ensure it is happening.
Data Warehouse Database:
The data warehouse DB is also self-maintaining. This is called out by a rule "Standard Data Warehouse Data Set maintenance rule" which is targeted to the "Standard Data Set" object type. This stored procedure is called on the data warehouse every 60 seconds . It performs many, many tasks, of which Index optimization is but one.
This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.
To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:
from StandardDatasetOptimizationHistory sdoh
inner join StandardDatasetAggregationStorageIndex sdasi
on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
inner join StandardDatasetAggregationStorage sdas
on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
ORDER BY OptimizationStartDateTime DESC
In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed. When a table is 10% fragmented - we reorganize. When it is 30% or more, we rebuild the index.
Since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs. This means that if a DBA team also sets up a UPDATE STATISTICS or REINDEX job - it can conflict with our jobs and execute concurrently.
I will caveat the above statement with findings from the field. We have some new visualization tables and management type tables that do not get optimized, and this can lead to degraded performance. An example of that is http://www.theneverendingjourney.com/scom-2012-poor-performance-executing-sdk-microsoft_systemcenter_visualization_library_getaggregatedperformanceseries/ They found that running Update Statistics every hour was beneficial to reducing the CPU consumption of the warehouse. If you manage a very large SCOM environment, this might be worth investigating. I have seen many support cases which resulted in a manual run of Update Statistics to resolve an issue with performance.
For the above reasons, I would be careful with any maintenance jobs on the Data Warehouse DB, beyond a CHECKDB and a good backup schedule. UNLESS – you are going to analyze the data, determine which areas aren't getting index maintenance, or determine how out of date your statistics get. Then ensure any custom maintenance wont conflict with built-in maintenance.
Lastly - I'd like to discuss the recovery model of the SQL database. We default to "simple" for all our DB's. This should be left alone.... unless you have *very* specific reasons to change this. Some SQL teams automatically assume all databases should be set to "full" recovery model. This requires that they back up the transaction logs on a very regular basis, but give the added advantage of restoring up to the time of the last t-log backup. For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full. Also, changing to full will mean that your transaction logs will only checkpoint once a t-log backup is performed. What I have seen, is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough (or tlogs BIG enough) to keep them from filling. The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like SQL Always On, or log shipping, which requires full recovery model. When in doubt - keep it simple.
P.S.... The Operations Database needs 50% free space at all times. This is for growth, and for re-index operations to be successful. This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%.
For the Data warehouse.... we do not require the same 50% free space. This would be a tremendous requirement if we had a multiple-terabyte database!
Think of the data warehouse to have 2 stages... a "growth" stage (while it is adding data and not yet grooming much (haven't hit the default 400 days retention) and a "maturity stage" where agent count is steady, MP's are not changing, and the grooming is happening because we are at 400 days retention. During "growth" we need to watch and maintain free space, and monitor for available disk space. In "maturity" we only need enough free space to handle our index operations. when you start talking 1 Terabyte of data.... that means 500GB of free space, which is expensive, and. If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.
For transaction log sizing - we don't have any hard rules. A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment. For the Data warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% of the warehouse size or less. When we are doing any additional grooming of an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.
Summary: (or TL;DR; )
1. Set up a nightly Reindex job on your SCOM Operations Database for best performance and to reduce significant wasted space on disk.
2. You can do the same for the DW, but be prepared to put in the work to analyze the benefits if you do. Running a regular (multiple times a day) Update Statistics has proven helpful to some customers.
3. Keep your DB recovery model in SIMPLE mode, unless you are using AlwaysOn replication.
4. Ensure you pre-size your databases and logs so they are not always auto-growing, have plenty of free space as required to be supported.