So You Want to Roll Out SCOM: Decisions that you should make before you click Install.

Disclaimer: Due to changes in the MSFT corporate blogging policy, I’m moving all of my content to the following location. Please reference all future content from that location. Thanks.

I’ve been on a process kick lately, in large part because most of the issues that I encounter in SCOM environments are related not to the technology but to the processes that surround it. With that in mind, I decided to put together a primer of all of the things that need to be done in advance, or at the very least considered in advance.

The Technical:

Kevin Holman has a great quick start guide to setting up SCOM.  That really covers most of the technical side of the deployment, though there are a few key things you may want to consider before you start installing:

  • SQL Server location:  Are you going to keep the SQL server local to the SCOM server?  Do you have an enterprise cluster?  How are you going to carve out the DBs?  A default SCOM install uses two databases, though you can add more with reporting as well as ACS.  From a best practice standpoint, you may want to consider carving out separate volumes for the DB, DB log file, DW, DW log file, as well as the temp DB.  SCOM is very disk intensive, and isolating these five sets of files on their own volumes will help with performance.
  • Account naming convention, and whether you want to create all of them:  Here’s what is required.  Kevin mentions the accounts in his quick start guide as well.  Oh, and for the sake of all things security, DO NOT use a domain admin account for any of these accounts.  Also, don’t forget to configure your SPNs.
  • Sizing:  I get asked a lot how big to make the environment.  The answer is rather vague: it depends.  Microsoft provides a nice sizing guide that can help answer those questions.  The size of your environment really depends on what you want to monitor, as well as how much availability you want.
  • Backup strategy:  Do you want to back up just the databases?  That’s the traditional method, though I strongly recommend having a restore procedure in place and validated if that’s the plan.  I’d note that if space is at a premium, you may want to consider backing up only your unsealed customizations.  That takes up a lot less space (at the cost of a total loss of historical data in a disaster), and there are easy ways to do this, either by management pack or by script.
  • Data Retention:  The SCOM DB doesn’t keep data all that long, though in a larger environment, you may want to reduce those settings. In a smaller environment, you may want to consider increasing them.  Data warehouse retention is a bigger deal, as it is not configurable via the SCOM console.  The DW can also get rather large, particularly the hourly state and performance data, which has a default retention of 400 days.  This can lead to a very large DW and a very angry storage administrator who wants to know why you need all that space.  I personally recommend keeping the daily aggregations for 365 days and the hourly aggregations for about 120 days.  That really is an organizational decision, but one that should be made early on.
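
To put rough numbers on the retention decision above, here is a minimal sketch that compares row counts for the hourly aggregations at 400 days versus 120 days.  The 50,000 series figure and the `aggregate_rows` helper are assumptions made up for illustration; it counts rows only, not bytes, and it does not model how SCOM actually grooms the DW.

```python
def aggregate_rows(series_count: int, retention_days: int, rows_per_day: int) -> int:
    """Rows retained for one aggregation level (a row count, not a size on disk)."""
    return series_count * retention_days * rows_per_day


# Hypothetical environment: number of distinct performance/state series collected.
SERIES = 50_000

hourly_default = aggregate_rows(SERIES, 400, 24)  # hourly data kept for 400 days
hourly_trimmed = aggregate_rows(SERIES, 120, 24)  # hourly data kept for 120 days
daily_year = aggregate_rows(SERIES, 365, 1)       # daily data kept for 365 days

# Fraction of hourly rows saved by moving from 400 days to 120 days.
reduction = 1 - hourly_trimmed / hourly_default

print(f"hourly @ 400 days: {hourly_default:,} rows")
print(f"hourly @ 120 days: {hourly_trimmed:,} rows")
print(f"daily  @ 365 days: {daily_year:,} rows")
print(f"hourly rows saved: {reduction:.0%}")
```

Whatever the series count turns out to be in your environment, the ratio is the same: 120 days of hourly data is 30% of 400 days, so the hourly tables hold roughly 70% fewer rows.  That is the number to bring to the conversation with your storage administrator.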

The quasi-technical:

This is still somewhat technical, but there are procedural considerations to be had here.

  • Naming convention for SCOM customizations as well as custom management packs:  I strongly recommend using some sort of organizational name to lead all of your customizations.  The reason is that six months from now, you likely won’t remember what you named that custom monitor.  Save yourself the time and make sure you have some sort of consistent naming convention, if for no other reason than to allow for an easier search.
  • What do you want to monitor:  Let’s start with the obvious: SCOM is a framework that can monitor a lot of things.  Microsoft makes management packs for most (if not all) of their products.  There are a lot of 3rd party MPs out there as well (note, not all are free).  Not all are that good.  Oh, and most importantly, don’t make the mistake that I and many others have made of rolling all of them out at once.  You’ll want to tune each MP, so roll them out progressively (preferably in a QA environment first) so as to identify noise before you roll them out into production and end up with an angry monitoring team.
  • Alert Management:  I’ve written a three-part piece on this subject.  The first part is here (and it links to the other two).  Suffice it to say, most organizations don’t really sit down and think about how they plan on responding to alerts.  The end result is that the organization has purchased a monitoring tool, but it does not monitor.
  • Which systems to monitor:  Are you going to throw your development systems into your production SCOM environment?  Do you really only care about a few core systems?  The bottom line is that SCOM is going to tell you lots of things about your environment.  It’s great at detecting bad IT hygiene, and it doesn’t know which items are by design and which are not.  If you want to actually have a good process for responding to alerts, then you probably want to sit down and decide what systems are important to alert on.  If you throw everything into one environment, you are going to make it very difficult for those who are supposed to do the monitoring.
  • What processes need changing:  This goes back to alert management, but the bottom line is that plenty of organizational processes will need to change to account for SCOM. A short list includes the maintenance process, decommissioning of servers, commissioning of servers, and responding to alerts.
  • Who needs access:  Contrary to a lot of systems, most of your IT staff really need to be only operators.  Their job is to close alerts, reset state, and view dashboards and reports.  You probably don’t want to give them the rights to start customizing your environment.
  • Custom Views:  This involves meeting with the various teams in your organization, but you’re going to want to get them using SCOM.  This means that they should likely have a scoped role so as not to be exposed to items that they don’t need to see.  It may also involve creating custom dashboards for them.  There are a lot of really cool things you can do with dashboards.  Here’s one to start.
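
The naming convention point in the list above is easy to enforce with a small check.  What follows is only a sketch under stated assumptions: the `Contoso` prefix, the dot-separated `<Prefix>.<Area>.<Description>` shape, and the helper names are all hypothetical, and the list of names would come from wherever you keep an inventory of your custom MPs, monitors, and rules.

```python
import re

# Hypothetical organizational prefix that leads every customization.
ORG_PREFIX = "Contoso"

# Assumed convention: the prefix followed by one or more dot-separated
# alphanumeric segments, with no spaces (e.g. Contoso.SQL.Overrides).
NAME_PATTERN = re.compile(rf"{re.escape(ORG_PREFIX)}(\.[A-Za-z0-9]+)+")


def follows_convention(name: str) -> bool:
    """True when the whole name matches the assumed convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_violations(names):
    """Return the names that do not follow the convention, in input order."""
    return [name for name in names if not follows_convention(name)]


if __name__ == "__main__":
    inventory = [
        "Contoso.SQL.Overrides",
        "My Custom Monitor",
        "Contoso.Exchange.Custom",
    ]
    for bad in find_violations(inventory):
        print(f"does not follow the naming convention: {bad}")
```

The exact pattern matters less than having one.  Once every customization leads with the same prefix, a single search in the console finds all of them.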

Documentation:

I’m married to a Quality Manager, so I get to hear about this every day in the manufacturing world, and I happen to have a degree in Manufacturing Engineering to go with it.  Needless to say, IT doesn’t do documentation that well, especially when it comes to break fix.

  • Customizations:  This is a big one.  I typically recommend implementing a basic versioning system for your custom MPs and using the built-in description field to record the version number, what changed, who changed it, and why.  It doesn’t need a committee or special change management, but it can be very useful in keeping a running history of what was done in the environment.  As a bonus, when the SCOM owner leaves, his or her replacement will be able to pick up the changes much more easily.  What often happens when there’s little documentation, however, is that the new administrator is very tempted to simply start over.
  • Health Check:  Microsoft offers an excellent service for many of their products known as a health check. You may want to consider doing something like this within the first few months of rolling out SCOM. A health check will determine if there are any performance bottlenecks in your environment as well as identify potential issues that you may need to address.  It will help you see where your practices might be falling a bit short and allow you to maximize your use of the tool.  It’s not a requirement by any means, but it will provide you a very nice picture of your environment as well as a direction in terms of what needs to be addressed going forward.  (Shameless self promotion, but if by chance someone reads this and decides to purchase one, please be so kind as to let your account manager know that you read it here.  Those types of things look great on reviews).
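
The versioning habit described under Customizations can be as simple as one line of text per change.  Here is a minimal sketch of what that record might look like; the pipe-separated layout, the `ChangeEntry` class, and the helper names are my own assumptions for illustration and are not anything SCOM requires.  It only builds the text you would paste into the description field.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangeEntry:
    version: str      # dotted version string, e.g. "1.0.0.4"
    changed_on: date  # when the change was made
    author: str       # who changed it
    what: str         # what changed
    why: str          # why it changed

    def render(self) -> str:
        """One line of running history (hypothetical layout)."""
        return " | ".join(
            [self.version, self.changed_on.isoformat(), self.author, self.what, self.why]
        )


def bump_revision(version: str) -> str:
    """Increment the last component of a dotted version string."""
    parts = version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def append_entry(description: str, entry: ChangeEntry) -> str:
    """Add the entry as a new last line, dropping blank lines along the way."""
    lines = [line for line in description.splitlines() if line.strip()]
    lines.append(entry.render())
    return "\n".join(lines)


if __name__ == "__main__":
    history = "1.0.0.3 | 2020-01-15 | jdoe | Created disk monitor | Initial request"
    entry = ChangeEntry(
        version=bump_revision("1.0.0.3"),
        changed_on=date(2020, 2, 1),
        author="asmith",
        what="Raised disk threshold",
        why="Too noisy on file servers",
    )
    print(append_entry(history, entry))
```

The point is not the tooling.  A replacement administrator who can read four or five lines like these has far less reason to start over.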