Written by Kip Ng, Principal Microsoft Premier Field Engineer.
Operating an IT system nowadays is no doubt a daunting task. As a Microsoft Premier Field Engineer (PFE), I have the honor of working with customers of different sizes in different industries. I fully understand how overwhelming things can get.
What I find interesting is that while many companies are working towards adhering to some sort of framework like ITIL and to some measure of service level agreements, many do not have a strong strategy for, or focus on, making their IT environments healthy and keeping them healthy. Don’t get me wrong, ITIL is essential, but it isn’t without some shortcomings, in my humble opinion.
As stated on Criticisms of ITIL:
While ITIL addresses in depth the various aspects of Service Management, it does not address enterprise architecture in such depth. Many of the shortcomings in the implementation of ITIL do not necessarily come about because of flaws in the design or implementation of the Service Management aspects of the business, but rather the wider architectural framework in which the business is situated. Because of its primary focus on Service Management, ITIL has limited utility in managing poorly designed enterprise architectures, or how to feed back into the design of the enterprise architecture.
I am not going to debate what ITIL can or cannot do. What I am trying to say is that I have seen gaps in many companies, particularly small to medium-sized companies (even those that follow ITIL processes religiously), where there is a lack of strategy for getting the IT implementation healthy to begin with.
Throughout my years as a PFE, I have helped quite a few customers understand the need to add layers of proactive services to their existing processes on an ongoing basis, to get their environment to a healthy state and then keep it healthy.
So, what does it mean to have a “healthy” IT environment? How do we define the state of health here? Today’s systems are way more complicated than before. It is not as simple as installing something out of the box and running it. It is not that it isn’t going to work. It will work but the question is, “Is it running optimally?”
Think of it this way: take our own bodies, for example. We can eat unhealthily and live unhealthily. Will we have problems initially? Perhaps not, but will we last, and will we suffer the consequences in the end? I think you know the answer to that.
As I told many of my customers, the first thing we need to do is to find out if the environment is healthy. So:
1. Assess – find out the health state of the environment. What does that really mean? Oh, this isn’t a small task. We aren’t talking about just the system; we are talking about the whole environment, including the operations and the people as well. This is about reviewing everything (the solution, the environment, etc.) and determining a number of things such as:
Are the systems configured according to the recommended practices by the vendor (such as Microsoft)?
Are the solutions designed according to specification and business needs?
Are the solutions meeting the appropriate security specification?
Are there performance issues or bottlenecks currently in the system?
Are we monitoring for any sign of issues currently?
Are there any disaster recovery plans in place?
Do your people have enough knowledge to effectively manage the environment?
Are there appropriate processes in place such as escalation processes, change management processes, etc.?
Are there any well-defined service level agreements for the environment?
How often do we update the system?
The list can go on and I can talk about this for hours. Some of the assessments above are addressed by operations frameworks like ITIL or MOF, but a large portion of them are technical and very technology specific.
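For readers who like to see things concretely, here is a minimal sketch of how the questions above could be tracked as a checklist so that nothing is lost between the assessment and the follow-up work. Everything in it is an assumption made for illustration: the `Check` and `Status` names, the area labels, and the sample findings are hypothetical and do not come from any Microsoft tool or from ITIL/MOF.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_ASSESSED = "not assessed"


@dataclass
class Check:
    """One assessment question and its current result (hypothetical structure)."""
    area: str
    question: str
    status: Status = Status.NOT_ASSESSED
    notes: str = ""


def needs_attention(checks):
    """Return every check that has failed or has not been assessed yet."""
    return [c for c in checks if c.status is not Status.PASS]


def summarize(checks):
    """Count how many checks fall under each status."""
    counts = {status: 0 for status in Status}
    for check in checks:
        counts[check.status] += 1
    return counts


# Sample data: the questions come from the list above, the results are made up.
checks = [
    Check("configuration",
          "Are the systems configured according to the vendor's recommended practices?",
          Status.PASS),
    Check("disaster recovery",
          "Are there any disaster recovery plans in place?",
          Status.FAIL,
          "illustrative finding: no documented plan"),
    Check("people",
          "Do your people have enough knowledge to effectively manage the environment?"),
]

for check in needs_attention(checks):
    print(f"[{check.status.value}] {check.area}: {check.question}")
```

Treating "not assessed" as needing attention is a deliberate choice in this sketch: an unanswered question is a risk you simply have not measured yet.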
2. Remediate and Stabilize – Assessment is great. However, if we just assess and do not do anything about the findings, then it is as good as doing nothing. This again is a potentially huge effort, depending on the findings from the assessment. Most companies I have come across were shocked to see the results of some assessments and the work required to remediate the risks and issues they uncovered. Some have chosen to ignore the results, and some have taken the approach of selective remediation due to a lack of resources.
How important is this, really? Well, how important is it for you to keep your body healthy? I think the answer is clear. We all do (or at least should do) a medical check-up annually and take the necessary action to prevent any potential issue or problem, because we know it is important to do so. Why should that be any different for the critical systems you operate?
I should probably highlight that this isn’t a one-time effort either. Why? Because recommendations change as the environment changes and the products evolve. Each time you make a change, increase the number of users, upgrade, apply hotfixes or service packs, or change your process, you need to keep track of those changes and re-assess the environment periodically.