Perform Your Own IT Operational Assessment

During my time as a Microsoft PFE, I contributed to numerous IT Operational Assessments.  While there are many tasks within an Operational Assessment, I wanted to provide a ‘simple version’ by reviewing past tickets/outages over a specified time period.  Of course, there are other, more formal reviews, such as Microsoft Operations Strategic Review, but this one is simple and anyone can do it. 

Why should I care? If done properly, the results allow the management, operations, and engineering teams to better prepare their staff (ex: Training), anticipate problems (ex: Identifying underlying issues), and manage operations (ex: Improve processes).  As a PFE, I have used this to gain insight into:

  • Which IT Service (Product/Solution) is generating the most tickets and consuming the most man-hours to manage
  • How well you align to your service level and operating level agreements (SLA & OLA)
  • How services are trending in terms of resiliency and availability
  • What areas can be improved to reduce time-to-resolution

I recommend this as a quarterly report with an end-of-year review, but you can do it more or less frequently depending on your needs and on whether you have the data available. 

Now there are many ways to do such a review, and I ALWAYS encourage engaging the experts in this area because they will often find things that you won't. Besides, another set of eyes rarely hurts. But this review is a great do-it-yourself starter kit (so to speak) and can often help you identify easily-resolved items that have a meaningful impact, as well as provide justification for operational health tasks.  Of course, this only works well if the data you are using is accurate and available.

First, let me explain what this will NOT do:

  • Analyze performance of any kind
  • Measure how well any specific application, product, or solution is performing in isolation (I prefer to look holistically)
  • Provide HR-related fodder if you are trying to build a case to hire/fire someone
  • Quantify Service availability numbers (i.e. did you achieve 99.999% availability)

OK, let's get started...

STEP 1: DEFINE YOUR REQUIREMENTS

As with any project, start by defining your requirements, scope, and terminology. These are the core elements of a comprehensive operational assessment and may include the following (see attachments for how I used them):

  • Intent of the document/OAR
  • Data Collection Frequency: Monthly/Quarterly
  • Report Generation Frequency: Quarterly & Yearly
  • Scope of Data Collection: Organization vs. specific Product/Solution
  • Service Management Categories: People, Process, Environment, Technology, Other, Unknown, etc.
  • Severity Levels: 1-Critical/High Impact, 2-Severe/Significant Impact, 3-Moderate/Impact, 4-Nominal/No Impact
  • Service Desk Common Resolution Classifications

NOTE: These items will vary between organizations, so be sure to document what each means to you.
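
One way to keep those definitions documented and consistent is to write them down in a form that a script can also use. The following is only a minimal sketch in Python: the names `Severity`, `IMPACT`, `Category`, and `label` are my own illustrative choices, and the values simply mirror the example severity levels and categories listed above. Substitute whatever your organization actually uses.

```python
from enum import Enum, IntEnum


class Severity(IntEnum):
    """Severity levels, numbered as in the example list above."""
    CRITICAL = 1
    SEVERE = 2
    MODERATE = 3
    NOMINAL = 4


# Impact wording copied from the example list; adjust to your own definitions.
IMPACT = {
    Severity.CRITICAL: "High Impact",
    Severity.SEVERE: "Significant Impact",
    Severity.MODERATE: "Impact",
    Severity.NOMINAL: "No Impact",
}


class Category(Enum):
    """Service Management Categories (deliberately kept broad)."""
    PEOPLE = "People"
    PROCESS = "Process"
    ENVIRONMENT = "Environment"
    TECHNOLOGY = "Technology"
    OTHER = "Other"
    UNKNOWN = "Unknown"


def label(severity: Severity) -> str:
    """Build a display label such as '1-Critical/High Impact'."""
    return f"{severity.value}-{severity.name.title()}/{IMPACT[severity]}"


if __name__ == "__main__":
    for level in Severity:
        print(label(level))
```

Keeping the definitions in one place like this means every ticket gets classified against the same list, quarter after quarter.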

STEP 2: DATA COLLECTION

Typically I recommend collecting the data monthly, as it provides a good timeline structure without becoming overwhelming, but each person or environment may have their own preference. Start by collecting all trouble tickets, incidents, change/work requests, unscheduled/scheduled maintenance notifications, etc. generated during the time specified. For each item collected, document the following types of data:

  • Highest Severity Level
  • Current Status (ex: Open-OnHold, Open-Active, Closed-Unresolved, Closed-Resolved, etc.)
  • Impacted Technology (ex: Exchange, Active Directory, SQL Server, etc.)
  • Impacted Services (ex: Messaging, Directory Services, Database Services, etc.)
  • Average time (in hours) to acknowledge/react, resolve, and close (ex: Ack:1hr, Res:0.5hr, CL:1hr)
  • Categorize the item based on solution/root cause
  • Resources Used (ex: 2 Teams / 2 Staff )
  • Scheduled / Expected: Y/N (only applies to approved changes and project implementations)

NOTE: Again, these will vary within your organization and you might include more/less information. For example, some may include Uptime/Downtime, Perf Metrics, Storage Consumption based on department/office/technology, etc. Just don't collect garbage data that might 'fudge' the numbers, and don't collect so much that you get lost in mounds of data.
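
If your ticketing system can export to CSV, the fields above map naturally onto one record per ticket. Here is a rough sketch in Python, assuming a hypothetical export whose column names (`ticket_id`, `severity`, `ack_hours`, and so on) are my own invention, as is the single sample row. Your export will almost certainly look different.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Ticket:
    """One collected item; the field names are illustrative assumptions."""
    ticket_id: str
    severity: int          # highest severity level reached
    status: str            # ex: Closed-Resolved
    technology: str        # ex: Exchange
    service: str           # ex: Messaging
    ack_hours: float       # time to acknowledge/react
    resolve_hours: float   # time to resolve
    close_hours: float     # time to close
    root_cause: str        # solution/root cause category
    teams: int             # resources used: teams
    staff: int             # resources used: staff
    scheduled: bool        # Y/N, only meaningful for approved changes


def load_tickets(csv_text: str) -> list[Ticket]:
    """Parse CSV text (with a header row) into Ticket records."""
    tickets = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tickets.append(Ticket(
            ticket_id=row["ticket_id"],
            severity=int(row["severity"]),
            status=row["status"],
            technology=row["technology"],
            service=row["service"],
            ack_hours=float(row["ack_hours"]),
            resolve_hours=float(row["resolve_hours"]),
            close_hours=float(row["close_hours"]),
            root_cause=row["root_cause"],
            teams=int(row["teams"]),
            staff=int(row["staff"]),
            scheduled=row["scheduled"].strip().upper() == "Y",
        ))
    return tickets


# Made-up sample row, for illustration only.
SAMPLE_CSV = (
    "ticket_id,severity,status,technology,service,ack_hours,"
    "resolve_hours,close_hours,root_cause,teams,staff,scheduled\n"
    "INC-001,2,Closed-Resolved,Exchange,Messaging,1,0.5,1,Process,2,2,N\n"
)

if __name__ == "__main__":
    for ticket in load_tickets(SAMPLE_CSV):
        print(ticket)
```

Converting the values to numbers and booleans at load time tends to surface bad rows early, before they quietly skew the report.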

STEP 3: INPUT DATA AND PERFORM SUBJECTIVE ANALYSIS

Input the data into the spreadsheet (see attached) and then apply some subjective decisions to the information. For example, the Service Management Category might mean one thing to one person and something else to another. Just try to stay consistent and broad. Try not to get too narrow or restrictive; otherwise you'll have 50-100 different paths to choose from.
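
One way to stay consistent is to write your subjective rules down once and apply them the same way every time. The sketch below, in Python, maps a free-text root cause onto the broad categories from Step 1 using simple keyword matching. The keyword lists and the `categorize` function are purely hypothetical examples, not a recommended taxonomy, and naive substring matching will misfire on some text (for instance, "power" also matches "PowerShell"), so treat the output as a first pass to review by hand.

```python
# Broad categories from Step 1, each with a few example keywords (assumptions).
KEYWORDS = {
    "People": ("training", "human error", "staffing"),
    "Process": ("change control", "procedure", "approval"),
    "Environment": ("power", "cooling", "datacenter"),
    "Technology": ("bug", "hardware", "patch", "disk"),
}


def categorize(root_cause: str) -> str:
    """Return a broad category for a free-text root cause description."""
    text = root_cause.strip().lower()
    if not text:
        return "Unknown"   # nothing was recorded for this ticket
    for category, keywords in KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"         # recorded, but matches none of the rules


if __name__ == "__main__":
    for cause in ("Disk failure on mailbox server",
                  "Missed change control approval",
                  ""):
        print(repr(cause), "->", categorize(cause))
```

Keeping the rule set this small is deliberate: a handful of broad buckets is easier to apply consistently than dozens of narrow ones.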

STEP 4: GENERATE A REPORT BASED ON THE DATA

Consolidate the data into a single spreadsheet, then write a report that provides an analysis of your findings. Attached is a sample report. The key is to not be too subjective; try to keep to the facts. However, when you need to be subjective, try to maintain consistency.
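
If you would rather script the number crunching than do it by hand in the spreadsheet, a per-service summary might look something like the sketch below (Python). Everything in it is an assumption for illustration: the ticket rows are made up, the `SLA_RESOLVE_HOURS` targets are hypothetical, and `summarize` is just one possible way to slice the data.

```python
from collections import defaultdict
from statistics import mean

# Made-up tickets for illustration only.
TICKETS = [
    {"service": "Messaging", "severity": 1, "resolve_hours": 6.0},
    {"service": "Messaging", "severity": 3, "resolve_hours": 10.0},
    {"service": "Directory Services", "severity": 2, "resolve_hours": 4.0},
    {"service": "Messaging", "severity": 2, "resolve_hours": 2.0},
]

# Hypothetical resolution targets (hours) per severity level.
SLA_RESOLVE_HOURS = {1: 4.0, 2: 8.0, 3: 24.0, 4: 72.0}


def summarize(tickets, sla):
    """Per service: ticket count, total and mean hours, % resolved within target."""
    by_service = defaultdict(list)
    for ticket in tickets:
        by_service[ticket["service"]].append(ticket)

    report = {}
    for service, items in by_service.items():
        hours = [t["resolve_hours"] for t in items]
        within = sum(1 for t in items
                     if t["resolve_hours"] <= sla[t["severity"]])
        report[service] = {
            "tickets": len(items),
            "total_hours": sum(hours),
            "mean_resolve_hours": mean(hours),
            "sla_met_pct": 100.0 * within / len(items),
        }
    return report


if __name__ == "__main__":
    summary = summarize(TICKETS, SLA_RESOLVE_HOURS)
    # List the services generating the most tickets first.
    for service, stats in sorted(summary.items(),
                                 key=lambda kv: kv[1]["tickets"],
                                 reverse=True):
        print(service, stats)
```

Running the same summary each quarter and comparing the results side by side gives you the trend view, while the numbers themselves stay factual and repeatable.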

I hope this helps!  Good luck!

Da

IT Ops Assessment.zip