The Need for a Performance Analysis of Logs Tool

Introduction

Performance analysis of log files (*.blg files) in Microsoft Windows has primarily been a manual process for as long as I can remember. We [Microsoft] have made some strides in this area, but there is still little out there that analyzes log files *after* a problem has occurred. Furthermore, nearly all of the Microsoft tools require the user to capture performance data as the problem is occurring. While this is a great way to analyze a performance problem, it is sometimes impractical. This would be similar to asking a criminal to reenact a crime while we film it. Therefore, we need a tool that analyzes existing logs and pieces together the evidence after the performance issue (crime scene) has occurred. Until this happens, we must continue analyzing logs manually with our limited, individual knowledge. The knowledge is all around us; we just need to harness it. The Performance Analysis of Logs (PAL) tool (https://www.codeplex.com) is our first step towards the realization of this goal, but we cannot do it alone.

The Challenge

I go on-site with customers every week to assist with performance issues. Most of the time the customer has perfmon logs (*.blg), Event Logs (*.csv), IIS logs (*.log), and if we are lucky an Event Tracing for Windows (ETW) log (*.etl). The assumption is that network administrators should analyze these logs by hand because they know their environment best. The problem is that in order to *properly* analyze these log files, you need to have a working knowledge of Windows Architecture. The reality is that most people have a heavy workload and do not have time to learn Windows Architecture thoroughly, let alone keep up with it. I know enough about Windows Architecture to put the above log files to use and to typically formulate a hypothesis, but even I struggle with understanding some of the issues that I see and need to rely on others who know more. Even when you have a social network of subject matter experts, the process of analyzing Windows performance can still be slow. Therefore, the knowledge these experts have needs to be consolidated into an analysis tool or a central location. Many people may say that performance analysis is an art rather than a science. Well I say, let's take the art out of it and make it a science as much as we can.

One of the fruits of my team's labor towards this goal is the PAL tool (https://www.codeplex.com/PAL). The PAL tool (Performance Analysis of Logs tool) takes in all of the variables needed to analyze a performance issue and generates a report on its findings. It does this by interviewing the user to find out more about the computer, then using those answers in its analysis. Now, you are probably asking me, “Why did we write our own tool when there are so many great performance analysis tools that Microsoft has written?” Well, let’s talk about a few of the more recent and relevant tools that I am aware of and how PAL is different:

Microsoft Server Performance Advisor (SPA): SPA was written by the Windows Fundamentals team. It analyzes perfmon logs and Event Tracing for Windows (ETL) logs. It does a great job of aggregating data, but its analyses are based on XPath statements, which unfortunately are not flexible enough to consider all of the factors in analyzing performance. Furthermore, the problem must be reproduced while it’s gathering data. The tool should only be run for short time periods due to the large amount of data it gathers. Finally, it takes in *all* of the data points in the logs for its average values, meaning that if the problem occurred for 1 minute and the collection period is 20 minutes, then the averages are skewed. In any case, I am very fond of this tool and highly encourage its use. As a matter of fact, I'm reusing many of its concepts in the PAL tool today. SPA has the right idea; it just needs to be taken to the next level. You can download it at: https://www.microsoft.com/downloads/details.aspx?FamilyID=09115420-8c9d-46b9-a9a5-9bffcd237da2&DisplayLang=en

Microsoft Visual Studio Profiler: This tool is very good, but its focus is on application functions rather than operating system performance. I firmly believe that this is one of the best approaches to performance analysis for applications because you can identify which functions the application is waiting on. Unfortunately, you can only profile one process at a time, the profiling has a little bit of overhead, and you have to capture the data while the problem is occurring. I wrote a white paper on how to do this in a production environment located here (https://go.microsoft.com/fwlink/?LinkId=105797). I recommend using both this tool and a performance counter analysis tool such as PAL.

Microsoft xPerf/xTrace: Written by the Windows Fundamentals team. This is the latest and greatest tool out there for perf analysis. Unfortunately, it currently lacks an intuitive UI and analyses of the data collected. With that said, they are rapidly improving it. As with many of the other tools, you must capture data while the problem is occurring. Furthermore, it only runs on Windows Vista and Windows Server 2008. Unfortunately, customers are not continuously logging ETW data, so this doesn’t help me much for post analysis. Finally, it only analyzes ETL and no other log format. This would be like asking a crime investigator to analyze a crime scene using only one type of evidence. If that type of evidence isn’t available, then no analysis can be done. xPerf is part of the Windows Performance Toolkit located here: https://www.microsoft.com/whdc/system/sysperf/perftools.mspx

Microsoft System Center Operations Manager (SCOM): I’m always impressed with the SCOM team and their product. They analyze perf data as they go and do a great job of providing guidance and trends on the data shown. While this is great for customers who have SCOM, not all customers have it installed. Therefore, I am again left with manual analysis. Furthermore, SCOM might not have all of the data I need to analyze a problem.

Microsoft Log Parser: This is an incredible tool that parses many log types using an easy-to-use Structured Query Language (SQL) syntax. It just doesn't do any analysis. Therefore, the PAL tool uses Microsoft Log Parser as its data access layer. Microsoft Log Parser can be downloaded at: https://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&DisplayLang=en

There are certainly more tools out there, but my point is that none of the tools above meet all of the needs of post log analysis. This is why the PAL tool initiative was started.

The Solution

I strongly feel that if you complain about something, then you need to offer a viable solution, so here are the requirements of a tool that would be of practical use in the field.

Consolidated Guidance: First, there needs to be a central repository of guidance on performance analysis. We have many great whitepapers out there, but the knowledge is spread out and it takes a great deal of time to read and understand them, especially when you are trying to solve a problem. It’s like a guy bringing his car to an automotive repair shop and the mechanic hands the guy a huge book and says, “you can fix your car by reading this”. The guy will ask, “this is nice, but how do I fix my problem?” Shane Creamer’s Vital Signs workshop has done a great job of consolidating the basics into a short, 2-day workshop offered by my team (Microsoft Premier Field Engineering). The PAL tool has nearly all of the consolidated guidance from the Vital Signs workshop built into its report, so when a threshold is broken the guidance is context sensitive to that threshold. If you are interested in the Vital Signs workshop, then contact your Microsoft Technical Account Manager (TAM).

Log File Data Access Layer (DAL): Next, a simple-to-use data access layer is needed to analyze log files in a common way. The Microsoft Log Parser tool is a great tool for this, but it is based on legacy COM. No future versions of it are planned at this time, but I have asked the IIS product team to write a new version of it. They are considering it. Currently, the PAL tool uses Microsoft Log Parser as its DAL, and therefore inherits some of Log Parser's limitations.

Analyze More Data Points: Some of our log analysis tools, such as SPA, read in the entire log and generate average, minimum, and maximum values from it. This doesn’t cut it when the problem occurs only in a small portion of the log because the problem is averaged out by the sheer size of the counter log. The PAL tool breaks down perfmon logs into smaller time slices and analyzes each time slice individually for better accuracy.
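
To make the averaging problem concrete, here is a minimal sketch in Python. It is not the PAL tool's actual code (PAL is VBScript/VB.NET), and the sample values, the slice size, and the 25% alert level are all made up for illustration. It mirrors the earlier example of a 1-minute problem inside a 20-minute collection period.

```python
from statistics import mean

def slice_stats(samples, slice_size):
    """Split samples into consecutive slices; return (min, avg, max) per slice."""
    stats = []
    for start in range(0, len(samples), slice_size):
        chunk = samples[start:start + slice_size]
        stats.append((min(chunk), mean(chunk), max(chunk)))
    return stats

# Hypothetical data: 20 one-minute samples of a CPU counter with a 1-minute spike.
samples = [10.0] * 20
samples[12] = 100.0

# Whole-log average: the spike is diluted by the other 19 samples.
overall_avg = mean(samples)

# Per-slice analysis: four slices of 5 samples each.
per_slice = slice_stats(samples, 5)

# Hypothetical alert level of 25% on the slice average.
flagged = [i for i, (_, avg, _) in enumerate(per_slice) if avg > 25.0]

print(overall_avg)  # 14.5, below the alert level
print(flagged)      # [2], the slice that contains the spike
```

The whole-log average stays under the alert level, while the third slice stands out clearly. Smaller slices give better resolution at the cost of more work per log.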

Dynamically Changing Thresholds and Interviewing the User: Most performance analysis tools assume that the user knows how to change the thresholds and what to change them to. Likewise, unless you are a Windows architecture guru, you as the user assume the tool is using the appropriate thresholds. When both parties rely on the other to make the best decision, this can cause confusion and misdiagnosis. In order for next generation tools to be effective, they need to have dynamically changing thresholds based on the environment. The point is that you need a tool that learns the customer’s environment and adjusts its thresholds appropriately, even if this means simply asking the user for some additional input. For example, to determine if a computer is running out of paged pool or non-paged pool memory, the PAL tool asks the user a series of questions to estimate the maximum sizes of these memory pools, then computes 60% and 80% thresholds for each. The PAL tool does this by running executable code at run-time, using the user’s input as variables for the code to determine if the thresholds are broken. Using executable code as the thresholds and being able to ask the user questions about the environment makes the tool flexible enough to handle nearly any performance analysis challenge.
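
As a rough illustration of the 60% and 80% calculation, here is a small Python sketch. It is not PAL's implementation: the function names, the "warning" and "critical" labels, and the 256 MB estimated maximum are assumptions chosen only to show the arithmetic. In practice the estimate would come from the user's answers to the interview questions.

```python
MB = 1024 * 1024

def pool_thresholds(estimated_max_bytes, warning_pct=60, critical_pct=80):
    """Derive absolute thresholds from an estimated maximum pool size."""
    return {
        "warning": estimated_max_bytes * warning_pct // 100,
        "critical": estimated_max_bytes * critical_pct // 100,
    }

def classify(pool_bytes, thresholds):
    """Return 'critical', 'warning', or 'ok' for an observed pool size."""
    if pool_bytes >= thresholds["critical"]:
        return "critical"
    if pool_bytes >= thresholds["warning"]:
        return "warning"
    return "ok"

# Hypothetical result of the user interview: the pool can grow to about 256 MB.
thresholds = pool_thresholds(256 * MB)

for observed_mb in (100, 200, 210):
    print(observed_mb, classify(observed_mb * MB, thresholds))
```

The same observed value can be healthy on one computer and alarming on another, which is why the thresholds are derived from the environment rather than hard-coded.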

Reusable: Our next generation tools need to be reusable, meaning portions of them can be reused by other applications and tools. Luckily, tools like xPerf are modularized in this way, but I wanted to emphasize that this needs to continue. Currently, the PAL tool is a hybrid of VB.NET and VBScript and is open source, so users can simply copy the code they want to reuse.

Free and Public: Our next generation tools need to be free for all of our users. Many times when tools become intellectual property (IP), they inherit licensing restrictions such as a cost to use. The PAL tool is a free, open source tool available at https://www.codeplex.com/PAL.

Extensibility of the Thresholds: As mentioned above, the thresholds need to be executable code to be flexible enough to handle complex analyses. In addition, the code needs to be open so that users can add new thresholds or update existing ones. This is important because no one person can claim to know all of the technical aspects of all performance problems in Windows. You have to empower the people who are experts in their field to add, edit, and delete the thresholds in the tool. The PAL tool accomplishes this by using VBScript for the thresholds, with the VBScript embedded in XML-based threshold files. Included with the PAL tool is an editor that makes it easy for subject matter experts to edit PAL threshold files. Furthermore, with the help of other subject matter experts, we have several product-specific threshold files, namely Active Directory, IIS, MOSS, SQL Server, BizTalk Server, Exchange Server, and general Windows.
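
The idea of thresholds as executable code stored in XML can be sketched in a few lines. The example below uses Python in place of VBScript, and the XML layout, element names, variable names, and threshold values are all invented for illustration; they are not the real PAL threshold file format. Each threshold carries a small expression that is evaluated at run-time against per-counter statistics and the user's interview answers.

```python
import xml.etree.ElementTree as ET

# Hypothetical threshold file; PAL's real schema and its VBScript differ.
THRESHOLD_XML = r"""
<thresholds>
  <threshold name="High processor time" counter="\Processor(_Total)\% Processor Time">
    <code>avg &gt; 80</code>
  </threshold>
  <threshold name="Low available memory" counter="\Memory\Available MBytes">
    <code>minimum &lt; 0.10 * total_ram_mb</code>
  </threshold>
</thresholds>
"""

def load_thresholds(xml_text):
    """Return (name, counter, code) tuples from the threshold XML."""
    root = ET.fromstring(xml_text)
    return [
        (t.get("name"), t.get("counter"), t.findtext("code").strip())
        for t in root.iter("threshold")
    ]

def broken_thresholds(thresholds, stats_by_counter, answers):
    """Evaluate each threshold expression; return the names that are broken."""
    broken = []
    for name, counter, code in thresholds:
        # Counter statistics plus interview answers become the variables.
        variables = dict(stats_by_counter[counter], **answers)
        # eval() is not a sandbox; only load threshold files you trust.
        if eval(code, {"__builtins__": {}}, variables):
            broken.append(name)
    return broken

# Hypothetical statistics for one time slice and one interview answer.
stats = {
    r"\Processor(_Total)\% Processor Time": {"avg": 35.0, "minimum": 5.0},
    r"\Memory\Available MBytes": {"avg": 900.0, "minimum": 300.0},
}
answers = {"total_ram_mb": 4096}

result = broken_thresholds(load_thresholds(THRESHOLD_XML), stats, answers)
print(result)  # ['Low available memory']
```

Because the rules live in a data file rather than in the program, a subject matter expert can add or change a threshold without touching the analysis engine. The trade-off is that threshold files run as code, so they should only come from trusted sources.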

Low Requirements: Some of the tools written by other teams at Microsoft, such as the Visual Studio Load Test tool, require a back end SQL database in order to process the collected data. Many people in the field don’t have SQL Server running on their laptops, so our tools need to be able to run on workstation class computers. The PAL tool simply requires Microsoft Log Parser and Office Web Components 11, both of which are free downloads.

Conclusion

We need a tool that can analyze a wide variety of logs, similar to how a crime scene investigator analyzes a crime scene by gathering the evidence from the scene (in this case the log files) and analyzing it with scientific precision. The PAL tool is a tentative solution to the problem and has enjoyed great success, with over 2000 downloads per month. With that said, we cannot do this alone, especially since this is not part of my regular job. While a few of the Microsoft product groups are starting to follow some of the concepts of the PAL tool, no product group at Microsoft that I know of specializes in performance analysis in this fashion. Creating a tool with all of the aspects I mentioned above is difficult. With that said, the benefits of such a tool are clear: the better Microsoft Windows performs, the happier customers are with Microsoft products. In the end, my real intention is to simply make my parents' computer run faster. ;-)

Moving Forward

If you want to assist with this effort, then please try out the PAL tool and help with its development. For more information on the PAL tool, please go to https://www.codeplex.com/PAL.

All my posts are provided "AS IS" with no warranties, and confer no rights. For PFE Job Opportunities at Microsoft, please visit our website at: https://members.microsoft.com/careers/search/default.aspx - search for keyword “PFE”
“PFE: The best place to be at Microsoft”