A Closer Look at Behavioral Classification

Hi, my name is Tony Lee. I am a virus researcher on the Microsoft Antimalware team. One of our top priorities is to conduct advanced research to combat malware problems. A significant challenge we have today is the large number of active malware samples, totaling in the order of tens of thousands, and increasing rapidly. It has become apparent to us that the traditional manual analysis process is not adequate in dealing with malware of this order of magnitude, and that we should seek automation technologies to aid human analysts. To address this challenge, we are conducting research on technologies which model human analysis, to enable autonomous processes that analyze and classify malware, in an automatic and adaptable manner.

As described by Matt’s EICAR recap post last week, my colleague Jigar Mody and I presented a paper on this research work at the EICAR conference at Hamburg, Germany. The subject was on automatic malware classification using runtime events and machine learning. The underlying approach we took involves capturing malware behavior in a time sequence of events, which is a knowledge representation we then used as input to a machine learning process to uncover similarity information across a large number of malware samples. The novelty in this research is the application of a distance-based clustering algorithm on behavioral data observed during malware execution. Past technologies and research attempts, such as rule-based, weights/thresholds and abstract feature set approaches, focus mostly on heuristics to detect generic categories of malware (e.g. malicious or not); common challenges include algorithms too generic to provide classification precision or difficult to scale to unaccounted characteristics.

Having looked into numerous research and technologies from the past, we decided to take a step back and approach the problem from ground zero. First, we conceptualized the classification process in terms of knowledge consumption, representation, learning, and application. We then tackled the basic problem of representing knowledge extracted from malware. By using event sequences as a representation, we were able to describe the ordered effects or system transition states observed from malware behavior. Unlike common statistics-based data mining techniques (association rules, Bayesian classifiers, term vector, etc.), we use instance-based learning, allowing objects represented in rich syntax (ordered sequence). We solved the object-distance problem by adapting Levenshtein distance to measure similarity between objects. We took an innovative approach to fine tune edit operation cost as a function of event type, values and operation type, in order to achieve optimal similarity precision. We then employed K-medoids clustering algorithm to construct semantic groups of malware objects based on similarity measure, classes to serve as the basis to family classification.

The preliminary tests, conducted among 3 to 11 families, with total over 700 variants, showed fairly high classification accuracy (up to 84%). The tests reveal a  consistent trend of improving results with respect to number of families, clusters and events. As the number of events used in the similarity measure increase, we see the increase in accuracy as we expected. We also found that number of clusters and families affect the classification accuracy positively, because, given proper similarity measure, the more semantic groups proposed, the more centers or space for data points to be drawn to the right groups. We also observe the outlier effects were contained to the degree in proportion to the number of clusters proposed, because of the stronger collective “gravitational forces” due to the increased number of centroids. For more detailed observations, please see the paper available for download from the Microsoft Download Center.

We are continuously working on optimizing the algorithms and techniques of this system, such as the similarity measure precision, clustering algorithm efficiency, malware replication system effectiveness, and applications in domains such as automated behavior descriptions and correlation analysis. The method we have proposed in the paper is one of the many routes towards a solution that addresses the challenge of a growing number of malware in the wild.

-Tony