The future of independent antimalware tests

Our guiding vision at the Microsoft Malware Protection Center (MMPC) is to keep every customer safe from malware. Our research team, machine learning systems, and industry engagement teams work around the clock to achieve this vision.

As part of these efforts, we are also working with independent antimalware testing organizations to advance the relevance of independent testing and reporting. Our goal is to help these organizations test with malware that has significant customer impact. We have come a long way together, and we can still make significant advances to on-demand file-detection tests.

Current on-demand file-detection tests have some limitations. They are typically carried out by first assembling a set of malware samples and then scanning them with antimalware products. The samples that each product fails to detect are counted and expressed as a percentage of the test set, and the undetected percentages are then compared across products to produce the comparative test results. Some testers use prevalence data to choose their sample set, and some apply curves to the results, but ultimately the fundamental test scheme is the same across the board.
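
To make the current scheme concrete, here is a minimal sketch in Python; the sample names and detection results are hypothetical. Every missed sample counts equally toward the score, no matter how many customers it actually affects.

```python
# Hypothetical test set and detection results for one product.
test_set = ["sample_a", "sample_b", "sample_c", "sample_d"]
detected = {"sample_a", "sample_b", "sample_d"}

# Every undetected sample counts the same, regardless of customer impact.
missed = [s for s in test_set if s not in detected]
undetected_pct = 100.0 * len(missed) / len(test_set)
print(f"Undetected: {undetected_pct:.1f}%")  # 25.0% in this example
```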

One major issue with this methodology is that it does not differentiate between samples in the test set. Each sample has a different impact on customers, yet every sample is weighted equally. This has been a concern for us, because the results don't take prevalence-based customer impact into account.

To evolve antimalware test methodologies, this problem can be solved by weighting the samples according to their customer impact – that is, how often a particular malware sample is encountered by customers. The first step is to apply a weighting based on each specific sample's prevalence: if the sample has impacted a large number of customers, it gets a relatively large weight; if it has impacted relatively few customers, it gets a smaller weight. However, this approach isn't quite enough.
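
As a rough sketch of that first step, with hypothetical encounter counts rather than real telemetry, the per-sample weight could simply be each sample's share of all customer encounters:

```python
# Hypothetical per-sample prevalence: customer encounters per sample.
encounters = {"sample_a": 120_000, "sample_b": 5_000,
              "sample_c": 300, "sample_d": 90_000}
detected = {"sample_a", "sample_b", "sample_d"}   # what the product caught

total = sum(encounters.values())
sample_weight = {s: n / total for s, n in encounters.items()}

# Missing the rarely encountered sample_c now costs far less than
# missing a widely encountered sample would.
weighted_miss = sum(w for s, w in sample_weight.items() if s not in detected)
print(f"Prevalence-weighted miss: {100 * weighted_miss:.2f}%")  # ~0.14% here
```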

Different malware families have different behaviors. For example, some malware families use polymorphism: they change their files with every infection, causing many samples within that family to have relatively low prevalence. In this case, if the malware family has a high prevalence but each sample has a low prevalence, then without a family weight those samples are lost in the mix. To address this, a family weight should be included in addition to the specific sample prevalence weight.
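
One way to fold in a family weight, again using hypothetical numbers and an assumed 50/50 blend between the two weights, is to combine each sample's own prevalence with the prevalence of the family it belongs to:

```python
from collections import Counter

# Hypothetical data: polymorphic Family.A produces many rarely seen files,
# while Family.B ships one widely seen file.
sample_family = {"poly_1": "Family.A", "poly_2": "Family.A", "common_1": "Family.B"}
sample_encounters = {"poly_1": 10, "poly_2": 12, "common_1": 50_000}
family_encounters = {"Family.A": 400_000, "Family.B": 60_000}

total_samples = sum(sample_encounters.values())
total_families = sum(family_encounters.values())
family_size = Counter(sample_family.values())   # samples per family in the test set

def combined_weight(sample):
    # Blend sample-level prevalence with the family's prevalence, spread across
    # that family's samples in the set; the 50/50 blend is an assumption.
    sw = sample_encounters[sample] / total_samples
    family = sample_family[sample]
    fw = family_encounters[family] / total_families / family_size[family]
    return 0.5 * sw + 0.5 * fw

for s in sample_encounters:
    print(s, round(combined_weight(s), 4))
# poly_1 and poly_2 are rare as individual files, but Family.A's high
# prevalence keeps their combined weight from vanishing.
```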

After applying the weights described above, it is possible to generate a risk factor that describes how much risk a customer faces from the samples in the test set, depending on which antimalware product they use. On top of that, using geographical sample and family weights allows for a geographical risk breakout.
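
A sketch of how such a risk factor and its geographical breakout might be computed, using hypothetical combined weights and per-region encounter shares:

```python
# Hypothetical combined weights (sample + family) and per-region encounter shares.
weights = {"sample_a": 0.45, "sample_b": 0.30, "sample_c": 0.20, "sample_d": 0.05}
region_share = {                      # fraction of each sample's encounters per region
    "sample_a": {"NA": 0.7, "EU": 0.2, "APAC": 0.1},
    "sample_b": {"NA": 0.1, "EU": 0.3, "APAC": 0.6},
    "sample_c": {"NA": 0.4, "EU": 0.5, "APAC": 0.1},
    "sample_d": {"NA": 0.2, "EU": 0.2, "APAC": 0.6},
}
missed = {"sample_b", "sample_c"}     # samples a given product failed to detect

# Overall risk factor: the total weight of what the product missed.
risk = sum(weights[s] for s in missed)

# Geographical breakout: split each missed sample's weight by region.
regional_risk = {}
for s in missed:
    for region, share in region_share[s].items():
        regional_risk[region] = regional_risk.get(region, 0.0) + weights[s] * share

print(f"Overall risk factor: {risk:.2f}")
for region, r in sorted(regional_risk.items()):
    print(f"  {region}: {r:.2f}")
```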

This kind of prevalence-weighted test is a game changer. Shifting to a weighted approach will help customers and antimalware vendors understand how their products perform in the real world, based on real malware prevalence and impact.

There are a few caveats to such a test. The most significant is the prevalence data itself. Where would this prevalence data come from, and how would it be validated? Ideally, the data would be generated and compiled through an antimalware industry collaboration. The MMPC is contributing to this data and is working with independent testers to validate it.

With the participation of the MMPC and other antimalware vendor collaborators, it is possible to produce the best and most meaningful set of on-demand test results yet. This is the next step in our continued journey with independent antimalware testers to drive more relevance into testing.

Joe Blackbird

Comments (5)

  1. adwbust says:

    so you're saying that tests should focus on one family based on impact? but a test is usually already focused on one criterion – on-demand, proactive/dynamic, behavior, repair, etc. it's like saying a student should focus on science since he wants to be a dr and dilly-dally on other subjects. antimalware should be realistically holistic. just because my user doesn't encounter a certain threat doesn't mean the threat should be ignored or played down.

    besides, users encounter threats randomly. well, mostly based on geo or user behavior/activity (risk factor). also, why would a testing org obtain data from the corps? that would make it biased (sponsored), make the testers look lazy, and already spell out the results of the test. asian avs say they encounter more gamestealers, whereas usa avs encounter more fakeav. of course, an asian av will get stellar results on the gamestealer family and poor results on fakeav, which it doesn't encounter much!

    what testing orgs should test is the response time of adding "detection" and "removal" sigs. mse sucks there. even if a submitted sample is not prevalent, a user submitted and encountered the threat. mse didn't detect it. so you expect the user to remove it manually or wait months before mse takes action? 🙂 most avs also just detect by hash. not effective against smart affiliate downloaders. they want me to submit every downloader for every software on the download site. lol. at least use some fuzzy hashing.

    mse also lacks a working AI cloud to classify the data it encounters. mse still relies on lab automation (local) and manual analysis (by analysts). behavior protection is passive (collection) and the malware criteria play it safe on grey threats. don't want to dip your hand into too much technicality, aye? too much work? 😐

  2. adwbust says:

    hey joe. i have a question: why can't mse decrypt files encrypted by ransomware? kaspersky and dr web can. does that mean ransomware is more prevalent in russia and more prioritized there? or… 😛 mse also doesn't do too well at removal of threats. mse is like mrt – an mrt that is updated daily. the real mrt is updated monthly. lol

  3. Guitar Bob says:

    Giving all samples the same weight might be okay if the sample set is relatively new. It might be a good test of heuristics or generic signatures, which certainly should be considered in evaluating an AV. Virus Bulletin does something like this with the RAP test they give for each AV tested.

  4. fuck microsoft says:

    microsoft is bulls**t. take linux