Insider's Guide to Comparative Anti-Virus Reviews

By David Harley

There has been a certain amount of excitement and irritation in anti-virus
research circles about a not-very-good comparative test of anti-virus
scanners that was conducted at LinuxWorld on 8th August, 2007. I was so
exercised personally that I sat down and wrote a long white paper (free,
gratis and unpaid by anyone) on Untangling the Wheat from the Chaff in
Comparative Anti-Virus Reviews.

Here, though, is a less irascible summary that might give you some pointers on
assessing how good a comparative test is likely to be.

1) Small sample sets (the Untangle test used 18 discrete samples) tell you
how a given scanner performed against a small set of presumed malware, in a
specific "snapshot" context:
* According to the testing conditions
* According to the way the scanner was configured for the test
They tell you nothing about how the scanner will perform against any other
sample set. If you want to test detection of In the Wild (ItW) viruses
meaningfully, you have to use a full and fully validated test set that meets
an acceptable definition of "In the Wild," not a few objects that may be
viruses and may be ItW. How many is a full set? Well, the current WildList
at the time of writing consists of 525 viruses on the main list (1,958 if
you count the supplementary list). See the WildList website for an
explanation of how these lists work, and Sarah Gordon's article "What is
Wild?" for a consideration of what we mean by ItW.

Why should you believe what the WildList Organization tells you? Well, there
are problems. WLO only tracks viruses (at the moment: that is changing), and
the list is always months out of date, because of the time it takes to
process and cross-match samples, and so on. But that's the point. WildCore,
the WLO collection, gives a tester a sound, pre-verified collection to work
from (though testers with access to that collection are still expected to
generate and validate their own samples from it, not just run it against
some scanners). It doesn't, and can't, include all the viruses (let alone
other malware) currently in the wild in a less technical sense, but it does
give you a dependable baseline for a valid sample set. Of course,
professional testing organizations don't necessarily only test detection of
malware in the wild. They may also test for zoo viruses and other malware
that isn't known to be in the wild. This is important, because they cannot
assume that a customer will never need to detect or protect against these.
They may also test heuristic detection, time to update, and some forms of
usability testing, but I won't go into detail on these interesting but
complicated methodologies on this occasion.

2) Unvalidated samples invalidate a test that uses them. A collection needs
care and maintenance, as well as a significant test corpus, and sound
validation is a critical factor in professional testing. Assuming that
samples from blackhat web sites or your own mailbox are (a) malware and (b)
specific malware variants because your favourite scanner tells you so is not
validation, and offers no protection against false rejections (false
positives). If that scanner happens to be one of the scanners you're
testing, you introduce an unacceptable degree of bias in favour of that
product.
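
To make the point concrete, here's a minimal sketch, in Python, of the
kind of independent cross-check a tester might apply: comparing each
candidate sample's cryptographic hash against a manifest of independently
validated samples, rather than trusting the verdict of one of the scanners
under test. The manifest name, its format and the helper functions here are
my own illustrative assumptions, not any testing body's actual procedure.

import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    # Hash the file in chunks so large samples don't have to fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def check_samples(sample_dir: str, manifest_path: str) -> None:
    # The manifest is assumed to map SHA-256 digests to validated
    # family/variant names produced by independent analysis, not by one of
    # the scanners being tested.
    manifest = json.loads(Path(manifest_path).read_text())
    for sample in sorted(Path(sample_dir).iterdir()):
        if not sample.is_file():
            continue
        digest = sha256(sample)
        if digest in manifest:
            print(f"{sample.name}: validated as {manifest[digest]}")
        else:
            # Unverified objects should be excluded, not assumed to be malware.
            print(f"{sample.name}: not in validated manifest - exclude it")

if __name__ == "__main__":
    check_samples("samples", "validated_manifest.json")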

3) While a voluntary community resource can make a significant contribution
to the common weal, even in security (ClamAV, Snort, and so on), it can't
match a full-strength industrial solution in all respects (contractual
support, for example). When people find a positive attribute in an object,
such as a $0 price tag, they're tempted to overestimate its other positive
attributes and capabilities (this is sometimes referred to as the "halo
effect"). That's understandable, but it has no place in a rigorous testing
program, and to be less than rigorous when you're making recommendations
that affect the security and well-being of others is reprehensible.

4) Other concepts you should be aware of are ultracrepidarianism and False
Authority Syndrome, which can be informally defined as a tendency for those
with a platform to speak from to overestimate their own competence in
subjects in which they have no specialist expertise. When looking at a test,
you are advised to take into account the expertise and experience of the
individual conducting the test. The widespread popular distrust of the
anti-virus community extends not only to attributing malicious behaviour to
AV vendors ("they write the viruses") but to assuming their essential
incompetence. Strangely enough, there are some pretty bright people in
anti-virus research. Scepticism is healthy, but apply it to people outside
that community, not just those within it! As a rule of thumb, if you think
that anyone can do a comparative test, that suggests that you don't know
what the issues are. One of the politer comments directed at me after I
published the paper was "You don't need to be a cook to tell if a meal
tastes good." Perfectly true. But you do need to know something about
nutrition to know whether something is good for you, however good it tastes.

5) In the Wild is a pretty fluid concept. In fact, it's not altogether
meaningful these days, when worms that spread fast and far are in decline
and malware is distributed in short bursts of many, many variants. Come to
that, viruses (and worms) are much less of an issue than they were.
Anti-virus isn't restricted to that arena (though you may have been told
otherwise by instant experts), but it can't be as effective in all areas as
it was when viruses were public enemy number one. On the other hand, no
solution is totally effective in all areas. The AV research community is
(slowly and painfully) coming to terms with the fact that the test landscape
has to change. Mistrust any test that doesn't even recognize that the
problems exist.

6) There is a whole raft of problems connected with the types of object
used for testing by non-professional testers: non-viral test files, garbage
files from poorly maintained collections, unvalidated samples from malware
generators, simulated viruses and virus fragments, and so on.
* If you don't know anything about the test set, assume the worst.
* If you don't know where it came from, mistrust it.
* If you don't know how or if it was validated, mistrust it.
* If you suspect that it came from one of the vendors under test, or that
the only validation carried out was to identify samples with one of the
scanners being tested, mistrust it. With extreme prejudice.
* If you don't know what the samples were or how many, mistrust them.
* If you're offered the chance to test the same set for yourself, be aware
that unless you're an expert on malware and testing, or have reliable
contacts in the community who can do it for you, you'll probably reproduce
faulty methodology, so the results will be compromised.

7) Sites like VirusTotal are not intended to conduct any sort of comparative
testing; they're for trying to identify a possibly malicious object at a
given moment in time. Unless you know exactly what you're doing, any results
you get from such a site are useless for testing purposes, and if you ask the
guys who run these sites, they'll usually agree that comparative detection
testing is an inappropriate use of the facility.

8) The EICAR test file is not a virus, and doesn't belong in a virus sample
test set. It's perfectly reasonable to test how scanners process the EICAR
test file, but the fact that a scanner recognizes it proves only that the
scanner is installed and responding; it doesn't prove that the test-bed
scanner applications have been configured properly.
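
By way of illustration, here's a minimal sketch (again in Python) that
writes out the EICAR test file so you can check that a scanner is installed
and responding. The 68-byte string is reproduced from memory and the output
file name is my own choice; check the official version published at
eicar.org before relying on it.

# Minimal sketch: create the EICAR test file. Detection of this file shows
# only that a scanner is present and responding; it says nothing about
# detection of real malware.
EICAR = (
    r"X5O!P%@AP[4\PZX54(P^)7CC)7}$"
    r"EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"
)

def write_eicar(path="eicar.com"):
    # Compliant scanners expect the 68-byte string at the start of the file,
    # optionally followed by nothing but whitespace.
    with open(path, "wb") as f:
        f.write(EICAR.encode("ascii"))

if __name__ == "__main__":
    write_eicar()
    print("Wrote eicar.com - expect a resident scanner to flag or remove it.")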

9) Your test-bed apps have to be similar in functionality and platform, and
should be configured carefully so that no single product has an unfair
advantage. One particularly memorable example some years ago was a test
(nothing to do with Untangle) that reviewed several known-virus scanners and
a single generic application. The latter was given the "editor's choice"
accolade because it stopped the entire test set from executing. This sounds
fair enough unless you realize that many people and organizations still
prefer virus-specific detection because generic products can't distinguish
between real threats and "innocent" objects: this usually means that either
all objects are blocked (executable email attachments, for instance) or else
that the end user has to make the decision about whether an object is
malicious, which, for most people, defeats the object. By failing to
acknowledge this issue and imposing his own preferences on what should have
been an impartial decision, that particular tester invalidated his
conclusions. In other words, apples and oranges look nice in the same fruit
bowl, but you need to know the difference between them before you choose one
to eat. You also need to be clear about what it is you're testing.

The Untangle test was presented as a test of known viruses in the wild. In
fact, because of the methodology used, it effectively tried to test several
things at once. There seems to have been no separation between desktop
scanners, appliances or gateway scanners, or between platforms, or between
command-line and GUI interfaces. The tester failed to recognize that he was
actually trying to conduct four tests at once: recognition of the EICAR test
file, recognition of presumed "wild" malware, recognition of presumed "zoo"
malware (known, but not necessarily in the wild), and recognition of unknown
presumed malware (essentially a test of heuristic detection). Even if he'd got
everything else right, the test would have been let down by the muddled
targeting.

10) There are reasons why some tests are generally considered valid by the
AV community. Some of those reasons may be self-serving - the vendor
community is notoriously conservative - but they do derive from a very real
need to implement a stringent and impartial baseline set of methodologies.
Unfortunately, to do so requires considerable investment of time and
expertise, and that's expensive. (That's one of the reasons that many
first-class tests are not available to all-comers.) To understand what makes
a test valid, look at the sites listed below, find out how they conduct
tests and learn from it. You don't have to accept everything they (or I)
say, but you'll be in a better position to assess comparative reviews in the
future.

None of the following sites has the universal, unquestioning approbation of
the entire anti-virus research community, but they are taken seriously:

The paper I mentioned earlier includes a number of references and further
reading resources, for those who want to know more about this difficult but
fascinating area.