What does a spam fighter do all day? Part 1

I was recently thinking about what a person who fights spam (like me) does all day.  In other words, what is a day in the life of a spam analyst like?

The question for me is two-fold, because what I do now is quite different from what I did when I first started three years ago.  So, I'm going to break this into two posts: one for what I do now and one for what I did back then.  The idea is to break down how we go about fighting spam.

Back when I first started, I eventually settled into three daily tasks:

  1. Process false positive submissions
  2. Process spam abuse submissions
  3. Process IP blocklist candidates and delistings

The method of going through the false positive inbox was in itself a process:

  1. Separate the messages by cause of filtering (i.e., was it filtered by a spam rule, or by one of our automated processes?).  I wrote back-end scripts to do this.
  2. Move joke messages or obvious spam into the not-valid folders.  I wrote a back-end script to do this as well; it was the first one I wrote.
  3. Manually go through and separate the wheat from the chaff.  You wouldn't believe how many invalid submissions come to the false positive inbox.
  4. Go through the messages one by one and fix the broken rule.  I wrote some partial automation scripts to do this as well, so once I fixed one, any others that were caused by the same rule were subsequently moved.  Since I had already pre-sorted them I could assume all messages were valid.
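
The pre-sorting and "fix one rule, clear the rest" steps above can be sketched in a few lines of Python.  This is only an illustration of the idea, not the scripts I actually used: the submission format and the field names (rule_id, subject) are hypothetical.

```python
from collections import defaultdict


def group_by_rule(submissions):
    """Bucket false positive submissions by the rule that filtered them.

    Each submission is assumed to be a dict with a hypothetical
    "rule_id" key naming the spam rule or automated process that fired.
    """
    groups = defaultdict(list)
    for submission in submissions:
        groups[submission["rule_id"]].append(submission)
    return groups


def resolve_rule(groups, rule_id):
    """After a rule is fixed, pull every submission it caused out of the queue."""
    return groups.pop(rule_id, [])


# Made-up sample data for illustration only.
inbox = [
    {"rule_id": "rule-101", "subject": "Quarterly stock report"},
    {"rule_id": "rule-202", "subject": "Lunch on Friday?"},
    {"rule_id": "rule-101", "subject": "Stock option paperwork"},
]

queue = group_by_rule(inbox)
cleared = resolve_rule(queue, "rule-101")
print(len(cleared), "submissions cleared;", len(queue), "rule(s) left to review")
```

Grouping first means each broken rule only has to be examined once, no matter how many submissions it produced.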

My goal for false positives was to divide and conquer.  I tried to automate as much as I could, but the biggest part of what I did was still separating good mail from bad mail.  It didn't help that the file system was slow, which is why I wrote the automated scripts to begin with.  For the most part, I could keep up with all false positives every day.  On Mondays, I usually saw between 1,500 and 2,000 submissions, and the volume dropped through the week, with Friday being the lowest day.  For some reason, Wednesday was often higher than Tuesday.

I became very good at telling why a message was filtered just by looking at it, without even checking the headers.  Even when one of our automated filters caused the false positive, I could still tell which one it was, because each tended to hit the same types of mail.

Also, because I modified so many spam rules that were written by humans, I became very good at predicting what spam rules written by others would be effective and which ones were likely to cause false positives.  For example, consider the word "stocks".  A spammer could spell it like any of the following:

stock.s stock-s stock^s stock!s

It didn't take me long to figure out rules of thumb for writing rules on obfuscated phrases.  For example, I might be tempted to write a rule like the following:

\bstock\Ws\b

That's a regular expression where \W matches any single character that is not a letter, digit, or underscore.  The rule doesn't work well because it also matches the following legitimate word:

stock's
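
Here is a short Python sketch that makes the problem concrete.  The first pattern is the tempting rule from above.  The second is just one possible refinement, my own assumption for illustration and not a rule we actually deployed: it refuses to treat whitespace or an apostrophe (straight or curly) as the obfuscating character.

```python
import re

# The tempting rule: "stock", any one non-word character, then "s".
NAIVE_RULE = re.compile(r"\bstock\Ws\b")

# Hypothetical refinement (an assumption, not a production rule):
# the separator may be any non-word character except whitespace,
# a straight apostrophe, or a curly apostrophe (U+2019).
STRICTER_RULE = re.compile(r"\bstock[^\w\s'\u2019]s\b")


def hits(rule, text):
    """Return True if the rule matches anywhere in the text."""
    return rule.search(text) is not None


samples = ["stock.s", "stock-s", "stock^s", "stock!s", "stock's"]
for sample in samples:
    print(sample, "naive:", hits(NAIVE_RULE, sample),
          "stricter:", hits(STRICTER_RULE, sample))
```

The naive rule fires on all five samples, including the legitimate possessive.  The stricter one skips stock's but still catches the four obfuscated spellings.  It is not a complete answer, since a spammer could obfuscate with an apostrophe on purpose.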

There are a number of nuances like that when writing regular expressions to match words.  My experience with the predictive elements of false positives is something that stuck with me after I moved on from writing rules to Program Management.