Words by c.z.robertson

Bayesian spam-filtering and pessimism

2002-12-12 13:15:19 UTC

Jeremy Bowers is very pessimistic about Bayesian spam-filtering techniques. I do have some sympathy with his position: while I think Bayesian filtering will be an improvement over rule-based filtering, I don't think it's going to magically solve the spam problem. However, I do think he's being overly pessimistic here.

"Probability-based classification is currently on the cutting edge of Artificial Intelligence. It is one of the best known techniques for this sort of classification. At the top end of AI, the various techniques all perform about the same, with limited strengths and weaknesses, and Bayes-type classification is the best available for text classification of this kind.

"If probability-based filtering fails, there is nowhere else to go in the realm of automated filtering. There is no next step in the automated-filtering arms race. This is it."

I accept that Bayesian filtering is the latest stage in an arms race, but I don't believe that it's the last stage. And I don't believe that the next stage needs to be anything approaching full human-level AI. In the same way that Google's PageRank system changed the way in which web searching is done, it's conceivable that new statistical techniques could be developed for spam-filtering. Now, I can't imagine what form they might take, but that's the great thing about innovation -- you can't imagine it beforehand.

Furthermore, there are other systems which I would call "automated" which don't operate on text in the way that rule-based and Bayesian systems do. Things like web-of-trust, for example. Jeremy Bowers is certainly aware of these things since he mentions them at the end of his post. If (or perhaps when) text-based filtering fails, we will still have these systems to fall back on.

"[Y]ou've effectively lost the ability to scan through your email and detect spam better then any filters, because all the obvious cues are gone (you know, like BUY EMAIL LISTS FROM A GUY WHO TYPES IN ALL CAPS!!!!!!!). ... Expect a ton more messages about 'your website' or 'Remember me from high school?' or any number of other mundane things. (Already this is getting to be a real problem, and the Bayes filters aren't even here yet!)"

I already get spam mails with subjects like "Sorry about everything.call me." or "have you heard?" or "Tomorrow". But it's still trivially easy to identify them as spam. As a rule of thumb, if it arrives in my inbox (as opposed to being filtered to one of the folders for the countless mailing lists I'm on) and I don't know who it's from or what it's about then it's spam. The only spam I've been fooled by in recent memory was the HugeCrush one, which was actually a rather interesting case.
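That rule of thumb is simple enough to write down. Here is a minimal sketch in Python, under some assumptions of mine: the `Message` type, the `list_id` field, and the `known_senders` and `known_topics` sets are all made up for illustration, and I'm reading "don't know who it's from or what it's about" as "recognise neither the sender nor the topic".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Message:
    # Hypothetical message type, for illustration only.
    sender: str
    subject: str
    list_id: Optional[str] = None  # set when the mail came via a mailing list


def looks_like_spam(msg: Message, known_senders: Set[str], known_topics: Set[str]) -> bool:
    """Rule of thumb: unfiltered mail from a stranger about nothing familiar is spam."""
    # Mailing-list traffic gets filed into its own folder and never reaches the inbox.
    if msg.list_id is not None:
        return False
    knows_sender = msg.sender.lower() in known_senders
    knows_topic = any(topic in msg.subject.lower() for topic in known_topics)
    # Assumption: a message is suspect only when neither the sender nor the topic is familiar.
    return not knows_sender and not knows_topic


if __name__ == "__main__":
    friends = {"friend@example.org"}
    topics = {"bayesian", "weblog"}
    print(looks_like_spam(Message("stranger@example.com", "Tomorrow"), friends, topics))
    print(looks_like_spam(Message("friend@example.org", "Tomorrow"), friends, topics))
```

In practice the "do I know what it's about" part is a judgement I make by eye rather than a substring match, so treat the topic check as a stand-in.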

"The new filters will fall down on spam crafted to get past them just as simple word filters have failed, repetition filters have failed, community filters have failed, and every other analysis technique has failed. Sadly, when those other techniques failed, the spam became easier to identify by visual inspection, with the random letters appended to the end and such. Bayes-type filters are the first I know of that will push the spam in the opposite direction, towards total stealth."

The funny thing is that in theory the rule-based systems should also have pushed spam towards being more stealthy, but that hasn't actually happened. I don't think it's that easy to predict how spammers will respond to Bayesian filtering. I don't actually think that Bayesian filtering is so good that anything that can get round it will look like non-spam to humans.
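To make that last point a little more concrete, here is a rough sketch of how a Bayes-type filter might score a message. It uses the usual naive Bayes combination, which treats the tokens as independent. The function name and the per-token probabilities in the usage example are invented for illustration, not taken from any real filter or corpus.

```python
from math import prod  # Python 3.8+
from typing import Sequence


def combined_spam_probability(token_probs: Sequence[float]) -> float:
    """Combine per-token spam probabilities, assuming the tokens are independent.

    Each value is an estimate of P(spam | token) and should lie strictly
    between 0 and 1, so that the denominator cannot be zero.
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Made-up probabilities: a shouty sales pitch versus a mundane-looking subject.
    print(combined_spam_probability([0.95, 0.90, 0.99]))
    print(combined_spam_probability([0.20, 0.30, 0.40]))
```

The filter only ever sees token statistics. A message built from mundane words can score low here and still be obviously spam to a person who asks who sent it and why, which is why I don't think getting past the filter is the same thing as fooling the reader.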