Features in Bayesian spam filtering
2009-04-17 05:01:24 GMT/BST
I've recently switched from using CRM114 as my spam filter to SpamAssassin. I wanted to take advantage of systems like Razor and Pyzor, and I wanted to apply some whitelisting. On the other hand, I'm inclined to think that a Bayesian classifier approach to content filtering makes a lot more sense than SpamAssassin's collection of weighted tests. I don't know how the weights of SpamAssassin's tests were calculated. They are stated very precisely, but I wonder how accurate they are. Furthermore, you would expect the right weights to vary from person to person, since everyone's mail is different.
So here's what I was thinking: the weights of the tests should be calculated in a Bayesian way. Run each test over the email, and if it triggers, add it as a feature for Bayesian consideration. Currently all the Bayesian spam filters that I'm aware of only use the words in the email (or tokens derived from words) as the features they consider. But I'd like to know how much spamminess is implied by an email coming from a machine with no reverse DNS. SpamAssassin gives that a score of 0.1 (where a total score of 5 marks a message as spam, by default), so it has some sort of implicit notion of probability, but I'd like to see that probability calculated from my own mail using Bayesian techniques.
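To make the idea concrete, here is a minimal sketch in Python. It is a simplified naive Bayes classifier with add-one smoothing, not the algorithm used by CRM114, SpamAssassin or any other real filter. The email representation (a dict with "body" and "relay_rdns" keys), the no_reverse_dns test and the RULE:NO_RDNS token name are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Hypothetical test: fires when the sending machine had no reverse DNS.
# Assumes an email is a dict with "body" and "relay_rdns" keys.
def no_reverse_dns(email):
    return email.get("relay_rdns") is None

# Each test that fires contributes one extra token alongside the words.
RULES = {"RULE:NO_RDNS": no_reverse_dns}

def features(email):
    """Return the set of word tokens plus a token for each test that fires."""
    tokens = set(email["body"].lower().split())
    tokens.update(name for name, rule in RULES.items() if rule(email))
    return tokens

class NaiveBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, email, label):
        """Record one email whose label is "spam" or "ham"."""
        self.totals[label] += 1
        self.counts[label].update(features(email))

    def _likelihood(self, token, label):
        # Fraction of emails with this label that contain the token,
        # with add-one smoothing so unseen tokens never give zero.
        return (self.counts[label][token] + 1) / (self.totals[label] + 2)

    def feature_spam_prob(self, token):
        """P(spam | token), assuming spam and ham are equally likely a priori."""
        p_spam = self._likelihood(token, "spam")
        p_ham = self._likelihood(token, "ham")
        return p_spam / (p_spam + p_ham)

    def spam_prob(self, email):
        """Combine the evidence from every feature present in the email."""
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for token in features(email):
            log_odds += math.log(
                self._likelihood(token, "spam") / self._likelihood(token, "ham")
            )
        return 1 / (1 + math.exp(-log_odds))
```

After training on your own mail, feature_spam_prob("RULE:NO_RDNS") would give a learned, per-user answer to the question above, in place of a fixed 0.1.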
Funnily, while writing this post I realised that I'm not the first person to come up with this idea:
"Specific spam features (e.g. not seeing the recipient's address in the to: field) do of course have value in recognizing spam. They can be considered in this algorithm by treating them as virtual words. I'll probably do this in future versions, at least for a handful of the most egregious spam indicators. Feature-recognizing spam filters are right in many details; what they lack is an overall discipline for combining evidence."
That's from Paul Graham's "A Plan for Spam", the essay that brought the world's attention to Bayesian spam filtering in the first place. If anyone knows of a system that actually does this, please let me know.