Paul Graham has written another article about the work he is doing with his Bayesian spam filtering software these days. I recently sent him the following email describing a potential improvement to the software.

Paul,

I’ve really enjoyed reading the articles you have written on your Bayesian filter software. While I haven’t had a chance to use the software myself (I use the built-in Junk filter functionality of the Mail app on Mac OS X), I find the general concept and the ideas that you present very interesting.

Based on my own experience with Apple’s junk mail filter, I had an idea after reading your discussion of false positives. Rather than having just one threshold (ignoring the token count threshold) for determining whether the mail stays in the Inbox or is filtered to the Trash or Junk folder (I believe you mentioned you had a threshold of .9 in the article), a second-level threshold (say .8 or .85) that filtered mail into a “Potential Junk” folder could be a big time saver.

With Apple’s current implementation, I still need to search through the Junk folder to make sure a false positive didn’t end up there. This involves scanning hundreds of spam messages for that possible needle in the haystack. This can be a very difficult task, since a false positive would theoretically look very similar to the true positives (and the messages cannot be sorted by their Junk potential rating). With a “Potential Junk” folder, I could save a lot of time, since I would only need to skim through a considerably smaller number of messages, and those messages would be more likely to be non-Junk. I could then ignore the messages in the Junk folder, or just put off the task until I had more time to go through it.

Of course, determining an acceptable second-level threshold might require some experimentation, and there’s always a chance that a false positive will not even reach the lower threshold, but the decision to look at or ignore the messages in each folder would be up to the user.

The other benefit to this is that if you increase the upper threshold a little, you may reduce the number of false negatives in the Inbox. They would probably be more likely to be placed into the “Potential Junk” folder. Of course, a little experimentation would be needed here, too, to verify this claim.

I don’t think it would be a very difficult feature to implement… probably just another if statement within the filtering logic (the same behavior could be implemented by running the messages through the filter twice with different thresholds, but that wouldn’t be very efficient). The idea could even be expanded to use multiple thresholds to filter into even more folders (e.g. 0-10%, 10-20%, 20-30%, etc.), but I think that the three zones (Junk, Potential Junk, Not Junk) are a good trade-off between complexity, time saved, and usefulness. Rather than treating the issue as black and white, I think black and white with a little gray in between is more realistic and useful. The software doesn’t need to know all the answers. It just needs to know enough to make reading email tolerable.

Best Regards.

-brahm
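To make the "just another if statement" idea from the email concrete, here is a minimal sketch in Python. It is not taken from Paul Graham's filter or from Apple's Mail app. The function names, the folder names, and the choice of a strict "greater than" comparison are all assumptions made for illustration. The .9 and .8 values are the ones mentioned in the email, and the spam probability is assumed to have already been computed by the filter.

```python
from bisect import bisect_left

# Thresholds taken from the email; treat them as starting points to tune.
JUNK_THRESHOLD = 0.9
POTENTIAL_JUNK_THRESHOLD = 0.8


def classify(spam_probability, junk=JUNK_THRESHOLD,
             potential=POTENTIAL_JUNK_THRESHOLD):
    """Route a message into one of three folders (hypothetical names).

    Assumes spam_probability is a float in [0, 1] produced elsewhere,
    and that a message must score strictly above a threshold to cross it.
    """
    if spam_probability > junk:
        return "Junk"
    if spam_probability > potential:  # the one extra "if statement"
        return "Potential Junk"
    return "Inbox"


def classify_multi(spam_probability, thresholds, folders):
    """Generalization to any number of zones.

    thresholds must be sorted ascending, and folders needs exactly one
    more entry than thresholds (lowest-scoring folder first).
    """
    if len(folders) != len(thresholds) + 1:
        raise ValueError("need exactly one more folder than thresholds")
    # bisect_left counts the thresholds strictly below the score, which
    # matches the strict comparison used in classify().
    return folders[bisect_left(thresholds, spam_probability)]


if __name__ == "__main__":
    for score in (0.30, 0.85, 0.95):
        print(score, classify(score))
```

The message is scored once and then compared against each threshold, which avoids the less efficient approach of running it through the filter twice.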
Hello Brahm, I just skimmed the article. Have you ever heard of a program called spam-detective? It acts as a local POP3 server and modifies the subject line when it thinks it has detected spam. The formulas used to detect spam can be edited by the user, and the spam factor is displayed as a percentage. … needs M$-Windows
Jörg,

I had not heard of Spam Detective. Thanks for bringing it to my attention. In searching for more information, I found that it has been <a href="http://www.emtec.com/spamdetective/" rel="nofollow">bought</a> by a company called <a href="http://www.lyris.com/" rel="nofollow">Lyris</a> and renamed to <a href="http://www.lyris.com/products/mailshield/" rel="nofollow">MailShield Desktop</a>. I don’t run Windows on any of my machines at home, so I don’t think I will be able to use it.

I’ve also heard that the Perl-based <a href="Spam Assassin" rel="nofollow">SpamAssassin</a> is a very popular (and effective) open source spam filtering package.

I’ve been pretty happy with the filtering functionality that is built into the Mac OS X Mail application, so I haven’t had a need to look much further.