The Spam Filtering Plateau at 99.9% Accuracy, and How to Get Past It*

William S. Yerazunis, PhD

Mitsubishi Electric Research Laboratories

( MERL )

Cambridge, MA

wsy@merl.com

* Includes clarifications and examples expanded from the presentation given at the MIT Spam Conference 2004.

Spam Filtering State of the Art

Bayesian filters have become THE WAY.

There are more than a dozen available on Sourceforge alone.

Mozilla mail now includes a Bayesian option

SpamAssassin has an option to include a Bayesian heuristic.

The State of the Art, Part II

Most Bayesian filters report accuracy on the order of 99% to 99.9%.

But none of the filters report accuracy past this level. There's no Gaussian tail.

How to get past this plateau at 99.9% accuracy is the ultimate goal of this talk.

State of the Anti-Art

The typical Modern Spam Filter

Testing a Spam Filter

* this test set is a severe torture test. The author scores less than 90% accuracy on this test set.


What Training Method Works Best?

Training method (all using SBPH)      Error count (low is good)
TEFT (Train Every Thing)              149
TOE (Train Only Errors)                69
TUNE (Train Until No Errors)*          54

* Residual error due to cutting off training at 3 errors out of 41,470. Because TUNE requires keeping all prior emails as part of the retraining corpus, it becomes intractable for large installations.
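As a rough sketch of the three regimes in Python (the classify() and train() calls on the filter object are assumed for illustration, not CRM114's actual interface):

    def train_teft(spam_filter, corpus):
        """TEFT: Train Every Thing -- every message is trained, right or wrong."""
        for msg, true_label in corpus:
            spam_filter.classify(msg)           # result is ignored
            spam_filter.train(msg, true_label)

    def train_toe(spam_filter, corpus):
        """TOE: Train Only Errors -- train only when the filter gets it wrong."""
        for msg, true_label in corpus:
            if spam_filter.classify(msg) != true_label:
                spam_filter.train(msg, true_label)

    def train_tune(spam_filter, corpus, max_passes=10):
        """TUNE: Train Until No Errors -- repeat TOE passes over the *entire*
        retained corpus until a pass makes no errors (or we give up).
        This is why TUNE needs every prior email kept around."""
        for _ in range(max_passes):
            errors = 0
            for msg, true_label in corpus:
                if spam_filter.classify(msg) != true_label:
                    spam_filter.train(msg, true_label)
                    errors += 1
            if errors == 0:
                break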

Why is Bulk Training Suboptimal?

Is Forgetting Good?

Yes.

Features in a corpus can change polarity; forgetting old data allows the database to track the evolution of spam more accurately.

Is Forgetting Good?

Yes....

BUT

Forget as little as possible.

Don't groom all of the hapaxes out of the entire database. Instead, randomly delete only a few, and only as needed to make space for incoming features.

This yields a greater than 3x improvement in accuracy over block-purge or hapax-purge database cleaning.
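A minimal sketch of that eviction policy, assuming the feature database is simply a dictionary of feature counts (CRM114's actual on-disk format differs):

    import random

    def make_room(feature_counts, slots_needed):
        """Free just enough space for incoming features by randomly deleting
        a few hapaxes (features seen exactly once), rather than purging whole
        blocks or stripping every hapax at once."""
        hapaxes = [f for f, count in feature_counts.items() if count == 1]
        random.shuffle(hapaxes)
        for feature in hapaxes[:slots_needed]:
            del feature_counts[feature]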


What Eval Algorithm Works Best?

Method (TOE training)                 Error count (low is good)
First Order Bayesian*                  92
Peak Window Value Only (w=5)           80
Token Sequence Sensitive (w=5)         78
Token Grab Bag (w=5)                   71
Sparse Binary Polynomial Hash          69
Markovian with 2^(2n) weighting        56

(the Markovian is the winner among all single-pass techniques so far)

* using all features, not just the top 1000

How Good is Markovian Spam Filtering?

My current statistics with CRM114 using a 2^(2n) Markovian hover around 99.9%.

Four weeks (Dec 15 - Jan 12), raw scores:

Total Spam 4677

Total Nonspam 4385

Total Mail 9062

False Accepts 6

False Rejects 2

Human Can't Decide Either 3

N+1 Accuracy 99.90%

But last year you had 99.91% accuracy (N+1). What happened?

1) new error source: Penetration of Well-Credentialed Lists

2) nearly TRIPLE the rate of incoming spams:

last year: 1140 spams

this year: 4677 spams*

3) My upstream started discarding DNSRBL-flagged spam, so I lost a lot of low-hanging fruit.

* upstream DNSRBL discards at ~50% of all mail

Where did the Errors Happen?

False Accepts: 6 - 2 = 4

(2 spammers got onto previously well-credentialed lists)

False Rejects: 2 - 2 = 0

(2 users violated rules on said lists and were summarily bounced)

Note that it's almost impossible to tell the difference between the two cases!

Arguable N+1 Accuracy for Markovian Filter:

99.95%

How a Markovian is Different

1. A Markovian discriminator tries to match the incoming text against the hidden Markov models of the two text corpora.

2. We do NOT try to actually calculate that hidden Markov model (because of tractability issues).

3. The longer a chain we match (even a chain containing a few errors), the stronger the evidence for discrimination.

One reason why a Markovian is better

Consider the Perceptron Theorem*

A linear combinational decision algorithm can NOT discriminate the case:

A or B but not BOTH.

A cross-product decision algorithm has no such limitation.

* Minsky and Papert, Perceptrons, 1969
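As a concrete check of the claim, here is a tiny hand-weighted illustration: no linear threshold over A and B alone expresses "A or B but not BOTH", but one extra cross-product term A*B makes it expressible. The weights are chosen by hand, not learned.

    # Weights here are picked by hand to illustrate the point, not learned.

    def linear_only(a, b):
        # a pure linear rule over (a, b) also fires on (1, 1) -- not XOR
        return (a + b) > 0.5

    def with_cross_product(a, b):
        # adding the cross-product feature a*b makes XOR expressible
        return (a + b - 2 * a * b) > 0.5

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, with_cross_product(a, b))
    # True exactly for (0, 1) and (1, 0): "A or B but not BOTH"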

Handwaving Mathematics

If the weights of the Markovian terms are superincreasing (such as 2^(2n)), then long corpus chains can overrule single words and short chains.

This makes the Markovian filter equivalent to a cross-product decision algorithm, capable of nonlinear filtering without an intermediate layer of computed metafeatures.

How to Turn a Bayesian into a Markovian

1. Change the feature generator from single words to features spanning multiple words.*

2. Change the weighting so that longer features have more weight (i.e., longer features generate local probabilities closer to 0.0 and 1.0).

3. The 2^(2n) weighting means that the weights are 1, 4, 16, 64, 256, ... for span lengths of 1, 2, 3, 4, 5, ... words.

* Rohan Malkhare at USF has a very nice extension of this to a statistical model of an entire message; he has been advised to publish As Soon As Possible.

Markovian Example

Given the text:

The quick brown fox jumped ....

The Markovian features are:

Feature text              Weight
The                            1
The quick                      4
The <skip> brown               4
The quick brown               16
The <skip> <skip> fox          4
The quick <skip> fox          16
The <skip> brown fox          16
The quick brown fox           64

...and so on
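These features can be generated mechanically. The following Python sketch follows the convention of the example above (the first word of each window is always kept, each later position is either present or <skip>, and the weight is 2^(2n) for n matched words beyond the anchor); it illustrates the scheme and is not CRM114's internal implementation.

    from itertools import product

    def markovian_features(tokens, window=5):
        """Generate Markovian/SBPH-style features with 2^(2n) weights.
        Within each sliding window the first token is always kept; every
        later position is either the token itself or <skip>."""
        features = []
        for start in range(len(tokens)):
            win = tokens[start:start + window]
            anchor, rest = win[0], win[1:]
            for mask in product((0, 1), repeat=len(rest)):
                parts = [anchor] + [tok if keep else "<skip>"
                                    for tok, keep in zip(rest, mask)]
                while parts[-1] == "<skip>":   # trailing skips add nothing
                    parts.pop()
                weight = 4 ** sum(mask)        # 2^(2n), n = matched words past the anchor
                features.append((" ".join(parts), weight))
        return features

    # First window only; includes the table's eight features plus the
    # "jumped"-bearing ones hidden behind "...and so on":
    for text, w in markovian_features("The quick brown fox jumped".split())[:16]:
        print(f"{text:<30} {w}")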

How to Use the Weights

If your Bayesian local probability is:

    P_local = 0.5 + (good - bad) / (good + bad + 1)

then the equivalent Markovian local probability is:

    P_local = 0.5 + ((good - bad) * weight) / ((good + bad + 1) * weight_max)
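Transcribed directly into code, with good and bad being the per-class counts for a feature and weight_max the weight of the longest possible span (256 for a 5-word window):

    def bayesian_local_p(good, bad):
        """Bayesian local probability, exactly as written on the slide."""
        return 0.5 + (good - bad) / (good + bad + 1)

    def markovian_local_p(good, bad, weight, weight_max):
        """Markovian local probability: the same evidence term, scaled by
        weight / weight_max so only long spans can pull far from 0.5."""
        return 0.5 + ((good - bad) * weight) / ((good + bad + 1) * weight_max)

    # The same spammy evidence (seen once in spam, never in good mail):
    print(markovian_local_p(0, 1, weight=4,   weight_max=256))  # ~0.49: a two-word span barely moves
    print(markovian_local_p(0, 1, weight=256, weight_max=256))  # 0.0:   a full five-word chain is decisive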

But even a full Markovian is not enough

A Markovian filter makes fewer errors than a Bayesian filter by about the same margin as a light beer has fewer calories than a regular beer.

Preprocessing to Help Filtering?

Current State of Affairs

With all of these assists, the best we have done is 99.95%.

What's the NEXT step?

A Few Possibilities for the Future

Authenticated Senders

Defense in Depth (Multiple Layers)

One Man's Pain is Another Man's Pleasure

- Marquis de Sade


INOCULATION is a means of using the pain of one spam recipient to protect a large number of other recipients.
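The slides that follow give no protocol details, so the sketch below is only one illustrative reading of the idea; the peer list, the message format, and the train() call are all assumptions made for the example.

    import smtplib
    from email.message import EmailMessage

    PEERS = ["alice@example.org", "bob@example.org"]   # hypothetical inoculation ring

    def send_inoculation(spam_text, sender="me@example.org"):
        """Forward a hand-confirmed spam to peers so their filters can
        train on it before the same spam reaches them."""
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = ", ".join(PEERS)
        msg["Subject"] = "X-Inoculation: confirmed spam sample"
        msg.set_content(spam_text)
        with smtplib.SMTP("localhost") as server:      # assumes a local MTA
            server.send_message(msg)

    def receive_inoculation(spam_filter, inoculation_text):
        """Receiving side: train the forwarded sample as spam, sight unseen."""
        spam_filter.train(inoculation_text, "spam")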

Inoculation Basics

Inoculation Mechanics

Inoculation Results

The Email Minefield

Integrating Email Minefields

Integrating Email Minefields (2)

Minefield Results

How well do Email Minefields work?

- We don't know! We're still working through how well Inoculation itself works.

-- But we'll let you know....

Theoretically*, accuracy should improve linearly with the number of people you share inoculation data with (e.g. 10 people gives you 10x accuracy)

* But that's only theory.

Just-In-Time Filtering

Current email delivery systems filter upon arrival (so-called SMTP time).

This is suboptimal for systems with Inoculation or Minefielding.

Observation: some optimized spammers will hit every account on a small site in less than ten seconds.

This isn't enough time to allow an inoculation to propagate.
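One way to picture the alternative, as a sketch of the idea only and not any particular mail system's API: hold mail unclassified at delivery and classify it only when the user opens the mailbox, so that inoculations received in the meantime are already in the training database.

    def deliver(mailbox, raw_msg):
        """SMTP time: just store the message; no classification yet."""
        mailbox.append({"raw": raw_msg, "label": None})

    def open_mailbox(mailbox, spam_filter):
        """Read time: classify anything still unlabeled, just in time."""
        for item in mailbox:
            if item["label"] is None:
                item["label"] = spam_filter.classify(item["raw"])
        return [m["raw"] for m in mailbox if m["label"] != "spam"]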

Just-In-Time Filtering (2)

Conclusions:

UNPROVEN HYPOTHESIS

Bayesians/Markovians with hundreds of users sharing inoculations, with minefields, and with Just-In-Time filtering could reasonably get to five-nines (99.999%) accuracy, and possibly approach 99.9999% (one error per million emails) accuracy.

Thank you all!

Are there any questions?

Handy Web Sites:

http://www.camram.org

http://www.paulgraham.com

http://crm114.sourceforge.net

Summer (that means it's warm) Spam Conference:

http://www.ceas.cc