Sparse Binary Polynomial Hashing and the CRM114 Discriminator

William S. Yerazunis, PhD. [1][2]
Mitsubishi Electric Research Laboratories
Cambridge, MA 02139 USA

Version 2003-01-20

Abstract: Bayesian statistical mail filtering to sort incoming messages into valid and invalid sets is rapidly becoming the method of choice for single-ended antispam filters. In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words. As implemented in the GPLware "CRM114 Discriminator", and combined with supervised learning, SBPH can deliver >99.9% accuracy on real-time email without whitelists or blacklists, from as little as 500K of pre-categorised example text. A short explanation of the CRM114 Discriminator language will be included, as well as some limiting factors to accuracy.

Introduction

Spam is a significant threat to the continued usefulness of email as a dependable communications medium. Because spam (sometimes termed “unsolicited commercial email” or “marketing messages”) is neither expected nor desired, the natural human response is to delete it unread. Unfortunately, the rapid deletion of email without careful reading will often result in the accidental destruction of some fraction of the desired (and important!) messages (a “spam-spasm”).

Spam sometimes contains material that is inappropriate for the receiving environment, such as blatantly sexual messages in a professional workplace (where gender harassment rules may be in place). Other spams may violate other workplace regulations, such as advertisements for investments or gambling.

Almost all of the cost of a spam is incurred by the recipient. At 2 seconds of human time per message to delete spam almost unread, and paying the US minimum wage of $5.15 an hour, a million spams cost their recipients about $2,800 collectively (one million messages at 2 seconds each is roughly 556 hours, and 556 hours at $5.15 per hour is about $2,860). The actual cost is much higher, since most email recipients make far more than the Federally mandated minimum wage.

Lastly, spam has nonzero IT costs: bandwidth to move it, processing power to manipulate it, and storage space to store it (and archive it). These costs are nontrivial for many users and organizations, especially large ISPs, and can be considered a form of denial-of-service (DoS) attack against the computational resources of an organization.

Previous Methods of Spam Suppression

Previous spam suppression methods include blacklists of known spammers and known spam phrases, blacklists of known spam messages (usually in the form of hashed signatures of entire documents), keyword searches, heuristic spam feature matching, and confirmed-originator systems (smart whitelists).

Blacklists of known spam origination sites have grown from small hand-maintained lists to large TCP-accessible databases such as ORBS and SPEWS, including some subscription-based services. These blacklists use the “big hammer” paradigm: in order to force an ISP to cut off a spammer's access, the entire ISP's address space is blacklisted, thereby destroying the ability of all of that ISP's users to send email. While this blackmail is often effective, some of these lists have been accused of politically motivated censorship, especially the silencing of critics of the blacklists' own management.

Blacklists of documents, rather than of spam originators, have been built by creating databases of the hashed values of entire spam messages. Since only the hashed value needs to be transmitted, a database of a million spam signatures can easily be downloaded on a nightly basis. However, any change to the message that alters its signature defeats this filtering method, so almost all spammers now include a short string of random characters to force a different digital signature onto each piece of spam email.

Keyword searches are another form of blacklist, where a set of words is defined as “spam markers” and searched for in each incoming message. Any message containing a target keyword is deleted, whether or not it is genuinely spam. Spammers can defeat keyword searches by inserting strategically placed blanks, misspellings, or HTML comments into blacklisted keyword strings.

Heuristic matchers such as SpamAssassin [3] rely on a human expert to come up with a set of features, such as particular strings or other conditions (for example, presence in a database). Each feature carries a different weight, either human-assigned or generated by some algorithm such as a neural net; if an incoming message's total exceeds some threshold, it is rejected as spam, otherwise it is presumed good. The feature set is still the product of a human expert examining spam and nonspam messages for interesting discriminative properties, so only a relatively small number of features (in the low hundreds) is ever examined.

The reverse of these blacklist-based systems are the “whitelist”-based systems, such as TMDA [4]. In these systems, all mail is presumed to be spam unless it originates from a known-good sender or contains a known-good passphrase. Advanced whitelisting systems will send a challenge to a sender whenever that sender is not known-good; if the sender replies appropriately, the mail filter adds the sender to the known-good list.

The newest entries on the scene are the statistical filters [5] such as Bogofilter [6], spamfilt, et al. These filters use occurrences of single words as the “features” of interest. In this context, CRM114 can be viewed as a derivative of these Bayesian filters with a more powerful feature extractor: Sparse Binary Polynomial Hashing.

The Mathematics of Sparse Binary Polynomial Hashing

Sparse binary polynomial hashing (SBPH) is a way of automatically generating a large number of features from an incoming text, and then using statistics to determine the weight of each of those features in terms of its predictive value for spam/nonspam evaluation. The features themselves are the results of evaluating all sparse binary polynomials up to a particular order N over a sliding window of word hashes.

We have investigated the following implementation of SBPH: we maintain a sliding window of the most recent N words. For computational efficiency, each word is hashed to a 32-bit value, and these hashes are then used as the w_n values in the following polynomials:

Z_j = \left( \sum_{n=0}^{N-1} c_{j,n} \, w_n \right) \bmod 2^{32}

The coefficients c_j,n for the n'th window element in the j'th polynomial of order N are generated by the following rule, with j running from 1 to 2^(N-1) and (j-1) taken as a bitmap selecting which of the older window positions participate (the sum above runs over n = 0 through N-1, matching the table below):

c_j,n = 1, if n = 0 (the anchor word is always included; j doesn't matter in the n = 0 case)

c_j,n = (1 << n) | 1 = 2^n + 1, if n > 0 and bit (n-1) of (j-1) is set

c_j,n = 0 otherwise

After all of the Z_j values for a particular window position are calculated, we shift the word hash values w_n by one position and so avoid repeating the computationally expensive string hash.

Currently, the sliding window length is N = 5 words, giving 2^(N-1) = 16 polynomials, so the coefficients c_j,n are (zero coefficients shown as dots):

        j |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  --------+------------------------------------------------
    c_j,0 |  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    c_j,1 |  .  3  .  3  .  3  .  3  .  3  .  3  .  3  .  3
    c_j,2 |  .  .  5  5  .  .  5  5  .  .  5  5  .  .  5  5
    c_j,3 |  .  .  .  .  9  9  9  9  .  .  .  .  9  9  9  9
    c_j,4 |  .  .  .  .  .  .  .  . 17 17 17 17 17 17 17 17

The coefficients and generating formula were chosen for the following reasons: each word hash value w_n should affect all 32 bits of the output Z_j (and so every coefficient must have a 1 in the low-order bit position); the effect of the same word hash value w in different window positions should be different (and so the coefficients must differ, and be nonzero in at least one bit other than the low-order bit position); and the entire calculation must be very easy to compute (as it will be performed many times across the input text).

These coefficients are by no means the only possible coefficients of use; we have also tested prime-number coefficients and found no performance difference. The only reason we use these sparse binary polynomial coefficients is that it is easy to write generalized code to calculate the sparse binary polynomial hash for arbitrary N, and thereby to experiment easily with different values of N.
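
By way of illustration, here is a minimal C sketch of the feature generator described above. The string hash strhash() is a stand-in assumption (it is not CRM114's actual hash), and the placement of the newest word in slot 0 of the window is likewise an assumption of this sketch; the coefficient rule and window length follow the table.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define WINDOW 5                     /* N: window length in words     */
    #define NPOLY  (1 << (WINDOW - 1))   /* 16 polynomials for N = 5      */

    /* Coefficient rule, matching the table above: c[j][0] is always 1,
       and for n > 0 the n'th older word participates (with coefficient
       2^n + 1) exactly when bit (n-1) of (j-1) is set; otherwise zero.   */
    static void make_coeffs(uint32_t c[NPOLY][WINDOW])
    {
        for (int j = 1; j <= NPOLY; j++) {
            c[j-1][0] = 1;
            for (int n = 1; n < WINDOW; n++)
                c[j-1][n] = ((j - 1) & (1u << (n - 1))) ? ((1u << n) | 1u) : 0;
        }
    }

    /* Stand-in 32-bit string hash -- an assumption, not CRM114's own.   */
    static uint32_t strhash(const char *s)
    {
        uint32_t h = 5381;
        while (*s) h = h * 33u + (uint8_t)*s++;
        return h;
    }

    /* Slide the window across the word stream, emitting NPOLY features
       (Z values) per position.  Each word is hashed exactly once; the
       stored hashes are shifted, as described in the text.  Unfilled
       slots at startup are zero.                                        */
    static void sbph(const char *words[], int nwords, void (*emit)(uint32_t))
    {
        uint32_t c[NPOLY][WINDOW], w[WINDOW] = {0};
        make_coeffs(c);
        for (int i = 0; i < nwords; i++) {
            memmove(&w[1], &w[0], (WINDOW - 1) * sizeof w[0]);
            w[0] = strhash(words[i]);          /* newest word in slot 0  */
            for (int j = 0; j < NPOLY; j++) {
                uint32_t z = 0;
                for (int n = 0; n < WINDOW; n++)
                    z += c[j][n] * w[n];       /* mod 2^32 is implicit   */
                emit(z);
            }
        }
    }

    static void print_z(uint32_t z) { printf("%08x\n", z); }

    int main(void)
    {
        const char *text[] = { "this", "is", "a", "test", "message" };
        sbph(text, 5, print_z);
        return 0;
    }

Each window position thus yields 16 features: the anchor word alone, plus every combination of the four older words, with each position weighted so that the same word hash contributes differently in different slots.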

Once this stream of Z-values is created, we can now either LEARN the stream's statistics, or CLASSIFY the stream according to previous statistics.

When LEARNing, we accumulate the number of times each Z value is seen in a pair of bucket files, one for spam and one for nonspam. Rather than keep two very large files (each 2^32 counters long), we take the 32-bit Z value and fold it modulo the length of the bucket file. By default, the bucket files are one megabyte plus one byte long (again, this is to force every bit in the Z value to have some effect on the output; a prime-number length for the bucket files would be equally effective). In the current implementation, the buckets are unsigned byte values, and care is taken during accumulation to stop incrementing each bucket at the value 255 and not allow wraparound.
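
In code, the LEARN-side update is nearly a one-liner. This sketch assumes the bucket file has already been loaded or mapped into memory as a byte array; the file handling itself is elided:

    #include <stdint.h>

    #define NBUCKETS (1024*1024 + 1)    /* "one megabyte plus one byte"  */

    /* Fold a 32-bit feature value Z onto the bucket table and count it. */
    static void learn_feature(uint8_t buckets[], uint32_t z)
    {
        uint32_t slot = z % NBUCKETS;
        if (buckets[slot] < 255)        /* saturate; never wrap to 0     */
            buckets[slot]++;
    }

One such table is kept per class; learning a text as spam simply runs every Z value of its SBPH stream through the spam table.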

When CLASSIFYing, the relative counts in the spam and nonspam bucket files are used to calculate a rough conditional probability that the input text is representative of one bucket file or the other. The Bayesian Chain Rule is used to calculate an overall class membership probability.

The Bayesian Chain Rule

The result of the SBPH algorithm is a series of approximate conditional probabilities describing the hypothesized class membership of an incoming text. The Bayesian Chain Rule provides a way to combine the individual feature conditional probabilities into an overall probability.

The classical Bayesian Chain Rule (BCR) for class C and feature F is:

P(C|F) = P(F|C) P(C) / [ P(F|C) P(C) + P(F|~C) P(~C) ]

The numerator is a classic conditional probability multiplication, and the denominator is the renormalization factor for a two-class system. Unfortunately, the Bayesian Chain Rule is accurate only when the different features F are statistically independent of each other. In the context of spam filtering, this is most emphatically not the case, so the BCR results are not mathematically correct. Fortunately, a message tagged as “spam” almost certainly is spam; the inaccuracy is a matter of magnitude, not of class membership.

A computational implementation issue is that numerical underflow is easy to reach in an SBPH/BCR evaluator on IEEE-compliant floating-point hardware. Once the probability comes closer than about 10^-20 to 1.0, it becomes numerically indistinguishable from exactly 1.0. To avoid this, our implementation independently calculates and maintains both P(C|F) and P(~C|F) for each feature, and recalculates the larger probability from the smaller. This avoids the significance underflow issue.
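
The sketch below shows one way to implement the chained update while carrying both sides, in the spirit of the scheme just described. The per-feature estimator local_p_spam() (raw bucket counts with +1 smoothing) is an assumption of this sketch; the text above specifies only that relative bucket counts yield a rough conditional probability.

    #include <stdio.h>

    /* Rough per-feature probability that a feature seen ns times in the
       spam buckets and ng times in the good buckets indicates spam.
       The +1 smoothing constant is an assumption, not CRM114's rule.    */
    static double local_p_spam(unsigned ns, unsigned ng)
    {
        return (ns + 1.0) / (ns + ng + 2.0);
    }

    /* One Bayesian Chain Rule step.  Both P(spam|evidence so far) and
       P(good|evidence so far) are carried and renormalized together, so
       the side near 1.0 is always derived from the small side and its
       significance is not lost to floating-point rounding near 1.0.     */
    static void bcr_step(double *p_spam, double *p_good,
                         unsigned ns, unsigned ng)
    {
        double pf_s  = local_p_spam(ns, ng);   /* ~ P(F|spam) */
        double pf_g  = 1.0 - pf_s;             /* ~ P(F|good) */
        double num_s = pf_s * *p_spam;
        double num_g = pf_g * *p_good;
        double denom = num_s + num_g;
        *p_spam = num_s / denom;
        *p_good = num_g / denom;
    }

    int main(void)
    {
        double p_spam = 0.5, p_good = 0.5;     /* uninformative prior    */
        /* Hypothetical (ns, ng) bucket counts for three features:       */
        unsigned counts[][2] = { {40, 2}, {17, 3}, {1, 30} };
        for (int i = 0; i < 3; i++)
            bcr_step(&p_spam, &p_good, counts[i][0], counts[i][1]);
        printf("P(spam) = %g   P(good) = %g\n", p_spam, p_good);
        return 0;
    }

Because p_good is maintained as its own small number rather than computed as 1.0 minus a value that has already rounded to 1.0, a verdict such as P(spam) = 1 - 10^-40 remains representable.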

A second underflow issue occurs at roughly 10^-318, where the total dynamic range of IEEE floating point is exhausted; numbers smaller than this cannot be numerically distinguished from zero. Fortunately, when this form of underflow occurs, it is almost always the case that the correct category for the incoming text has already been determined.

The CRM114 Language

The CRM114 language was created specifically to write generic filters. The only data structure supported in CRM114 is the named string, which may exist in isolation or may partially or completely overlap other named strings. CRM114 does not directly support numeric or structured data. The basic operations of CRM114 are regular-expression (regex) and SBPH/BCR matching on strings. Programs are block-structured, with an arbitrary level of nesting; variables are declared, initialized, and bound automatically as encountered, with global scope. Programs are monolithic, without locally scoped variables. There are no subroutines, but a powerful process-forking facility replaces the functionality of subroutines and functions.

Instead of a positional or keyword-based syntax, the CRM114 parser is declensional: the role of a particular piece of text is determined directly by the kind of quotation marks surrounding it. This is a significant aid to writing “quickie” filters, and also works nicely as a programming aid. Unquoted text is a command; words surrounded by colons like :this: are variable names; words in '< >' brackets are modification flags; variable names in '( )' are variables to be used (or affected) directly; variable names in '[ ]' restrict the active area of the command; and anything in '/ /' is a regex. Beyond having the command word first, the ordering of these quoted strings within a statement does not matter.

CRM114 is a language expressly designed to create filters. By default, a CRM114 program reads its entire input stream until EOF and places it into a single variable before program execution begins. Variables are overlapping substrings within this large buffer; subsequent variables are created either as start/length offsets within this buffer, or as separate, isolated variables.
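
CRM114's overlapping named strings can be pictured as (start, length) views into that one shared buffer. The following C sketch is only an illustration of the idea; the struct layout and variable names are assumptions, not CRM114's internals (though :_dw: is CRM114's name for the default data window):

    #include <stdio.h>
    #include <string.h>

    /* A named variable is just a window onto the shared input buffer.   */
    struct crmvar {
        const char *name;
        size_t start, len;        /* offsets into the data window        */
    };

    int main(void)
    {
        /* The whole input is read into one buffer before execution.     */
        char data[] = "Subject: hello world";
        struct crmvar whole   = { ":_dw:",     0, strlen(data) };
        struct crmvar subject = { ":subject:", 9, 11 };  /* overlaps     */

        printf("%s = \"%.*s\"\n", whole.name,
               (int)whole.len, data + whole.start);
        printf("%s = \"%.*s\"\n", subject.name,
               (int)subject.len, data + subject.start);
        return 0;
    }

Under this picture, a nonsurgical alteration merely moves a variable's (start, length) pair, while a surgical alteration edits the underlying buffer and therefore shows through every overlapping variable.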

Data flow within a CRM114 filtering program is handled in one of five ways:

Alterations to a variable may be either nonsurgical (matching, and thereby binding, the variable to a new part of a preexisting string) or surgical (actually deleting bytes from, and inserting bytes into, the variable's string). Surgical alter operations affect the variable specified and also any overlapping variables. CRM114 also provides a hash function, surgically replacing a variable's string space with the hexadecimal text representing a 32-bit hash of any string.

The syscall statement allows an arbitrary command-line program to be invoked as a fork()ed process, to be fed an arbitrary string on stdin, and to have its output captured from stdout. The invoked minion can either be run to completion before the syscall returns, or allowed to continue as a separate process, with unbounded communication and asynchronous execution.

The window statement is used to step the input buffer through appropriately sized chunks of an indefinitely long input stream such as a syslog: it deletes one regex's worth of text from the front of the input buffer, then appends one regex's worth of text read from stdin to the end of the input buffer. The input and output statements do file I/O, and the accept statement copies the current input buffer to standard output.

The Mailfilter Program

Mailfilter.crm is a program, written in the CRM114 language, designed specifically to process RFC 2822 messages, that is, common Internet SMTP mail in text-file format. Mailfilter can be used with almost any mail transfer agent and mail reading agent; the only requirement is that the user be able to send themselves an email containing text to be recategorized along with an authentication password. Mailfilter executes three phases in sequence:

  1. Is this an email with the proper command/password set? If so, execute the command function specified (this is how a user can remotely alter the whitelists and blacklists, and teach the filter new forms of spam). If not, then...

  2. Is this an email that matches an entry in a priority action list, a whitelist, or a blacklist? If so, route the text content appropriately. If not, then...

  3. Classify the content using the SBPH/BCR algorithm and the previously generated bucket files, and route the text based on the results of that classification (the overall dispatch is sketched in code below).
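
In outline, the three-phase dispatch looks like the following C sketch. The function names and stub bodies here are illustrative assumptions; the real logic lives in mailfilter.crm, in CRM114, not in C:

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative stubs for the three phases described above.          */
    static bool has_valid_command(const char *msg) { (void)msg; return false; }
    static void run_command(const char *msg)       { (void)msg; }
    static int  list_action(const char *msg)       { (void)msg; return 0; }
    static void route_by_list(const char *msg, int a) { (void)msg; (void)a; }
    static double classify_sbph_bcr(const char *msg) { (void)msg; return 0.01; }

    static void mailfilter(const char *msg)
    {
        int action;
        if (has_valid_command(msg)) {            /* phase 1: commands    */
            run_command(msg);
        } else if ((action = list_action(msg)) != 0) {
            route_by_list(msg, action);          /* phase 2: lists       */
        } else {                                 /* phase 3: classifier  */
            double p_spam = classify_sbph_bcr(msg);
            /* 0.5 decision threshold: an assumption for this sketch.    */
            puts(p_spam > 0.5 ? "spam" : "good");
        }
    }

    int main(void)
    {
        mailfilter("From: someone@example.com\n\nHello there.\n");
        return 0;
    }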

Mailfilter.crm can either be run standalone, or altered to run as a subsidiary filter under procmail. In either case, it can be a very effective filter. In the author's testing, after training, the mailfilter.crm program exceeded 99.9% accuracy.

Specifically, for the period of November 1 to December 1, 2002, a pre-trained mailfilter.crm (as distributed in CRM114 version 2002-11-26) processed the author's live incoming mail stream, a total of 5849 messages (1935 spam, 3914 nonspam), with only 4 false acceptances, 0 false rejections, and 2 messages considered “not humanly classifiable”. This was based solely on the SBPH/BCR classifier, without the aid of whitelists or blacklists. Processing speed was approximately 20 kilobytes per second on a Transmeta 666 MHz processor; memory required was under 5 megabytes. There were approximately 400 Kbytes of training text in each of the spam and nonspam corpora; each corpus generated approximately 400K features. For comparison, the human author's measured accuracy as an antispam filter is only 99.84% on the first pass. The ROC (Receiver Operating Characteristic) curve for the system is essentially a single point very close to [0,0], as far more than 99% of the specimen messages had probabilities either less than .000000001 or greater than .9999999999, and there are no 'magic tunable parameters' in the algorithm.

The author is currently engaged in a repeat experiment, retraining the SBPH/BCR filter from an empty state, to obtain empirical measurements of the algorithm's rate of learning.

Targeting the Spammer's Business Model

In terms of actually solving the spam problem, the real problem is the spammer's business model. Spamming is simply a very cheap way of getting advertising leads, because almost all of the cost of getting the message to the target is paid for by the target itself. In order to stop spam, we have to render the spammer's business model untenable.

The typical commission for a spam mailing of one million spams is between $250 and $500, and such a mailing gets about a 1-per-thousand response rate. Compared to bulk paper mail, this is very inexpensive: paper mail gets about a 1-per-20 response rate, but costs 25 cents per message. This makes spam about 200 times cheaper than paper bulk mail (to the sender, that is, as almost all of the costs of spam are incurred by the involuntary recipient/victim).

In order to tip the costs of spamming away from the “recipient pays” model, we either need double-ended protocol changes (that is, Internet “postage”), or we need single-ended filters that can force the spammer's price per response up from $250/1,000 (or 25 cents per response) past the paper-mail cost of $0.25/0.05, or $5.00 per response. Since the spammer's cost per response scales as $0.25 divided by the fraction of spam that gets through, a filter need only block 95% of spam to reach that break-even point; at 99.5% accuracy the cost per response rises to $50, and at the SBPH/BCR error rates reported above it rises past $250 per respondent.

Note that SBPH/BCR is not the only filtering technology that can reach these accuracy levels; it is likely that many methods can. Development and deployment of multiple methods should be encouraged, as differing filter methods will have different weaknesses and avoiding a monoculture of spam filters is a good way to guard against catastrophic breakthroughs.

Limits to Accuracy of Spam Filtering

In terms of accuracy limits, there is a second factor beyond filter accuracy itself. Study of a typical spam stream over a long time period demonstrates that spam is not an ergodic source: both the spam topics and the means spammers use to deliver those topics evolve over time. Any static filter will eventually begin to fail as the spam stream diverges more and more from the training corpus.

We offer the following as a very rough empirical estimate of the rate at which new spams (or spam delivery methods) will appear to an observer:

new spams seen (total over timespan) = 0.001 x (spams per day) x (timespan in days)

or, more concretely, for 100 spams per day, expect one to three completely new spams per month. This sets a lower bound on the achievable error rate of a single-user spam filter; cooperative filters that share spam information across multiple readers may achieve lower per-user error rates, but only because of the vaccination effect: the first user to see a particular spam teaches the filter about that spam, thereby protecting all subsequent users.

Conclusions, Concerns, and Future Work

First, the good news. Good spam-filtering technology (SBPH/BCR or others) can provide a nearly spam-free email experience, that is, one in which a spam arrives sufficiently rarely that one can't accurately remember when the previous one occurred. This completely invalidates the business model that supports spamming, which in turn makes even weak filters work better.

Second, the bad news. SBPH/BCR may be computationally too expensive for widespread deployment at cash-conscious ISPs, so faster algorithms may be necessary. Further, spam filtering is merely one use of powerful text classification; it is also possible that such powerful methods may become tools for censorship or other socially undesirable ends.

In the future, we intend to explore the following topics:

  1. speeding up SBPH/BCR

  2. improving the probability estimate function's behavior, especially for features with low levels of evidential significance.

  3. considering better chain-evaluators than the naïve Bayesian Chain Rule. In particular, the assumption of independence of features in the naïve Bayesian Chain Rule may be the cause of some inaccuracy; a chi-squared estimator may be more appropriate.

  4. improving the functionality and usability of the CRM114 programming language, including 8-bit-safe modes and approximate matching (by integrating one of the available GPLed approximate-regex engines).

[1] Correspondence may be addressed to the author at wsy@merl.com; include the word 'CRM114' in electronic correspondence so that it will be automatically whitelisted. Written correspondence should be addressed to the author, care of Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139.

[2] Except for live-email testing, this work was carried out independently of the author's affiliation; the author thanks Mitsubishi Electric Research Laboratories for their forbearance while the author was messing with the mail servers during the statistics-gathering phase of this work.

This document, like the software described herein, is copylefted under the appropriate Free Software Foundation license, in this case the GNU Free Documentation License.

[3] Justin Mason, et al., 2002, SpamAssassin, www.spamassassin.org

[4] Jason Mastaler, TMDA, www.tmda.net

[5] Paul Graham, 2002, 'A Plan For Spam'

[6] Eric S. Raymond, 2002, 'Bogofilter', www.tuxedo.org/~esr/bogofilter/bogofilter.html