William Yerazunis1, Fidelis Assis2, Christian Siefkes3, Shalendra Chhabra1,4
1: Mitsubishi Electric Research Laboratories, Cambridge MA
2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin,
Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA
Abstract: We consider the well-known K-nearest-neighbor (KNN) classifier as a spam filter. We compare KNN-based classification in both equal-vote and decreasing-rank forms to a well-tested Markov Random Field (MRF) spam classifier. We also consider a variation of KNN classification based on a high-dimensioned feature radiation-propagation model (termed a “hyperspace” classifier), and compare the hyperspace classifier performance to the KNN and MRF classifiers for both accuracy and speed.
Spam classifiation continues to provide an interesting, if not vexing, field of research. This particular classifier problem is unique in the machine-learning field as most other problems in machine learning are not continuously made more difficult by intelligent and motivated minds.
A number of different approaches have been taken for spam filtering beyond the single-ended machine learning filter and a reasonable survey really requires a full book [Zdzairski 2005] ; in this paper we will restrict ourselves to post-SMTP acceptance filtering and the sorting of email into two classes: good and spam.
Even within post-acceptance, single-ended filtering, there are a number of techniques available. One of the most common is a Naive Bayesian filter (usually using a limiting window of the most significant N words). [Graham 2001]. Other common variations use chi-squared analysis, a Markov random field [Chhabra 2004, Bratko 2005], or even compressibility of the unknown text given basis vectors representing the good and spam classes [Willets 2003].
Although KNN filters [Fix and Hodges, 1951] [Cover and Hart, 1967] [ have been considered for spam classification in the past ( John Graham-Cumming’s early POPfile used a KNN) they have fallen into disfavor among most filter authors. We reconsider the use of KNNs for classification and attempt to quantize their qualities.
One disadvantage of standard Cover and Hart style KNNs is every known input is added to the stored data; this can cause very long compute times. To mitigate this, we have used selectively trained KNNs; rather than adding every known text immediately to the stored data, we incrementally test each known text and only add the known text to the stored data if the known text was judged incorrectly. This speeds up filter classification tremendously.
Those familiar with Cover and Hart’s limit theorem should realize that this modification results in a KNN that the Cover and Hart limit theorem does not necessarily apply to. We will consider extension of the Cover and Hart theorem to cover incremental trained KNNs in future work.
We used several different filtering algorithms in our tests. These are KNN filters with neighborhood sizes of 3, 7, and 21, and we compare this the Markov Random Field (MRF) filter and with a new filter variation based on luminance in a high-dimensional space, called the “hyperspace” filter. We will discuss the algorithms in detail below.
All of these filters are constructed within the framework of CRM114 with only minor tweaks to the
source code. CRM114 is GPLed open source software and can be freely downloaded from:
so the reader should feel free to examine the actual algorithms.
All of the filters tested were configured to use the same set of features extracted from the text of the spam and good email messages. This is the OSB feature set described in [Siefkes 2004], and experimentally verified to be of high quality as compared to other filter feature sets [Assis 2005].
To summarize how these features are generated, an initial regex is run repeatedly against the text to obtain words (that is, the POSIX-format regex [[:graph:]]+ ). This provides a stream of variable-length words. This stream of words is then subjected to a combination operator such that each additional word in the series provides four combination output strings. These output strings are then hashed to provide a stream of unsigned 32-bit features. These features are not truly unique, but for our purposes they are “unique enough”. All three filters were given identical streams of 32-bit features.
The string combination operator that generates the 32-bit tokens can best be described as repeated skipping and concatenation. Each word in the stream is sequentially considered the “current” word. The “current” word is concatenated with the first following word to form the first combination output string. The same “current” word is then concatenated with a “skip” placeholder followed by the 2nd following word to form the second combination output string. The current word is then concatenated with a “skip skip” placeholder followed by the 3rd following word to form the third combination output string. Then the current word is concatenated with a “skip skip skip” placeholder followed by the 4th following word to form the fourth combination output string. Finally, the “current” word is discarded and the next word in the input stream becomes “current”. This process repeats until the input stream is exhausted.
As an example, let’s use “For example, let’s look at this sentence.” as a sample input stream. The input stream would be broken into the word stream:
That word stream would produce the following concatenated strings which would then be hashed to 32-bit features for input to each of the classifiers:
For skip let’s
For skip skip look
For skip skip skip at
example, skip look
example, skip skip at
example, skip skip skip this
let’s skip at
let’s skip skip this
let’s skip skip skip sentence
at which point the input stream is exhausted. In the actual system, “null” placeholders are used to both pre-fill and post-drain the pipe, so all words are equally counted.
This method of feature extraction and token generation has been shown to be both efficient and capable of producing classifier accuracies far superior to single word-at-a-time tokenization [Chhabra 2004]. In all tests described in this document, we used the “unique” option, so that only the first occurrence of any token was considered significant.
The KNN filter used was configured in several different ways. All of the pure KNN configurations were based on an N-dimensional Hamming distance metric; that is, the presence of the same feature in both a known and an unknown document is meaningless; rather the number of differences (specifically, the feature hashes found in one document but not the other) determine distance. Within the three neighborhood sizes of 3, 7, and 21 members, we tested two different configurations – equal-weight (the standard Cover and Hart model), and a declining-weight model based on distance.
In the equal-weight configurations, the set of the K closest matches to the unknown are considered as “votes” for the unknown text; each vote is cast in favor of the unknown text being a member of that example’s class and with each member of the K closest matches getting an equal vote. In the distance-based weighting, the weight of each vote was the reciprocal of the Euclidean distance (defined as the square root of the Hamming distance) from the unknown text to each known text. In both configurations, the votes are summed and the unknown is assigned the class of the majority vote.
The MRF filter configuration was the CRM114 Markov random field classifier using the OSB feature set and otherwise unmodified. This is the same classifier as described in [Assis 2005] and was not modified for these tests. In this classifier design, the clique size of the Markov random field is the same as the length of the feature generator – in the current implementation, this is 5.
The hyperspace filter is based on the KNN filter concept, but sets K = | corpus | - that is, the neighborhood is the entire corpus, and does not use an equal-weight votesystem. In the limit of an equal-weight voting, this causes all unknown texts to evaluate identically to the Bayesian prior (that is, the probability of an unknown text being a member of a class is proportional to the membership counts for the known membership of the class). Since all members of the learned corpus are “members” of the neighborhood, there is no question determining the size of the neighborhood needed for best filtering.
We tested several weighting variants for this hyperspace filter, among them proportional to the similarity between two texts (that is, the number of shared features), dissimilarity (that is, the Euclidean distance between texts mapped into a hyperspace where each feature is it’s own dimension), and variants and mixtures of these weights.
All of the filter configurations were tested against the SA2 corpus as distributed by Gordon Cormack [Cormack 2004]. The SA2 corpus can be freely downloaded from:
This test set is also used in the TREC 2005 conference testing for filters. However, comparisons between these test results and those published in TREC are only somewhat appropriate; note that the published TREC test set used filters that were trained in a single pass only; here we will do multipass training to observe asymptotic behavior so we should expect that multipass accuracy should exceed the TREC single pass accuracy.
This test set contains approximately 4150 good emails and 1900 spams; this asymmetry of content is significant in that the test itself must accomodate unequal class sizes in the training corpus. In preliminary testing, we considered first an alternating scheme (training first one good email, then one spam email) and ceasing training when one class became exhausted. We also tested a training scheme where we “reloaded” an empty training class whenever all members were trained, and a third training scheme where the choice of the next text to be trained was controlled pseudorandomly and so both classes would be exhausted simultaneously. None of these variants made any significant difference in base testing, so for simplicity we will consider only the pseudorandom sequencing as this most closely mimics actual spam filter usage.
Each of the subject filters was tested with 10-fold validation – that is, the filter was trained on 90% of the SA2 corpus, with 10% held back, then the remaining 10% was tested to give an accuracy result. As training was not permitted on the final 10%, we then repeated the training phase. We allowed five passes of such training for each filter, to observe any convergence to an asymtotic behavior. In the one case where convergence did not seem to occur, we ran an additional 5 passes and found that in fact convergence had occurred.
After these five passes to converge, the statistical memory is cleared, and the process repeated with a different 10% selected to be held back from training and to be used for testing. The result is a total of ten different training corpora are tested, and each text in the corpus is actually trained nine times and used as a test case exactly once.
The training procedure for each filter is a SSTTTR (Single-Sided Thick Threshold Train and Retest) training process. Each text is sequentially tested to verify that it is correctly classified. If the text is classified correctly by at least the thickness of the threshold, no further training is needed on that text. Otherwise, the text is trained into the correct class, and the test classification repeated. As before, if the text is not classified correctly by at least the thickness of the threshold, the text is trained into the correct class and execution for that particular text ends. Except where noted, all of the results presented here us a very thin “thick” threshold of 0.001 weighting units; this threshold is well below the resolution limit of 1 part in 42 possible in a K=21 KNN classifier (the K=21 KNN classifier is a voting classifier with 21 votes, since votes are quantized, a K=21 classifier cannot represent any finer gradation of certainty than half a vote, which is 1 part in 42). The reason for this very thin but nonzero threshold is so none of the classifiers will “stick at zero” in the symmetrical situation. We have tested thicker and thinner thresholds, and conclude that this thickness is trivial and the results are statistically similar to true zero-threshold training..
The reader will note that this is a somewhat different training regimen than is typically used for KNN classifiers; many KNN classifiers simply “dump” all known texts into the class database and classify based on that large database. We found that the better-behaved learning KNNs learned about ¼ of the documents (thus gaining considerable speed); the weaker-learning KNNs essentially learned the entire corpus (sometimes learning a particular document more than once!) and arrived at the simple KNN full-corpus database behavior eventually.
Overall testing was performed on a Dell 700m 1.6 Ghz subcompact laptop running Fedora Core 4 updated to the January 15 releases, and CRM114 version 20060209-ReaverSecondBreakfast (TRE 0.7.2 (GPL)). This machine was equipped with only 512 megabytes of memory and a slow (laptop) disk, so the processing speeds given below will almost certainly be at slow end of the scale for modern servers.
Our testing gave the surprising result that the KNN equal-vote filter was very weak in performance, for all tested neighborhood sizes ( 3, 7, and 21 members) . The learning curves showed that for neighborhood sizes of 3 and 7 neighbors respectively, that learning did not really occur past the second and third passes; the overall accuracy was no better than 93% (stdev 1.55 and 0.85% respectively) with a distinct split in accuracy - 98.6% for good email and 79.8 for spam.
With a neighborhood of 21 members, accuracy continued to improve past the 5th training pass and the KNN accuracy moved to 94% overall (standard deviation %1.06%). Again, there was a distinct split in accuracy of 96.1% for good email versus 90.9% for spam. The initial score for all classifiers (one pass rather than the full five passes) was abysmal, often at less than 90% overall. All three of these KNNs added most of the input corpora to their database, yielding slow runs (47 to 72 milliseconds per classification)
We now consider non-equal-weight KNN voting schemes. The simplest of these uses the Euclidean distance (the square root of the Hamming distance) as a measure of closeness; as we wish to weight votes according to closeness, the actual per document weight isfor each of the K documents within the neighborhood. This gives an improvement for K=7 and 21 to 94% and 92% accuracy overall respectively (standard deviations of 1.05% and 1.93%) and a symmetry split of 85% and 79% for spam to 98% and 98% for good email. This decrease in accuracy with increasing size of neighborhood is slightly paradoxical but again, the error bars on these results overlap so no real conclusion should be drawn.
Figure 2: Learning curves for KNNs with neighborhoods of 7 and 21 members with a per-document weight of the reciprocal of the Euclidean distance
We now consider non-equal-weight KNN classifier systems with- that is, where all known documents are considered to be “in the neighborhood” and the weighting function determines the significance of the document. This provided significantly improved accuracy and a much better asymtotic behavior. Because the filter accuracy increases, the learning time decreases, and the known-text database shrinks; the overall filter not only becomes more accurate, but also faster.
Here, we weight each known document with 1 / Euclidean distance between the known and unknown documents. The final behavior for this configuration is 97.8% overall accuracy (0.76% standard deviation, with good mail at 98.2% accuracy with spam at 96.9% and using 32 miliseconds per text. This result is much closer to what we expect out of a modern spam filter and in line with the results reported in TREC 2005 for the middle-group spam filters.
Fig 3: KNN with full-corpus neighborhoods, and voting weights of the reciprocal of the Euclidean distance
Another weighting for full-corpus KNN classification is to weight proportionally to the similarity between the unknown text and each known text example, and the improvement is immediately visible. This results in a final 5-pass accuracy of 99.24%, with a standard deviation of 0.4% (good mail accuracy of 99.4% and spam accuracy of 98.84%) at 44 milliseconds/text. Because the system did not seem to have reached an asymptote at 5 passes, we continued this experiment out to 10 passes, yielding only the statistically insignificant change to 99.27% overall accuracy (0.49% standard deviation) with good email at 99.49% and spam at 98.79% accuracy. Note the scale of this graph (and all subsequent graphs) is changed from the previous scaling; the Y axis has changed from 70% accuracy to 95% accuracy.
Note that a 99.27% accurate system is in line with the better TREC evaluations, however this system was trained with 10 full passes, while the TREC evaluations were trained on only a single pass.
Fig 4: Full-corpus KNN classification with similarity vote weighting. This plot shows the full run of 10 passes as the filter seemed not to have reached asymptotic performance at five passes, however there is no statistical improvement between 5 and 10. Note the Y-axis scale change; full height is now 95% to 100% accuracy.
We now consider a new variation on the K=|corpus| classification, where the per-vote weights contain not only a distance term but also a term related to the feature-wise similarity of the documents. Because these weighting models are mathematically similar to radiative energy transfer in high-dimensional spaces, we will call them hyperspace models for lack of a better term. Other than that change, they are very similar to other K=|corpus| KNN classifiers.
The actual radiative transfer from each known text in hyperspace to the unknown text is a property of the similarity of the known and unknown documents (as before, this is the count of unique features found in both documents). These models all use a weighting of the form:
where d is an independent parameter determining how important the document similarity is (document.
The first three terms of this series of weights (for d = 1, 2, and 3) all produce amazingly similar results.
Testing with d = 1, 2, 3 yield final overall accuracies of 99.34, 99.27, and 99.26% respectively (with standard deviations of 0.5, 0.3, and 0.5% respectively), with good-mail accuracies of 99.64, 99.66, and 99.59% accuracy, and spam mail accuracies of 98.68, 98.42, and 98.52% respectively. It cannot be determined which d-value is better with this experiment, because all three d-values yield results with substantially the same error bars. These are also among the fastest KNN classifiers tested here, with per-message processing times of 32, 22, and 22 milliseconds respectively.
Fig 5 – KNN Hyperspace weighting, for d=1, 2, and 3 respectively. Vertical scale is again from 95% to 100% accuracy
We now compare a KNN filter against a Markov OSB filter with the same training regimen and the same data. We performed two experiments- one in thin threshold training (that is, identical to the best KNN test training as above) and another with true thick threshold training, corresponding to 10 pR units. Because pR relates to a probability uncorrected for the naive Bayes assumptions, it is impossible to give an exact translation to probability, but empirically, scores within +/- 10 pR units of zero correspond very roughly the bracket between 10% certainty and 90% certainty. It is known that thin-threshold training is suboptimal for Markov OSB filters but thin-threshold training of a Markov OSB classifier is the “apples to apples” comparison.
The results are shown below. Note that the thin-threshold Markov OSB performance asymptote is 99.11% (stdev 0.26%) with good email accuracy of 99.64 and spam at 97.94 and 31 milliseconds/message, and as such is probably inferior to the best hyperspace and KNN classifiers.
However, thick-threshold Markov OSB, trained with a thickness of 10 pR units, yields the best-so-far accuracy of 99.52% overall (standard deviation of 0.28%) with good email accuracy of 99.64% and spam accuracy of 99.26%. This was also one of the fastest classifiers, spending just 19 milliseconds on each of text.
Fig 6 – Markov OSB classification used in the same training harness and test set as the KNN and hyperspace filters previously tested. Left: thin-threshold of 0.001 pR units; Right: thick-threshold trained at 10.0 pR units. Note that with a thick threshold, Markov OSB learned very rapidly.
Although small-neighborhood K-nearest-neighbor classifiers are not well suited for spam classification, KNNs using the entire corpus as the neighborhood and a distance weighting for the voting summation deliver good results. The hyperspatial variant using a similarity term based on shared document features as well as a distance term based on the document Hamming distance can yield asymptotic accuracies comparable to the best Naive Bayesian and Markov Random Field classifiers, and at speeds equal to or faster than those classifiers operated in thin-threshold configurations. For a thick-threshold configuration, a Bayesian filter is probably superior to KNN or Hyperspace filters.
K-nearest-neighbor classification itself with hyperspatial weighting may be useful for spam filtering, as the KNN algorithm is not dependent on linear separability of classes as Naive Bayesian classification is.
It is unlikely that KNN-based filters would be useful in single-pass TOE systems, because of their weaker error rate in single-pass learning. However, their great speed may compensate for the learning rate in single-filter-many-users deployments.
Assis, F., Yerazunis, W., Siefkes, C., Chhabra, S., 2005, “CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track”. In Text REtrieval Conference (TREC), 2005
Bratko, A. and Filipic. B, 2005, “Spam Filtering using Compression Models”, NIST TREC 2005, downloadable from http://ai.ijs.si/andrej/papers/ijs_dp9227.html
Chhabra, S., Yerazunis, W,S., Siefkes, 2004 “Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas”, ICDM ‘04, http://www.siefkes.net/papers/mrf-spamfiltering.pdf
Cormack, G. and Lynam, T., “Spam Corpus Creation for TREC”, in Text REtrieval Conference (TREC) 2005
Cover, T.M. and Hart, P.E. , 1967: “Nearest Neighbor Pattern Classification”, IEEE Transactions on Information Theory, 13:21-27, 1967
Fix, E., Hodges, J.L. 1951,” Discriminatory analysis, nonparametric discrimination consistency properties”, Technical Report 4, United States Air Force, Randolph Field, TX.
Graham, P., 2002, “A Plan For Spam”, http://www.paulgraham.com/spam.html
Siefkes, C, Assis, F., Chhabra, S., Yerazunis, W. 2004 “Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering”. In Proceedings of the 15th European Conference on Machine Learning and 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2004), Lecture Notes in Computer Science. Springer, 2004.
Willets, K, 2003, “Spam Filtering with gzip”, http://www.kuro5hin.org/story/2003/1/25/224415/367
Zdziarski, J. 2005 Ending Spam ISBN: 1593270526 No Starch Press