[This how-to is a somewhat edited version of "how to choose the (initial) thresholds?"]

How do you set the thresholds for the unsure gap between spam and good?

Good point. A question asked more than once on the ML.

By trial and error

Good answer, eh ;)

Well, that's more or less how thresholds in mail*.crm scripts have been set. You set up your classifier, let it go, check the pR you get, and tune the knobs till all seems reasonable to your needs/taste/feeling.

That's likely going to stay that way; nevertheless you may want some guideline on how to get started at all, so here is what might be a reasonable (beyond wild guess, that is) way to set the initial threshold values for the chosen classifier.

Here are the steps:

0. you have known_spam, known_good msg collection, in some format suitable   
   for training scripts of your choice.

1. choose a classifier and suitable params *but set threshold(s) to 0* 
   (but if you use mailtrainer.crm, you must set a value > 0, eg 0.0001);
   build the CSSs good.css, spam.css by TUNEing over known_spam, known_good
   (Train Until No Errors).

   (see the note on TUNE further down)

2. Now classify with same params as in point 1, with a (shell) script like
   (assuming eg. that known_* are in maildir format):

#for f in known_spam/cur/1* known_good/cur/1*;do
#  [preprocessor] < "$f" | \
#  crm -e -w MAXW '-{
#    isolate (:S:) //
#    classify <...> (good.css spam.css) (:S:)
#    match (:: :b: :s:) [:S:] /#0.+pR:  (.+)\n#1.+pR:  ([^\n]+)/
#    output /:*:b: :*:s:\n/}'
#done | tee pR_hyp.log

   where [preprocessor] stands for anything used in the training chain (eg
   normalizemime), but might be just "head -c DL" where DL=:decision_length:
   in mailfilter.cf, and MAXW >= 2*DL, and replace <...> with whatever you
   used in step 1.
   Of course, if you use any of the fancy stuff in mail*.crm like MIME
   decoding, the above script won't work: you need to use mailreaver.crm
   with the same mailfilter.cf used in training (mailreaver.crm needs an
   easy hack to output the pRs as above: "pR_good  pR_spam").
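
   For the plain "head -c DL" case, the preprocessor and the window size can
   be kept in one place. This is only a rough sketch: DL=15000 is the
   decision length quoted for the live-mailbox results further down, and
   prep_msg is a made-up helper name, not something shipped with CRM114.

```shell
DL=15000              # decision length in bytes (:decision_length:)
MAXW=$((2 * DL))      # value for crm -w, following the MAXW >= 2*DL rule

# prep_msg FILE: write at most DL bytes of FILE to stdout
# (hypothetical stand-in for [preprocessor] in the loop above).
prep_msg() {
  head -c "$DL" < "$1"
}
```

   In the loop of step 2, 'prep_msg "$f"' would then take the place of
   '[preprocessor] < "$f"', and "$MAXW" the place of MAXW.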

3. Now pR_hyp.log holds the "pR_good  pR_spam" pairs (indeed, just the 1st
   value is needed) of your training collection.
   The smallest pR >= 0 and the largest pR < 0 in pR_hyp.log are the
   thresholds that would yield 0 unsures and full separation, if the msg
   stream were just (like) the training set.
   So at this point you might want to check, and get a rough estimate of the
   unsure % you'd get with different thresholds and offset, with eg this
   script (let's name it thresholds.awk):

#BEGIN {
#  N=Ng=Ns=Nu=Nug=Nus=0
#  m=-100000; M=100000
#  ts=ARGV[1]; of=ARGV[2]; tg=ARGV[3]
#  ARGC=1   # keep only ARGV[0], so awk reads stdin instead of opening the args
#  if(!ts) ts=0
#  if(!of) of=0
#  if(!tg) tg=0
#}
#{
#  p=$1
#  if(p>=0) { if(p<M) M=p }else{ if(p>m) m=p }
#  if(p>=of) { if(p>=tg+of){ Ng++ }else{ Nug++ } }
#  if(p<of) { if(p<=ts+of){ Ns++ }else{ Nus++ } }
#  N++
#}
#END {
#  Nu=Nug+Nus
#  print "ts="ts+of" of="of" tg="tg+of"  M="M" m="m
#  print "N="N" Ng="Ng" Ns="Ns"  Nu="Nu" Nug="Nug" Nus="Nus
#  printf "N=%d  Ng=%2.2f%% Ns=%2.2f%% Nu=%2.2f%%\n",\
#    N,Ng/N*100,Ns/N*100,Nu/N*100
#}

   saved as thresholds.awk and run as

#awk -f thresholds.awk -- pR_spam_th offset pR_good_th < pR_hyp.log

   (or directly as "thresholds.awk -- ..." if the file is executable and
   starts with a "#!" line that invokes awk -f).

   which would print

     N  : total msgs
     Ns : sure spam
     Ng : sure good
     Nug: unsure good
     Nus: unsure spam
     Nu : Nug + Nus
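     M  : smallest pR >= 0 found, ie the largest pR_good_th with no unsure good
     m  : largest pR < 0 found, ie the lowest pR_spam_th with no unsure spam
     ts, of, tg : the given pR_spam_th, offset, pR_good_th
                  (ts and tg are printed with the offset already added)
  Of course, if pR_spam_th=m, offset=0, pR_good_th=M -> Nu=0 by definition
  (if the training set was separable in feature space).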
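
  If only the two limits are of interest, they can be pulled out of
  pR_hyp.log on their own. A minimal sketch, assuming the first column holds
  pR_good as in step 2; pr_limits is a made-up helper name, not part of
  CRM114.

```shell
# pr_limits LOGFILE
# Prints "m=<largest pR below 0> M=<smallest pR at or above 0>" for the
# first column of LOGFILE; "none" if one side has no entries.
pr_limits() {
  awk '
    NF == 0 { next }                 # skip blank lines
    { p = $1 + 0 }                   # force a numeric value
    p >= 0 { if (!seenM || p < M) { M = p; seenM = 1 } }
    p <  0 { if (!seenm || p > m) { m = p; seenm = 1 } }
    END { printf "m=%s M=%s\n", (seenm ? m : "none"), (seenM ? M : "none") }
  ' "$1"
}
```

  On the ten-line pR_hyp.log excerpt in this how-to, "pr_limits pR_hyp.log"
  prints "m=-0.6 M=1.54".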


% cat pR_hyp.log (from step 2):
-0.93   0.93 
-0.86   0.86 
-0.95   0.95 
-0.95   0.95 
-1.02   1.02 
-0.60   0.60 
-0.68   0.68 
 1.57  -1.57 
-0.97   0.97 
 1.54  -1.54

then in step 3:

% thresholds.awk -- -0.2 0 0.2 < pR_hyp.log
ts=-0.2 of=0 tg=0.2  M=0.02 m=-0.08
N=2782 Ng=341 Ns=2427  Nu=14 Nug=4 Nus=10
N=2782  Ng=12.26% Ns=87.24% Nu=0.50%

which shows that, to get no unsures at all on this set, I should use very
small and asymmetric thresholds (-0.08, 0.02), while with a -0.2..0.2
'bandgap' I can expect ~0.5% of unsures, unbalanced toward maybe-spam
(Nus=10 vs Nug=4). That's good, as I'd rather get an 'unsure' than lose a
'good'.
But I'm looking for a more balanced situation, so I can try an offset>0:

% thresholds.awk -- -0.2 0.1 0.2 < pR_hyp.log
ts=-0.1 of=0.1 tg=0.3  M=0.02 m=-0.08
N=2782 Ng=338 Ns=2436  Nu=8 Nug=6 Nus=2
N=2782  Ng=12.15% Ns=87.56% Nu=0.29%

which seems to yield roughly half as many 'unsures' (8 instead of 14) with
the same bandgap width.

Note that the offset here isn't OSBF's 'internal' offset: it's applied to the
raw pR from CLASSIFY with offset=0, hence it's applicable to all classifiers.
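
Trying several offsets by hand gets tedious, so the same counting rule can be
looped over a list of candidates. A sketch only, reusing the sure/unsure tests
of thresholds.awk; scan_offsets and its argument order are made up for this
illustration.

```shell
# scan_offsets LOGFILE TS TG OFFSET...
# For each OFFSET, count the unsures in column 1 of LOGFILE using the
# thresholds TS+OFFSET (spam side) and TG+OFFSET (good side).
scan_offsets() {
  logfile=$1; ts=$2; tg=$3
  shift 3
  for of in "$@"; do
    awk -v ts="$ts" -v tg="$tg" -v of="$of" '
      NF == 0 { next }               # skip blank lines
      {
        p = $1 + 0
        if (p >= of) { if (p < tg + of) Nug++ }   # good side, inside the gap
        else         { if (p > ts + of) Nus++ }   # spam side, inside the gap
      }
      END { printf "of=%s Nu=%d Nug=%d Nus=%d\n", of, Nug + Nus, Nug, Nus }
    ' "$logfile"
  done
}
```

For example, "scan_offsets pR_hyp.log -0.2 0.2 0 0.05 0.1" prints one line per
offset, and the offset with the smallest or best balanced Nu can then be fed
back into thresholds.awk for the full report.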

Note on TUNE

More generally, if thresholds != 0 are set while TUNEing and TUNE succeeds,
the filter will have a nominal (ie on the training corpus) capability of
separating msgs into the classes by at least the preset 'bandgap'.
M, m will be a bit outside the gap, and in 'production' you'd choose
thresholds within the training bandgap limits (ie nominally 0 unsures).
If the learning algo for unsures is TUNE as well, it's worth keeping the
bandgap for learning larger (like in step 1 above) than for classify.

Of course there are tradeoffs: the larger the bandgap in step 1, hence in 
classify, ...
 - the smaller the chance of unsures
 - but also the longer the training phase
 - and the CSSs quickly either get fairly large (hyperspace,...) or get more
   packed (osb*, ...), slowing down the filter, while taking up more system

Here are some results on a live mailbox. The symbols mean:
N=msg count, Nc=msgs classified, NS=spam msgs, NB=good msgs,
NxS=unsure spam, NxB=unsure good.
Nc<=N since msgs of size >200k are not classified.
Decision length is 15000 bytes.

*classifier, bandgap, #buckets, learned good/spam, pack.ratio good/spam
*osb, -3 0 3, 216071, 25/82, 0.18/0.32
CSS pre-TUNEd from 1907 ham+spam with -3 0 3 bandgap
N=2785, Nc=2743, NS=2417, NB=323, NxS=2, NxB=1
NxS/Nc=.07% NxB/Nc=.03%

real    1m32.692s
user    1m9.220s
sys     0m20.900s

*osbf, -2 0 2, 99277, 21/70, 0.33/0.52
CSS pre-TUNEd from 1907 ham+spam with -2 0 2 bandgap
N=2785, Nc=2743, NS=2417, NB=324, NxS=2, NxB=0
NxS/Nc=.07% NxB/Nc=0%

real    1m31.543s
user    1m10.200s
sys     0m18.840s

*hyperspace, -0.05 0 0.02, 22/21, Average f.count: 97080/78924
                                  size: 97168/79008 bytes
CSS pre-TUNEd from 1907 ham+spam with 0 0 0 bandgap
N=2787, Nc=2745, NS=2421, NB=324, NxS=0, NxB=0
NxS/Nc=0% NxB/Nc=0%

real    1m39.114s
user    1m10.620s
sys     0m16.310s

Note the tiny size of hyperspace's CSS. pre-TUNEing with -0.2 0 0.2 
bandgap bumps sizes up to 234424/208048, with +2x learned msgs.
thresholds-how_l_-to.txt · Last modified: Y/m/d H:i:s O (T) by oopla