CRM114 - the Controllable Regex Mutilator
 =================================================================
20070406-BlameSpamConf
 Latest release in src/ and CVS - mostly bugfixes again - some details will be 
 added later on ...
20070217-BlameBaltar
 This is a bugfix version.
 Fixed:
mail*.crm: :datadir: is now gone.
mailreaver.crm: you can now use start/len on INPUT from stdin.
Arithmetic now respects >=, <=, and !=.
Hyperspace has been de-normalized.
Entropic classification is now somewhat smarter.
SYSCALL had gotten stepped on (mostly fixed now).
A bad initialization in the entropic classifier.
Incorrect result-variable parsing in TRANSLATE.

 check src/README for details
   _________________________________________________________________
20061103-BlameDalkey
 Fixed:
      BSD compiles
      Missing mailfilter backslash
      Default doesn't DEFAULT
      BENs don't scale right
      Hyperspace and entropy have weird thick thresholds
      Entropy uses too much disk space
  Not fixed:
      Weird results in OSB multiclass (any clues???)
This version is yet more bugfixes.  It has Paolo's "BSD doesn't
have logl(), so do this instead" bugfix, as well as the entropy
sixfold-segfault bugfix, the missing mailfilter.cf backslash bugfix,
and the DEFAULT-doesn't-default bugfix, and the bit-entropy FIR-prior
threshold is now automagically scaled to work decently for
different-sized .ben files.  The .BEN structure has been cleaned up
(it uses less disk space at no accuracy cost!), and both Hyperspace
and bit-entropy pR values have been rescaled so that a pR within +10
to -10 gives good results for thick-threshold training (so now you
don't have to change thresholds when you change classifiers).  The
only remaining hitch is that multi-class OSB sometimes gives
weirdish results.
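 The thick-threshold rule can be sketched in Python (a minimal illustration
 with hypothetical helper names; the real logic lives inside CRM114's
 classifiers):

```python
def needs_training(pr, text_is_good, thick_threshold=10.0):
    # Positive pR means the classifier judged the text "good",
    # negative pR means "spam".
    misclassified = (pr > 0) != text_is_good
    # Train on errors, AND on correct-but-unsure results inside
    # the thick band of +/- thick_threshold pR units.
    return misclassified or abs(pr) < thick_threshold

# A spam scoring pR = -4 is correct but inside the band: train anyway.
assert needs_training(-4.0, text_is_good=False)
# A good message at pR = +25 is correct and confident: skip training.
assert not needs_training(25.0, text_is_good=True)
```

 Because the band is the same +/-10 pR for every classifier after the
 rescaling, the same training script works unchanged across classifiers.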
   _________________________________________________________________
20061010-BlameBratko
 New bugfix-only release (with the :decision_length: fix) and
 a few other tweaks - an almost 'secret', likely intermediate release.
   _________________________________________________________________
20060926-BlameNico
 Bug fixes (a segfault in crm_trigger_fault(); cleaned up warnings in
 crm_bit_entropy.c)
   _________________________________________________________________
20060920-BlameNico
 ( Bug fixes, new bit-entropic classifier. )
 This release has two purposes:
 1) it gets rid of two nasty typos - the one in Hyperspace, and
 the one in mailfilterconfig.cf; it also adds --static-libgcc
 to the default link.
 2) It's an initial stake-in-the-ground for the bit-entropic classifier.
 Bit-entropy classification is slower and uses more memory, but
 seems to be even more accurate.  Read the README and see if you
 want to play with bit-entropy; it's highly experimental, will probably
 change formats, and should NOT be used for production.  But
 on the other hand, it's REALLY accurate in long runs.  The actual
 design is in the first 500 lines or so of the new file crm_bit_entropy.c.
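 For a feel of what entropy-based classification means, here is a toy
 Python sketch. This is NOT the CRM114 bit-entropy algorithm (which works
 bit-serially on a Markov model; see crm_bit_entropy.c) - just the general
 idea that the class whose model "compresses" a text into the fewest bits
 wins:

```python
import math
from collections import Counter

class EntropyModel:
    """Toy character-level model: a text is scored by the number of
    bits needed to encode it under the class's character statistics."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def learn(self, text):
        self.counts.update(text)
        self.total += len(text)

    def bits(self, text):
        # Cross-entropy with add-one smoothing over 256 byte values.
        return sum(-math.log2((self.counts[c] + 1) / (self.total + 256))
                   for c in text)

def classify(text, models):
    # The class whose model encodes the text in the fewest bits wins.
    return min(models, key=lambda name: models[name].bits(text))

spam, good = EntropyModel(), EntropyModel()
spam.learn("buy cheap pills now cheap cheap")
good.learn("meeting notes attached, see agenda")
best = classify("cheap pills", {"spam": spam, "good": good})
```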
   _________________________________________________________________
20060704a-BlameRobert
 ( Bugfixes, and mailreaver is now recommended! )
 
 Both of these kits include the new TRE 0.7.4 library with the \E patch,
 either as source or statically linked into the executable binary.
 This is a big new-functionality release. Mailreaver (the second-generation
 filter) is now the recommended filter, using mailtrainer.crm to do all
 training and with the reaver caching enabled (yes, this means that
 mailfilter.crm is now slightly 'deprecated'; migrate to mailreaver to get
 better accuracy and easier use, as mailreaver's cache is a big improvement).
 The default classifier has changed from Markovian to OSB, using DSTTTTR. This
 new mailtrainer program is fed directories of example texts (one example per
 file), and produces optimized statistics files matched to your particular
 mailfilter.cf setup (each 1 meg of example takes about a minute of CPU). It
 even does N-fold validation. Default training is 5-pass DSTTTTR (a
 Fidelis-inspired improvement of TUNE) with a thick threshold of 5.0 pR
 units. Worst-offender DSTTTTR training is available as a (very slow) option.
 There are also speedups and bugfixes throughout the code. Unless you really
 like Markovian, now is a good time to think about saving your old .css files
 and switching over. Then run mailtrainer.crm on your saved spam and good
 mail files, and see how your accuracy jumps. I'm seeing about a four-fold
 increase in accuracy on the TREC SA corpus; this is hot stuff indeed.
 Note that this is an "a" respin - there was a bug in the QA process that let
 nonfunctioning code out. Sorry.
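 The N-fold validation that mailtrainer does can be illustrated with a
 generic k-fold split (the helper names here are illustrative, not
 mailtrainer.crm internals):

```python
def n_fold_splits(examples, n=10):
    """Split a list of example files into n (train, test) pairs:
    each fold is held out once for testing while the rest train."""
    folds = [examples[i::n] for i in range(n)]   # round-robin split
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

examples = [f"msg{k}.txt" for k in range(10)]
splits = list(n_fold_splits(examples, n=5))
assert len(splits) == 5
# Every example is held out for testing exactly once across the folds.
tested = [x for _, test in splits for x in test]
assert sorted(tested) == sorted(examples)
```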
   _________________________________________________________________
20060118-BlameTheReavers
 ( Bugfixes and NEW mailtrainer.crm )
 Both of these kits include the TRE 0.7.2 library with the \E patch, either
 as source or statically linked into the executable binary.
 This is a big new-functionality release - we include mailtrainer.crm as well
 as changing the default mailfilter.crm from Markovian to OSB. This new
 mailtrainer program is fed directories of example texts (one example per
 file), and produces optimized statistics files matched to your particular
 mailfilter.cf setup (each 1 meg of example takes about a minute of CPU). It
 even does N-fold validation. Default training is 5-pass DSTTTTR (a
 Fidelis-inspired improvement of TUNE) with a thick threshold of 5.0 pR
 units. Worst-offender DSTTTTR training is available as a (very slow)
 option. There are
 also speedups and bugfixes throughout the code. Unless you really like
 Markovian, now is a good time to think about saving your old .css files and
 switching over to the new default mailfilter.crm config that uses OSB unique
 microgroom. Then run mailtrainer.crm on your saved spam and good mail files,
 and see how your accuracy jumps. I'm seeing about a four-fold increase in
 accuracy on the TREC SA corpus; this is hot stuff indeed.
 HOWEVER, the downside is that mailtrainer.crm expects to see your spam and
 good training data files in a maildir-like format (one dir for spam, the
 other for good); this isn't directly supported by mailfilter.crm yet, so
 unless your mailer supports maildirs, you will need to write a little script
 to build your training data directories.
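 That "little script" might look like this minimal Python sketch. It assumes
 your mail lives in two mbox files (the names spam.mbox and good.mbox are
 just examples); it explodes each into the one-message-per-file directory
 layout mailtrainer.crm wants:

```python
import mailbox
import os

def mbox_to_dir(mbox_path, out_dir):
    """Explode an mbox file into one-message-per-file, the
    maildir-like layout mailtrainer.crm expects."""
    os.makedirs(out_dir, exist_ok=True)
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        # Write each message (headers + body) to its own file.
        with open(os.path.join(out_dir, f"msg{i:06d}.txt"), "wb") as f:
            f.write(msg.as_bytes())

mbox_to_dir("spam.mbox", "train/spam")
mbox_to_dir("good.mbox", "train/good")
```

 After that, point mailtrainer.crm at train/spam and train/good.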
   _________________________________________________________________
20050910-BlameToyotomi
 ("So, everyone commits suicide in this story TOO?")
 Both of these kits include the TRE 0.7.2 library with the \E patch, either
 as source or statically linked into the executable binary.
 This is mostly a documentation/bugfix release. New features: the Big Book
 is now very close to "final quality"; some improvements in speed and
 orthogonality of the code; bugfixes in the execution engine and in
 mailfilter.crm; and hex input and output in EVAL math computations
 (the x and X formats are now allowed, but are limited to 32-bit integers;
 this is for future integration of libSVM to allow SVM-based classifiers).
 Upgrade is recommended only if you're coding your own filters (to get the
 new documentation) or if you are experiencing buggy behavior with prior
 releases.
   _________________________________________________________________
20050721-BlameNeilArmstrong
 ("Does God hear the prayers of toasters?")
 Both of these kits include the TRE 0.7.2 library with the \E patch, either
 as source or statically linked into the executable binary.
 BlameNeilArmstrong has the mmap patch, and introduces two new classifier
 options - UNIGRAM and HYPERSPACE - and a new ISOLATE option, DEFAULT.
 ISOLATE DEFAULT tells the ISOLATE to only change the value if the variable
 is previously undefined (this is exactly what you want to easily set
 default values for things that might get set from the command line).
 UNIGRAM turns off the multi-word features, which lets you test CRM114
 against a "nominal" Bayesian classifier to see if the multiword features
 actually help in your spam mix (and by how much) without breaking all of
 your careful configuration. HYPERSPACE uses the new hyperspatial
 classifier, which isn't compatible with anything else but runs super
 accurately and scary fast on only a few hundred K of features (4147
 messages classified at > 99.5% accuracy, 300 Kbytes of feature data, in
 under 1 minute on a Pentium-M 1.6 GHz). Hyperspace is still experimental,
 so keep your text copies of training sets if you decide to play with it.
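 The unigram-versus-multiword distinction can be sketched like this (a
 simplified illustration of OSB-style skip-pair features over a 5-token
 window; CRM114's actual feature hashing differs):

```python
def unigram_features(tokens):
    # "Nominal Bayesian" features: each token stands on its own.
    return [(t,) for t in tokens]

def osb_features(tokens, window=5):
    # Multiword features: pair each token with each of the next few
    # tokens, remembering the gap between them, so word order and
    # proximity become part of the feature.
    feats = []
    for i, t in enumerate(tokens):
        for d in range(1, window):
            if i + d < len(tokens):
                feats.append((t, d, tokens[i + d]))
    return feats

toks = "buy cheap pills now".split()
assert unigram_features(toks) == [("buy",), ("cheap",), ("pills",), ("now",)]
# Pairs like (buy,1,cheap), (buy,2,pills), (buy,3,now), (cheap,1,pills)...
assert ("buy", 3, "now") in osb_features(toks)
```

 UNIGRAM simply drops the pair features, leaving only the single-token ones.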
   _________________________________________________________________
20050518.BlameMercury
 ("How bad can you be and still go to heaven?")
 Both of these kits include the TRE 0.7.2 library with the \E patch, either
 as source or statically linked into the executable binary.
 This version is primarily a documentation release - CRM114 Revealed (.pdf,
 250+ pages) is now available for download. BlameMercury has lots of bugfixes
 and only three extensions - you can now demonize minions onto pipes, you can
 re-execute failing commands from a TRAP finish, and you can now use a regex
 directly as a var-restriction subscripting action, so [ :my_var: /abc.*xyz/ ]
 gets you the matching substring of :my_var: . Var-restriction matches do
 NOT change the "previous MATCH" data on a variable.
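 In Python terms, the var-restriction [ :my_var: /abc.*xyz/ ] behaves
 roughly like a non-destructive regex extraction (an analogy, not CRM114's
 implementation):

```python
import re

def var_restrict(value, pattern):
    # Return the substring of `value` matched by `pattern`, without
    # touching any other state - just as a var-restriction leaves the
    # variable's "previous MATCH" data alone.
    m = re.search(pattern, value)
    return m.group(0) if m else None

assert var_restrict("..abc hello xyz..", r"abc.*xyz") == "abc hello xyz"
```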
   _________________________________________________________________
20050415.BlameTheIRS
 ("Do You Feel Lucky Today?")
 Both of these kits include the TRE 0.7.2 library with the \E patch, either
 as source or statically linked into the executable binary.
 Math expressions can now set whether they are algebraic or RPN by a leading
 A or N as the first character of the string. Listings are now controllable
 with -l N, from simple prettyprinting to full JIT parse. A bug in the
 microgroomer that caused problems when both microgrooming and UNIQUE were
 used in the same learning scenario was squashed in the Markovian, OSB, and
 Winnow learners (it remains in OSBF). The dependency on formail (part of
 procmail) in the default mailfilter.crm has been removed. A cleaner method
 of call-by-name, the :+: indirection-fetch operator, has been activated. The
 var :_cd: gives the call depth for non-tail-recursive routines. Minor bugs
 have been fixed and minor speedups added. "make uninstall" works.
 Documentation of regexes has been improved. Cached .css mapping activated.
 Win32 mmap adapters inserted.
   _________________________________________________________________
20041231.BlameSanAndreas
 ("I never could get the hang of Thursdays")
 Both of these kits include the TRE library, either as source or statically
 linked into the executable binary.
 Major topics: New highest-accuracy voodoo OSBF classifier (from Fidelis
 Assis), CALL/RETURN now work less unintuitively, SYSCALL can now fork the
 local process with very low overhead (excellent for demons that spawn a new
 instance for each connection), and of course bug fixes around the block.
 Floating point now accepts exponential notation (e.g., 6.02E23 is now valid)
 and  you can specify output formatting in e, E, f, F, g, and G styles.
 MICROGROOM is now much smarter (thanks again to Fidelis) and you can now do
 windowing BYCHUNK. The memory leak on repeated UNIQUE LEARNS has been fixed.
 This  new revision has Fidelis Assis' new OSBF local confidence factor
 generator; with the OSB front end and single-sided threshold training with
 pR of roughly 10, it is more than 3 times more accurate and 6 times faster
 than straight SBPH Markovian and uses 1/10th the file space.
 The only downsides are that the OSBF file format is incompatible and not
 interconvertible between .css files and OSBF .cfc files, and that you _must_
 use single-sided threshold training to achieve this accuracy. Single-sided
 threshold training means that if a particular text didn't score above a
 certain pR value, it gets trained even if it was classified correctly. For
 the current formulation of OSBF, training all nonspams with pR's less than
 10, and all spams with pR's greater than -10 yields a very impressive 17
 errors on the SA torture test, versus 42 errors with Winnow (double-sided
 threshold with a threshold of 0.5) and straight Markovian (54 errors with
 Train Only Errors training, equivalent to single-sided training with a
 threshold of zero).
 We also have several improvements in OSB, which gets down to 22 errors on the
 same torture test, again with the same training regimen (train if you aren't
 at least 10 pR units "sure"). OSB _is_ upward compatible with prior OSB and
 Markovian .css files, and runs at the same speed as OSBF. It also doesn't use
 the "voodoo exponential confidence factor", so it may be a more general
 solution (on parsimony grounds); its other properties are similar to OSBF.
 CLASSIFY and LEARN both default to the obvious tokenizer regex of
 /[[:graph:]]+/. CALL now takes three parameters:
      CALL /:routine_to_call:/ [:downcall_concat:] (:return_var:)
 The routine itself gets one parameter, which is the concatenation of all
 downcall_concat args (use a MATCH to shred it any way you want).
 RETURN now has one parameter:
      RETURN /:return_concat:/
 The values in :return_concat: are concatenated and returned to the CALL
 statement as the new value of :return_var: ; they replace whatever was in
 :return_var: .
 SYSCALL can now fork the local process and just keep running; this saves new
 process invocation time, setup time, and time to run the first pass of the
 microcompiler.
 WINDOW now has a new capability flag - BYCHUNK mode, specifically for users
 WINDOWing through large blocks of data. BYCHUNK reads as large a block of
 incoming data in as is available (modulo limits of the available buffer
 space), then applies the regex. BYCHUNK assumes it's read all that will be
 available (and therefore sets EOF), so repeated reads will need to use
 EOFRETRY as well.
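 BYCHUNK-style windowing can be sketched in Python (a generic
 read-a-big-block-then-apply-the-regex loop; names are illustrative, not
 CRM114 internals):

```python
import io
import re

def bychunk(stream, pattern, bufsize=8192):
    """Read as large a block as is currently available (up to the
    buffer limit), then apply the regex to that block - roughly the
    BYCHUNK behavior described above, where an empty read is
    treated as end-of-input."""
    while True:
        block = stream.read(bufsize)
        if not block:
            break          # EOF: assume nothing more will arrive
        yield from re.findall(pattern, block)

stream = io.StringIO("alpha beta  gamma\ndelta")
tokens = list(bychunk(stream, r"\w+"))
```

 Note that a token straddling a chunk boundary gets split, which is part of
 why repeated reads need the EOFRETRY-style handling mentioned above.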
   _________________________________________________________________
20040627-BlameSeifkes released June 27 2004
 (Pretty Darn Good, eh?)
   _________________________________________________________________
20040328-BlameStPatrick released March 28, 2004
 Besides the usual minor bugfixes (thanks!) there are two big new features in
 this revision: 1) We now test against (and ship with) TRE version 0.6.8.
 Better, faster, all that. :) 2) A fourth new classifier with very impressive
 statistics is now available. This is the OSB-Winnow classifier, originally
 designed by Christian Siefkes. It combines the OSB frontend with a balanced
 Winnow backend. It may well be twice as accurate as SBPH Markovian and
 four times more accurate than Bayesian. Like correlative matching, it does
 NOT  produce  a direct probability, but it does produce a pR, and it's
 integrated into the CLASSIFY statement. You invoke it with the winnow flag:
      classify < winnow > (file1.cow | file2.cow) /token_regex/
 and
      learn < winnow > (file1.cow) /token_regex/
      learn < winnow refute > (file2.cow) /token_regex/
 Note that you MUST do two learns on Winnow .cow files - one "positive" learn
 on the correct class, and a "refute" learn on the incorrect class (actually,
 it's more complicated than that and I'm still working out the details).
 Being experimental, the OSB-Winnow file format is NOT compatible with
 Markovian, OSB, or correlator matching, and there's no great functional
 checking mechanism to verify you haven't mixed up a .cow file with a .css
 file. Cssutil, cssdiff, and cssmerge think they can handle the new format -
 but they can't. You can (and should) microgroom with winnow, both on the
 forward and refute passes.
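 The positive-learn/refute-learn pair reflects how balanced Winnow updates
 weights. A minimal sketch (textbook balanced Winnow; not the OSB-Winnow
 file format or its exact update rule):

```python
class BalancedWinnow:
    """Textbook balanced Winnow: each feature carries a positive and a
    negative weight; learning promotes one side and demotes the other."""
    def __init__(self, promote=1.5, demote=0.5):
        self.w = {}                 # feature -> (w_plus, w_minus)
        self.promote, self.demote = promote, demote

    def score(self, feats):
        return sum(wp - wm for wp, wm in
                   (self.w.get(f, (1.0, 1.0)) for f in feats))

    def learn(self, feats, positive):
        # A "positive" learn promotes w_plus; a "refute" learn
        # (positive=False) does the opposite - which is why a Winnow
        # class file needs both kinds of training.
        for f in feats:
            wp, wm = self.w.get(f, (1.0, 1.0))
            if positive:
                self.w[f] = (wp * self.promote, wm * self.demote)
            else:
                self.w[f] = (wp * self.demote, wm * self.promote)

w = BalancedWinnow()
w.learn(["cheap", "pills"], positive=True)        # learn into this class
w.learn(["meeting", "agenda"], positive=False)    # refute the other features
assert w.score(["cheap", "pills"]) > w.score(["meeting", "agenda"])
```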
   _________________________________________________________________
 Previous versions, and some RPM and .autoconf-modded packages, can be found
 on the SourceForge File Releases page. This includes versions that have
 been known to build on Mac OS X, Windows, etc. There is no guarantee that
 the version you desire will be found there, but it's a good place to look.
   _________________________________________________________________