Entropic Classification Basics
This is an implementation of an entropic bitwise lattice Markov classifier. It is very roughly based on some ideas in the DMC (Dynamic Markov Compression) data compression technique of Gordon Cormack et al., and inspired by Andrej Bratko's use of it in spam filtering. (Gordon originally proposed DMC for compression, and Andrej then used it for spam filtering. Matthew Young-Lai later proposed a technique for merging nodes, but it is limited to re-merging the "cloned" nodes of Gordon's work; our merging method here has no such limitation.) This particular implementation is entirely separate: it is similar in that it uses a Markov chain, but it does not use cloning. Rather, it uses cross-linking in a minimum-entropy format, and supports "open-ended" Markov chains rather than the toroidal lattices required in Cormack's DMC.
For those of you into references as "closest prior", see:
1) Cormack, Gordon V., et al., Data Compression using Dynamic Markov Modeling (Computer Journal #6, 1987).
2) Bratko, Andrej, Markov Modeling for Spam Filtering (?), TREC 2005.
3) Young-Lai, Matthew, Adding state merging to the DMC data compression algorithm, Information Processing Letters 70 (1999) 223-228.
The basic concept is to construct a Markov chain that describes the sequential bit pattern of a message, and then use that chain to compress the incoming unknown text; the smaller the entropy the better the quality of match. Yes, this sounds like a horrible idea, but Andrej Bratko at TREC 2005 has shown that it does work. One advantage is that it can work on files that have no obvious tokenizations, like JPGs.
Like every other Markov chain, there's a START node; then, for each observed 1 or 0, each node records the observed frequency of a 1 or a 0 and the associated next node for a 1 or a 0. The quality of match is just the sum of the entropy (*) of each path taken.
More correctly, entropy is the weighted sum for an ensemble of signals of -Psig * log2 (Psig), and the sum over an ensemble of possible texts yields the entropy, which is the average bit rate. Whichever Markov chain shows less entropy is the one that matches best. However, in this case we're only given one example, which is the text in consideration, and so the probability of each observed transition is exactly 1.0000 (as we have only a single sample of the unknown). Thus, we actually sum the bits required; that is, -log2 (Psig).
Nota Bene 1: in order to make this code a little bit more general, we have each node accept an alphabet of N different symbols, where N defaults to 2 (that is, bitwise) but we'll write the code to be more general than that; it's a compile-time change rather than a hack-the-code change. States *are* always numbered starting at zero, though, so consider the alphabet to always be an ENUM.
Nota Bene 2: If you happen to be familiar with the other CRM114 classifiers, note that this classifier does NOT use the tokenizer regex system, nor the SBPH or the OSB feature set. Instead, all that state information is captured by where you are in the Markov chain, and the concept of "token" is replaced by the concept of "alphabet". By default, right now the "alphabet" is single bits.
Nota Bene 3: As currently designed, the learning algorithm defaults to initialize with a toroidal perfect shuffle; thus the default model always takes constant memory and you can't run off the "edge" of the model. We still need to fix it so that "crosslink" will allow a jump to another more appropriate (lower entropy) part of the braid; that's not there yet.
Nota Bene 4: As the toroidal shuffle didn't give extremely good results, there's also a "unique state" classifier, set by the flag < unique >. This generates a *loopless* chain where each text learned produces a single linear chain; this means we follow whatever model has already had the same bit pattern, but whenever we diverge, we start allocating new bit nodes immediately. This works decently but uses up a lot of nodespace quickly. It also means that you can run off the "edge" of the model because it's not closed like the toroidal shuffle. By default, running off the edge causes a restart at the START node (node 1, as it turns out; node 0 is a bookkeeping/anchor/sentinel node). This works, but doesn't give very high accuracy as the reSTARTs are not time-aligned with the model; they might even occur in the middle of a byte.
Nota Bene 5: In the quest for better accuracy, instead of doing a reSTART when we're classifying and fall off the edge of a 'unique' model, we branch to a node that has a very similar prior history. How this is done is seen below but it boils down to finding a node that already exists that has had a very similar prior history bit pattern. This can actually be done in near-constant (and pretty small!) time; the magic to do this is all described below.
Nota Bene 6: In the quest for smaller (and faster!) models and the ability to not run out of nodes, we have the <crosslink> flag, which allows the LEARNing process to do the same branching that is done when we run off the end of the model chain during classification. More specifically, before allocating a new node, we look for the closest prior history in already-allocated nodes, and if it's "close enough" (as set by a threshold), we make that the next node in our chain. This produces Markov models with both acyclic joins and repeating loops, and effectively prevents the model from growing without bound, as the model reuses nodes quite effectively (sometimes more than a 300:1 reuse!). It also produces a model that is quite accurate when trained with SSTTT. Although we can argue that ROCAC isn't the best possible figure of merit, SSTTT can yield very good ROCAC values (0.03 per cent); the terminal error rate on the TREC06 public corpus can be as low as 3 errors per 10,000 (that is, three errors in the final ten thousand messages, of the 92Kmsgs corpus.)
Algorithm for Classification:
Follow the Markov chain from START with total entropy of 0 to begin, then walk the Markov chain taking each branch as instructed by the successive bits of incoming text. Along the path, accumulate the compressed size as defined by -log2 (Pt) for the current transition. Here, we define the probability for a transition as Pt = (Nt + c) / (Nt + N~t + 2c). Note that for c = 0 this can yield probabilities of exactly 0.00, which will yield an infinite entropy (log of zero is... not a good thing). Gordon Cormack's paper indicates a "small" value of c is good for shorter texts; unfortunately there's no mention of what "small" means.
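As a quick sketch of that cost function (Python, with hypothetical names; the choice c = 0.5 below is purely illustrative, not a recommendation for "small"):

```python
import math

def transition_cost(n_taken, n_other, c=0.5):
    """Bit cost -log2(Pt) of a transition seen n_taken times, where the
    alternative symbol was seen n_other times; c is the smoothing
    constant (c = 0 risks a log of zero on never-seen transitions)."""
    pt = (n_taken + c) / (n_taken + n_other + 2 * c)
    return -math.log2(pt)

transition_cost(99, 1)   # a well-trodden branch: ~0.02 bits
transition_cost(1, 99)   # a surprising branch:   ~6.1 bits
transition_cost(0, 0)    # a never-visited node:  exactly 1 bit
```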
If you run off the end of the chain (some learning algorithms generate terminating chains, others always generate loops) then do one of: (1) stop accumulating entropy (you can't go any further); (2) go back to the START node and keep accumulating; or (3) jump to the node with the closest prior history and keep accumulating.
Do this on each bitwise Markov file; the file that accumulates the lowest compressed size is the best matching file.
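The whole classification walk can be sketched in the same hedged form (the node layout is invented for illustration: a dict per node mapping each symbol to [seen-count, next-index-or-None]; a real model would be an mmapped array, not a Python list):

```python
import math

def classify_cost(nodes, start, bits, c=0.5):
    """Walk one model from START, summing -log2(Pt) per transition.
    On falling off the end (a NULL exit) this sketch just stops
    accumulating, i.e. option (1) above."""
    cost, cur = 0.0, start
    for b in bits:
        slot = nodes[cur][b]
        n_other = sum(s[0] for sym, s in nodes[cur].items() if sym != b)
        cost += -math.log2((slot[0] + c) / (slot[0] + n_other + 2 * c))
        if slot[1] is None:
            break
        cur = slot[1]
    return cost

# One-node model that has mostly seen 0s:
model = [{0: [10, None], 1: [0, None]}]
classify_cost(model, 0, [0])   # cheap: the model predicted this bit
classify_cost(model, 0, [1])   # expensive: the model is surprised
```

The best-matching file is then simply the minimum of the per-model costs.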
Algorithm for Learning - Overview

Before you learn the first text, initialize the data structure with a START node with NULL exit nodes for each of the possible exits. If you want, you can put something bigger or more complex in, but it's not necessary. If you're running with constant storage, allocate the rest of the storage, preformat it as a linked list of free nodes, and off you go.
Follow the chain, incrementing "seen" counts as you go. If you're learning a file, and you end up at a node exit that's a NULL (meaning no node is yet linked in for the current bit pattern) then grab a node off the free list (or malloc one). See below under "advanced methods" for how to get more nodes by merging very similar nodes together.
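The learn step above, sketched with the same invented node layout (a growable Python list stands in for the free list / malloc):

```python
def learn(nodes, start, bits):
    """Walk the chain from `start`, bumping seen counts; whenever an
    exit is still NULL, link in a freshly allocated node."""
    cur = start
    for b in bits:
        slot = nodes[cur][b]
        slot[0] += 1                              # increment the seen count
        if slot[1] is None:                       # NULL exit: grab a new node
            nodes.append({0: [0, None], 1: [0, None]})
            slot[1] = len(nodes) - 1
        cur = slot[1]
    return cur

nodes = [{0: [0, None], 1: [0, None]}]            # just the START node
learn(nodes, 0, [1, 0, 1])                        # allocates three new nodes
```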
Whenever a node is seen with both a sufficiently large number of
seens, and a sufficiently balanced number of differing symbols, "clone" the node so that the Markov chain branches on the previous node. This is to split the chain where the current state is insufficient to model the underlying process. This only can work if the Markov model somehow contains a merge; otherwise each node can only be reached from a single other node (and so on, all the way back to the START node).
In Cormack's version, when you run out of free nodes, you must reinitialize the storage and rebuild the model (using only the last part of the corpus data), but there are ways to merge similar nodes (see "advanced methods" below) to get more free nodes without throwing away data. Note - there is a "classical" method for building a hidden Markov model, but it's O(n^2) in the number of nodes in the output model! It basically uses hillclimbing techniques to form the convex hull and then climbs to the local top of the hull repeatedly. That's fine for small HMMs, but here, where we may have a million nodes, that's way too slow.
Method 1 - for both initial-loop and terminated Markov chains
Follow the Markov chain from START, incrementing the seen-0 and seen-1 bits of the Markov chain. When you hit a 0 or 1 choice pointing to NULL, grab an unused node off of the free list. If the free list is empty, use an advanced method to get more.
Advanced Method - Cloning a Node
If a node has more than one infeeder, there is always the possibility that the node really should be two nodes. This could happen either because of a prior merge, a prior clone, or because the initial node setup had a loop in it (which it may or may not have; it can work both ways).
A node should be cloned when there is a sufficient number of incoming events to be statistically significant, and the outgoing events are sufficiently balanced (i.e. the current node is of sufficiently low predictive value - that is, it's high entropy no matter what you do) to lend credence to the idea that this one node really represents two states and splitting the node is a good idea.
Of course, there's no reason (or way!) to clone a node that has only one infeeder even if the node state itself has low predictive power; in this case what's really happening is that the real corpus' hidden Markov model is diverging at a prior node. There simply isn't any predictive state available.
Topic for further work - given only a single infeeder, maybe we should back propagate the "split" up the chain of single priors until a state is found with at least two infeeders, and clone *that* node and all successor nodes down to the current node. Note that if we hit the START node we don't do this as there's no way to clone START.
The simple and obvious way to clone a node of low predictive power is to keep track of the previous node while doing a training traverse, and only clone off the path from the previous node. This means creating a new node with the same successor nodes as the current node (or not; it depends on whether the current node has a successor or is "end of the line"), and has only a single infeeder, which we know is at best a single line of prior states of arbitrary length, and at worst a series of nodes that all looked similar at some time in the past.
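A sketch of that clone-off-the-path operation (hypothetical structures as in the other sketches; how the seen counts should be divided between the original and the clone is a policy question these notes leave open, so this sketch simply copies them):

```python
def clone_off_path(nodes, prev, sym, cur):
    """Give the path prev --sym--> cur its own private copy of `cur`.
    The clone gets the same successor links; other infeeders of `cur`
    are untouched, so the chain now branches at `prev`."""
    clone = {s: [seen, nxt] for s, (seen, nxt) in nodes[cur].items()}
    nodes.append(clone)
    nodes[prev][sym][1] = len(nodes) - 1          # re-aim only this infeeder
    return len(nodes) - 1

# Node 1 has two infeeders (node 0's 0-exit and its 1-exit); split the
# 0-exit path off into a new node 2:
nodes = [{0: [5, 1], 1: [3, 1]}, {0: [2, None], 1: [6, None]}]
clone_off_path(nodes, 0, 0, 1)
```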
Advanced methods - getting free nodes by merging nodes
NOTA BENE: the code to implement merging is not debugged because the open-graph training with crosslinking works so well. Consider this stuff more as "design notes for the future".
merging nodes - in order to represent loops in the Markov model, it's sometimes necessary to loop back into prior Markov states. Finding the appropriate prior state can be a hassle. Cormack and Young-Lai both postulate a "way" to find the appropriate prior state, but don't describe an efficient way to do so.
Here's a shortcut, using arithmetic to get an infinite impulse response out of a floating point number; the already-established node with the closest floating point number to the currently generated floating point number is the "closest node". This is a different version of "arithmetical coding", but it can be lossy (in a controlled way). See the Guazzo coding method in Cormack's paper - note that we do NOT use Guazzo coding for generating output, but rather just for quickly finding the node with the closest prior state for us to jump to when we run off the edge of a model.
Clearly, one could weight the prior bit by a factor of 1/2 (really 1/(2^1)), and weight the prior float by 1 - (1/2), in which case the prior float's mantissa is just the last N bits, in reverse order. Unfortunately, this also guarantees that prior history utterly disappears as soon as it rolls out of the number of bits in the mantissa of the local floating point representation.
More interesting are larger values of the exponent; for example, a value of 2 yields a half-bit fractional representation, so an error of 1 bit seen N steps in the past is overridable by two correct bits at N+1 and N+2 steps in the past. For example, with an exponent of 2, the most recent bit has significance 1/4 = 0.250, while the second-most-recent is 3/4 * 1/4 = 3/16 = 0.1875, and the bit before that is 3/4 * 3/16 = 9/64 = 0.140625, and so on. Higher exponents just stretch out this series even further. There is no need to assume only an integer exponent.

Indeed, it's only because of the finite precision of the computer that this is a finite impulse response (FIR) filter; infinite precision would give infinite impulse response and "total recall" of prior states in the Markov process - but also the ability for some long-ago series of perfectly correct steps to blow away the more recent entropy. For this reason, we usually use an exponent of 1, which yields a prior bit weight of 1/2 (which is just the floating-point mantissa as described above). (NB: as of 20070101, the "floating point" has been replaced by a "long long" with 64 bits of precision. This uses less space.)

Values of the exponent less than one yield a situation where the range of possible outcome FIR numbers becomes discontinuous; that is, significant chunks of the range become unreachable by any possible prior bit series. Consider the FIR value 0.5000 with an exponent of 0.5. The most recent bit then has a weighting of 0.70711, with only 0.29289 of weight available for all following bits. The maximum value achievable for the pattern 0111111... is then 0.29289, while the minimum value achievable for the pattern 100000... is 0.70711, so the range of possible FIR numbers does not contain any value between 0.29289 and 0.70711. Recursively, similar subranges cease to be accessible as well.
The exponent for "exact ambivalence" between an error in state N versus two errors in state N-1 and N-2 is around 1.385; the exponent for ambivalence between an error in state N versus errors in states N-1, N-2, and N-3 is about 1.13 . Note that both of these numbers yield rather short "memories" in the FIR filter before the Nth prior bit underflows and is lost completely.
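The FIR prior-history update itself is tiny; a sketch in floating point (the production code, as noted above, moved to a 64-bit integer):

```python
def fir_update(fir, bit, exponent=1.0):
    """Fold the newest bit into the prior-history value: the newest bit
    gets weight 1/2**exponent, and all older history is scaled by the
    remaining weight."""
    w = 1.0 / (2.0 ** exponent)
    return (1.0 - w) * fir + w * bit

# With exponent 1, the FIR is just the recent bits, reversed, as a
# binary fraction: after learning 1, 1, 0, 1 the weights are 1/2
# (newest), 1/4, 1/8, 1/16 (oldest), so fir = 1/2 + 0/4 + 1/8 + 1/16.
fir = 0.0
for b in [1, 1, 0, 1]:
    fir = fir_update(fir, b)

fir_update(0.0, 1, exponent=2)   # newest-bit weight 1/4, per the series above
```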
By using the FIR coding of prior state, we need only keep a table of the FIR path floating point number generated on the path that first generated this node (which can be sorted, yielding a _very_ fast node merging capability; just make one pass through the sorted table looking for the two closest floating point numbers.) Those close pairings of nodes are good merge candidates.
It's probably worth checking these merge candidates to see if they have similar local probabilities - that is, if they have similar 0-counts and 1-counts. If one node goes 0 mostly and the other goes 1 mostly, then there's a good chance that this isn't a good merge. How to weight FIR prior history match versus the local outbound probability is a good question.
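The sorted-table scan for merge candidates can be sketched in a few lines (invented names; a production version would keep the table sorted incrementally rather than re-sorting each time):

```python
def closest_fir_pair(fir_nodes):
    """fir_nodes: list of (fir_value, node_index) pairs.  One pass over
    the sorted list finds the two nodes with the nearest prior
    histories, i.e. the best merge candidates."""
    s = sorted(fir_nodes)
    a, b = min(zip(s, s[1:]), key=lambda pair: pair[1][0] - pair[0][0])
    return a[1], b[1]

candidates = [(0.31, 4), (0.87, 9), (0.33, 7), (0.52, 2)]
closest_fir_pair(candidates)   # nodes 4 and 7: FIRs 0.31 and 0.33 are nearest
```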
Another merge-candidate method is to look at nodes which have identical 0 or 1 destinations. However, finding this requires either an extensible list of back pointers or a brute-force search (albeit over a very small subset of the entire Markov chain).
If we're running in <unique> mode (that is, nodes are allocated on each unique path, and no node has ancestors with any possibility of significantly deviating FIR values), we can also back-calculate the approximate "feeder" node FIR values that could lead to a particular node, given the decreasing exponent. For example, if the current node's FIR is .3700000, then it can only have been reached by a 0-branch from a node whose prior was (2 * .3700000 = .7400000) plus or minus the crosslink-error threshold (which is typically under 1 in a million). So, we can just use our FIR lookaside table to find all possible ancestors to this node.
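For exponent 1 the back-calculation is exact arithmetic: fir = prior/2 + bit/2, so prior = 2*fir - bit. A sketch (the threshold value is illustrative):

```python
def possible_ancestors(fir, eps=1e-6):
    """With exponent 1 (newest-bit weight 1/2), invert the FIR update:
    prior = 2*fir - bit.  Only priors landing in [0, 1] are reachable."""
    feeders = []
    for bit in (0, 1):
        prior = 2.0 * fir - bit
        if -eps <= prior <= 1.0 + eps:
            feeders.append((bit, prior))
    return feeders

possible_ancestors(0.37)   # only a 0-branch from a node with FIR 0.74
```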
When merging two nodes (speaking in the 0/1 sense; larger alphabets are an obvious extension), we perform the following steps:
* we combine the two nodes' 0-seen and 1-seen counts. Without any loss, we can choose the to-be-kept node to be the node with the lower index number.
* we keep whichever output chain had the highest counts in each of the 0-seen and 1-seen categories
* we set the FIR prior history to the weighted average of the two prior weighted averages
* for the two output nodes which were _not_ kept (a 0-chain and a 1-chain), mark these chains as eligible for garbage collection if we choose active collection;
* put the merged-away node on the free list.
* make an update pass through the nodeset; any node outlet pointing to the merged-away node is reset to point to the to-be-kept node.

We may choose to merge more than one node when we encounter an empty free list. Each merge gives us one node for sure, plus the possibility of many others.

At some point, a garbage collection is necessary to reclaim nodes that no longer have a prior node. We can do this either by:
* reference counts - each node can keep an accurate reference count of nodes that can transfer to this node. Since there are only three events that can cause the reference count to change (those being "allocate from free list == refcount = 1", "clone node == refcount split among incomings", and "merge with another node == refcount1 + refcount2"), the refcounts can be maintained at relatively low cost.
* we can run a "classic" mark and sweep. However, there's a better way...
* we could add back pointers in each node. Unfortunately, that means the size of a node could grow rather large. There is still a better way...
* we could add chaining pointers on the _output_ of each seen slot. The "to" node gets a single slot added, to point to the first node that can transfer in; each such node points to the next node that can also transfer to the target node. Preferably, the insertions are done in order so the list remains sorted.
When learning a new text, we also compute the running FIR prior floating point value, and on each node we encounter in our path, we update the local prior state FIR value according to the weighted average of the new text (with weight 1) and the prior texts (with weight = total times this node has been visited in the past = sum of the 0-seen and 1-seen output values). For this reason, node FIR prior encodings may change, so it's necessary for the sorted FIR vector to contain not just the FIR values, but also backpointers to the nodes that held them.

Improvement: rather than a separate sorted list of FIR values, embed the FIR value in the cell, along with the "next less" and "next greater" indices. Thus, each cell contains within it pointers to its most likely merge candidates.

Secondary improvement: when allocating a new node from the free list, how does one efficiently link the new node into the next-less and next-greater lists? Sequential search clearly works, but is inefficient. Clearly a binary (or near-binary) tree would assist this, but it's a programming hassle to maintain the two data structures in synchrony.
Alternatively, a fixed-size table of nodes at or close to various "thresholds" can be maintained; when a node changes its FIR value, it only takes constant time to round the FIR to a table entry value and check to see if the new value is closer to the "perfect" value of FIR for that table entry versus the node that is currently in that slot. If it's better (not just as good; "better", to minimize thrashing) then the new node index is written into the table; otherwise the old node stays.
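A constant-time lookaside update along those lines might look like this (slot count, layout, and names are all invented for the sketch):

```python
class FirLookaside:
    """Fixed-size lookaside: one representative node per FIR sub-range."""
    def __init__(self, slots=8):
        self.slots = slots
        self.table = [None] * slots          # holds (node_index, fir)

    def update(self, node, fir):
        """Constant-time: round the FIR to a slot, and keep whichever
        node is strictly closer to the slot's center ("better", not
        just as good, to minimize thrashing)."""
        i = min(int(fir * self.slots), self.slots - 1)
        center = (i + 0.5) / self.slots
        cur = self.table[i]
        if cur is None or abs(fir - center) < abs(cur[1] - center):
            self.table[i] = (node, fir)

t = FirLookaside()
t.update(3, 0.40)    # slot 3 covers [0.375, 0.5); its center is 0.4375
t.update(5, 0.44)    # 0.44 is closer to the center, so node 5 takes the slot
```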
Note that nothing prevents a node from being in the table more than once, nor is there any correctness "need" for the table to be exactly right. Instead, the lookaside table is just a speedup that helps a lot most of the time. When a node changes value, it will move (some), so incorrectness is part of the game. If you really want, you can move the node to a new lookaside table slot, and then look at the previous next-less and next-greater nodes, and choose one of them to go into the table, replacing the (now moved) node. (NB: in implementation, it turns out that the FIR lookaside table is *critical* for reasonable performance; any bug in that code that causes weak initial guesses will slow the classifier down by a factor of 10 or more.)
Further Hacks and Improvements: resuming a CLASSIFY past NULL. If the classifier runs off the end of the known Markov chains, what do we do? Just drop the following text? Possibly - but there is an alternative.
We can use the FIR arithmetical coding to resume the Markov chain at a near-optimal position. The difference between where we are and where we resume should be as small as possible; fortunately, the FIR-next-least and FIR-next-greater indices in the current node give the two nearest side-steps at the current Markov node, and we can then explore up and down the FIR chain to find a preexisting node that actually has a further non-NULL chain. The entropy increase due to this side-step outside the Markov chain is not entirely computable from the FIR values, because of the prior information that has already rolled off the least significant bit of the FIR calculation. (Note: this is currently the default action.)
IMPLEMENTATION AND STORAGE DETAILS
Data storage format for bitwise Markov classifiers. Note that we always use indices rather than pointers as we want the data structure to be stored and reloaded from disk via mmap.
We have several "global static" values, which are stored in the meta-start node which is always node 0:
in alphabet-zero slot.
If we have a FIR lookaside speedup table, that also needs to have space allocated.
Each normal node contains a header containing:
These FIR prior values are updated whenever a node is visited during learning. This may mean minor motion up and down the FIR chain.
Each node also contains BITMARKOV_ALPHABET_SIZE slots; each slot contains the following (so there's one such slot for each member of the alphabet A). There are two kinds of data stored in each slot; they are related but not interdependent. The forward Markov slot data is used during classification; the backward slots are used only during maintenance.
Forward Markov slots:
is (index, not pointer, for position-independent code)
Backward slots for maintenance (clone and merge)
node when that prior node saw this member of the alphabet A
(note that this is NOT the alphabet member that the current node is seeing. This just gives an array of backlinks for each member of the alphabet). NULL if there's no known encounters of an incoming with the most recent prior for this alphabet member.
same outgoing node as this node. In combination with the
prior index above, this forms a linked list of all of the nodes that can reach a particular node. As we don't need to traverse this list in any particular order, it's strictly a "stack" (newest added at the front).
Using 64-bit storage for the floating-point FIR values and for the visit counts, and 32-bit storage for all indices, then assuming we're using a binary alphabet, we have:
Per-cell:
    FIR prior float                  = 8 bytes
    FIR up and down links, 2 * 4     = 8 bytes
Per slot within the cell:
    forward count,  8 bytes * 2      = 16 bytes
    forward index,  4 bytes * 2      = 8 bytes
    backward first, 4 bytes * 2      = 8 bytes
    backward next,  4 bytes * 2      = 8 bytes
Total - 56 bytes storage per node, or about 18K nodes per megabyte. Note that the absolute minimum size is 24 bytes/node with no advanced features, so we're about a factor of two worse in space, and for that we get near-linear scaling in text size for both learning and classification.
If we add in a 0.1% overhead for the FIR prior lookaside acceleration table (containing 1 index per 4-byte word), then for every 4 bytes (1 index) of FIR prior lookaside there are 4 Kbytes (about 73) FIR value cells; so with the FIR prior lookaside and an assumption of roughly uniform FIR prior values, we need to traverse an average of 1/4 of 73, or about 18, Markov chain cells. With a 1.0% overhead (1 byte per 100 bytes, or 1 index slot per 400 bytes) there is 1 index slot per 7 Markov chain cells, and (again assuming a uniform distribution of FIR priors) an average of just 1.8 Markov chain cells visited per FIR lookup.
If the distribution of FIR values is too nonuniform, an adaptive grid method may be used to smooth it out. But we'll climb that mountain when we come to it.
Note to self: don't use the mathlib LOG function! It's far too slow. We'll custom craft our own that uses linear interpolation over a lookup table.
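One common shape for such a replacement is a precomputed table over the mantissa range [1, 2) plus linear interpolation; a sketch (the table size is an arbitrary choice here, and only positive arguments are handled):

```python
import math

TABLE_BITS = 10
TABLE = [math.log2(1.0 + i / (1 << TABLE_BITS))
         for i in range((1 << TABLE_BITS) + 1)]

def fast_log2(x):
    """log2 for x > 0 via table lookup plus linear interpolation."""
    m, e = math.frexp(x)          # x = m * 2**e with m in [0.5, 1)
    m *= 2.0                      # shift the mantissa into [1, 2)
    idx = (m - 1.0) * (1 << TABLE_BITS)
    i = int(idx)
    frac = idx - i
    return (e - 1) + TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

fast_log2(1.5)    # within a few 1e-7 of math.log2(1.5)
```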
An interesting information-theoretic philosophical point - if we only have a million nodes in the database, then at most we encode 20 bits worth of state. A 500-megabyte file like Andrej uses would have at most 62 million nodes (at 8 bytes/node with short counts), or 41 million nodes (at 12 bytes/node), which is about 25 bits to define state. Since the FIR-prior value encodes at least 64 bits, and can go to 96 bits (GCC 4.0 "long double") and possibly 128 (if your compiler supports "-mlong-double-128"), shouldn't we be sticking with the FIR-prior and simply *ignoring* the actual links, as they (and the state nodes) encode far less information?

Submitted as an "ultimate" pare-down - the memory contains ONLY the times-previously-seen information. It's a big array, indexed by the most significant N bits of the prior, with the high-order bit being the transition now being executed. All that's there in the array is the number of times this transition has been executed in LEARN mode; everything else is embedded in the state (which _is_ the index). That optimization would conceivably allow one to write an embedded form of bit-entropic classification in as little as two to four megabytes of space per class; in short, something that would fit on an embedded system.
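That pare-down is almost a one-liner to sketch: the state *is* the recent-history bits, and the model is nothing but a count array indexed by (state, next-bit). (Sizes and names below are illustrative only, and the sketch uses a plain shift register rather than the reversed-bit FIR ordering.)

```python
N_BITS = 16                                # bits of history kept in the state
counts = [0] * (1 << (N_BITS + 1))         # one count per (state, bit) pair

def learn_bit(state, bit):
    """Count the (state, bit) transition, then shift the bit into the
    state; the array index itself encodes everything we know."""
    counts[(state << 1) | bit] += 1
    return ((state << 1) | bit) & ((1 << N_BITS) - 1)

state = 0
for b in (1, 0, 1, 1):
    state = learn_bit(state, b)            # state is now 0b1011
```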
File Structure and Layout for Bit Entropy
The file starts with a header block of size 4 Kbytes; the first four longwords give the indexes (in longwords, for 32-bit alignment) of the start and length of the FIR lookaside table and the node list.
Per-node memory layout
We have three pointers in the node slot; each of these is replicated for each member of the alphabet in the Markov graph.
* The .nextcell says what the next cell is when we see this alph.
* The .backfirst cell says what the first cell is that could get us here when it saw an alph.
* The .samenext cell says what the next cell is that also gets you to the same nextcell when you see the same alph.
Yes, this is slightly redundant, but it's very fast. (But note the approximation above that would save us all this memory!) A diagram here helps to understand the slotting structure:

    node
      .nextcell >------------------->-->
    node
      .backfirst <---\    />------->/
      .nextcell >----------> .samenext -|
      \---------------------< .backfirst |
    / node <-|
    / .nextcell >---/
Note- "backfirst" and "samenext" are redundant given the FIRlat system; we can always find referrers to the current node by looking at the possible origins - a node of FIR value 0.260 can only come from the alpha link of a node of FIR value 0.520 ; a node of value 0.779 can only have come from the 1-link of a FIR of .558 so we can just search the FIR around there.