A lamb, duh!: Decoding Huffman codes without the tree

Okay, so, last time I demonstrated how to serialize a Huffman decoding tree into a simple stack-based language for rebuilding the tree. This was pretty interesting in its own right, in my opinion, but was only a step down the road to the material in this installment... how to decode the Huffman code without the Huffman decoding tree. That is, the intervening data structure, the Huffman decoding tree, is completely eliminated and replaced with a call graph that does the decoding instead.

Like everything so far, a lot is owed to Heinrich Apfelmus' morse code example, from which this work is directly taken.

This technique of removing the intermediate data structure is called deforestation. In this case, when decoding a string of encoded characters, the Huffman decoding tree is built, and then traversed to find a decoded letter. The construction of the tree is a recursive application of the tree constructors starting at the root of the tree. This construction actually happens backwards, starting with leaf nodes which are assembled into branch nodes repeatedly until there is only one branch node, the root, left. In a lazy language like Haskell, these nodes may not be evaluated, but instead left as a tree of thunks (unevaluated function calls), where the function calls are calls to the node constructor. Then the decoding is a recursive traversal of that tree looking for a leaf node to decode the letter.

The construction of the tree looks essentially like this:


(Branch (Branch (Leaf 'a') (Leaf 'b')) (Branch (Leaf 'c') (Branch (Leaf 'd') (Leaf 'e'))))

(with some stuff left out for clarity).
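For comparison, here is a minimal sketch of the "before" picture: an explicit tree plus a traversal that decodes one letter. The names `Tree` and `decodeOne` are made up for illustration (the real tree type from the earlier installments carries more than this), and the `Maybe` result is my own choice for handling malformed input.

```haskell
-- Hypothetical, stripped-down tree type (an assumption, not the
-- HuffTree type from the earlier installments).
data Tree = Leaf Char | Branch Tree Tree

type Code = String

-- The five-letter example tree shown above.
exampleTree :: Tree
exampleTree =
  Branch (Branch (Leaf 'a') (Leaf 'b'))
         (Branch (Leaf 'c') (Branch (Leaf 'd') (Leaf 'e')))

-- Walk the tree, consuming one symbol per branch, until a leaf is
-- reached. Returns the decoded letter and the unconsumed symbols.
decodeOne :: Tree -> Code -> Maybe (Char, Code)
decodeOne (Leaf c)     rest       = Just (c, rest)
decodeOne (Branch l _) ('0' : ds) = decodeOne l ds
decodeOne (Branch _ r) ('1' : ds) = decodeOne r ds
decodeOne _            _          = Nothing -- out of input, or a bad symbol
```

With this tree, `decodeOne exampleTree "1100"` gives `Just ('d',"0")`: three symbols are consumed on the way down to the `'d'` leaf and one is left over. Deforestation aims to get the same answer without ever building `exampleTree`.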

Apfelmus replaces the constructor calls with functions that tell what to do in the case of reading a particular character from the input stream. And this is where one of the fundamental differences between the morse code tree and the Huffman coding tree comes to light. The morse code tree is essentially a trie -- the resulting character returned is based on where you are in the tree when you run out of input (morse code is broken into "words" for each letter). Each node in the morse code tree has a letter associated with it. Each dash or dot moves you further into the tree and to a different letter (allowing, I suppose, a form of partial result, though it has no meaning). When the symbols in a particular morse code "word" are used up, whatever node you're looking at is the character that has been decoded.

To implement this, Apfelmus uses the following (here I use 1's and 0's instead of dashes and dots):


branch c x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds
  []     -> c

leaf = undefined

The leaf can be undefined because any well-formed morse code will never reach the leaves of the tree. And if there is no input left (the last case in the branch function), then you're done: emit the character. I won't go into the details of how these functions are used; you can read his post yourself.
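To make the trie behaviour concrete, here is a small sketch that wires those two functions into a two-level tree. The tree `tinyMorse` is my own cut-down illustration covering six letters, not Apfelmus' actual code, and the type signatures are my guess at what the functions would infer to.

```haskell
-- Trie-style decoder: every branch carries the letter to emit when
-- the input runs out at that node.
branch :: Char -> (String -> Char) -> (String -> Char) -> (String -> Char)
branch c x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds
  []     -> c

-- Never reached by well-formed input, so it can stay undefined.
leaf :: String -> Char
leaf = undefined

-- Hypothetical two-level tree, with '0' as dot and '1' as dash:
-- E = "0", T = "1", I = "00", A = "01", N = "10", M = "11".
tinyMorse :: String -> Char
tinyMorse =
  branch ' '
    (branch 'e' (branch 'i' leaf leaf) (branch 'a' leaf leaf))
    (branch 't' (branch 'n' leaf leaf) (branch 'm' leaf leaf))
```

Here `tinyMorse "01"` is `'a'` and `tinyMorse "1"` is `'t'`: the answer comes from whichever branch you are standing on when the "word" ends, never from a leaf.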

In the Huffman tree, this is not the case. First, the incoming code is not broken into "words". It is all one stream. Second, the decoding happens at the leaf nodes, so the leaf function has to actually do something. Finally, in the decoding process, we have to keep track of what hasn't been decoded so far because we'll need it for the next letter.

The result looks like this:


type Code = String -- an encoded message
type Result = (Char, Code)

branch :: (Code -> Result) -> (Code -> Result) -> (Code -> Result)
branch x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds


leaf :: Code -> (Code -> Result)
leaf [c] = \code -> (c, code)

Note here the difference. The branch function never returns a decoded character. That is pushed into the leaf function which also returns the unprocessed remaining portion of the code.
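As a quick sanity check, the five-letter example tree from earlier can be rebuilt by hand out of these two functions. This is only a sketch: `exampleDecoder` is a name I made up, and `branch` and `leaf` are repeated so the snippet stands on its own.

```haskell
type Code = String -- an encoded message
type Result = (Char, Code)

-- Dispatch on the next symbol and hand the rest to the chosen subtree.
branch :: (Code -> Result) -> (Code -> Result) -> (Code -> Result)
branch x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds

-- Emit the letter together with whatever input is still unread.
leaf :: Code -> (Code -> Result)
leaf [c] = \code -> (c, code)

-- Same shape as the Branch/Leaf expression, but it is a function,
-- not a data structure (hypothetical name).
exampleDecoder :: Code -> Result
exampleDecoder =
  branch (branch (leaf "a") (leaf "b"))
         (branch (leaf "c") (branch (leaf "d") (leaf "e")))
```

Evaluating `exampleDecoder "1100"` gives `('d',"0")`, and `exampleDecoder "0111"` gives `('b',"11")`. Note that both functions are partial as written: input that runs out in the middle of a letter falls through the `case` in `branch` and fails at runtime.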

Now to put it together. In the last installment we had an interpret function that used HuffTree constructors to build a data structure. In this case, we replace those constructors with our two functions:


interpret' :: Program -> Code -> Result
interpret' = head . foldl exec []
  where
    exec (rt:lt:xs) '_' = branch lt rt : xs
    exec xs c = leaf [c] : xs

Instead of building a data structure, we're building a call graph -- a graph of function calls that trace the decoding of a code into a character and a remaining code (the Result type). Each call to interpret' with a Program and a Code argument produces one decoded letter and the rest of the code, like this:


*Huffman> let prog = compile $ buildDecTree "hello world"
*Huffman> prog
"rw_eh__l d_o___"
*Huffman> let code = encode "hello world"
*Huffman> code
"01101010101111100001111000101101"
*Huffman> interpret' prog code
('h',"01010101111100001111000101101")

Subsequent calls to interpret' continue the process:


*Huffman> interpret' prog (snd it)
('e',"10101111100001111000101101")
*Huffman> interpret' prog (snd it)
('l',"101111100001111000101101")
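If you want to try this without the compile, buildDecTree and encode functions from the earlier installments, the following sketch bundles everything needed into one file. I'm assuming here that `Program` is just a `String`; the `prog` and `code` values are copied straight from the session above rather than computed.

```haskell
type Program = String -- assumption: a serialized tree is a plain string
type Code = String
type Result = (Char, Code)

branch :: (Code -> Result) -> (Code -> Result) -> (Code -> Result)
branch x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds

leaf :: Code -> (Code -> Result)
leaf [c] = \code -> (c, code)

-- Run the stack program: '_' pops two decoders and pushes their
-- branch; any other character pushes a leaf for that character.
interpret' :: Program -> Code -> Result
interpret' = head . foldl exec []
  where
    exec (rt:lt:xs) '_' = branch lt rt : xs
    exec xs c = leaf [c] : xs

-- Values taken from the GHCi session above.
prog :: Program
prog = "rw_eh__l d_o___"

code :: Code
code = "01101010101111100001111000101101"
```

Loading this and evaluating `interpret' prog code` should give `('h',"01010101111100001111000101101")`, matching the first step of the session.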

So, to wrap it all up, we recursively build up the decoded message until we've run out of code:


decode' :: Program -> Code -> Message
decode' prog code = runInterp "" code
  where
    runInterp :: Message -> Code -> Message
    runInterp s [] = s
    runInterp s cd = uncurry (\c rest -> runInterp (s++[c]) rest) $ interpret' prog cd

And we can see that it works:


*Huffman> decode' prog code
"hello world"

Ta da!
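One possible variation, offered as a sketch rather than an improvement I've measured: the same loop can be written with `unfoldr` from `Data.List`. The name `decodeAll` is made up, `Program` and the message type are assumed to be plain `String`s, and the supporting definitions are repeated so the snippet compiles on its own.

```haskell
import Data.List (unfoldr)

type Program = String -- assumption: a serialized tree is a plain string
type Code = String
type Result = (Char, Code)

branch :: (Code -> Result) -> (Code -> Result) -> (Code -> Result)
branch x y = \code -> case code of
  '0':ds -> x ds
  '1':ds -> y ds

leaf :: Code -> (Code -> Result)
leaf [c] = \code -> (c, code)

interpret' :: Program -> Code -> Result
interpret' = head . foldl exec []
  where
    exec (rt:lt:xs) '_' = branch lt rt : xs
    exec xs c = leaf [c] : xs

-- Decode letters until the code is used up. The decoder is passed in
-- once, so the call graph is built a single time and reused for every
-- letter, and the message is produced front to back without (++).
decodeAll :: (Code -> Result) -> Code -> String
decodeAll decoder = unfoldr step
  where
    step [] = Nothing
    step cd = Just (decoder cd)
```

With the program and code from the session above, `decodeAll (interpret' "rw_eh__l d_o___") "01101010101111100001111000101101"` yields `"hello world"`.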

Okay, now come the caveats.

First, this was begun as an exercise for myself, just to see if I could do it. It was pretty fun, and I'm happy with the results. But, being done just for me (despite my sharing with you) it's probably got all sorts of problems...

Second, this type of deforestation, as far as I can tell, is not necessarily a gain. For example, depending on the semantics of the language it's implemented in, you may end up just building a whole pile of thunks on the stack that don't actually do you any good. In other words, it may all be for naught... building the intermediate data structure may be just as good or better, and is certainly more clear and understandable.

Third, I really rushed through the end here. This is at least partially because I wanted to get done, but also because Apfelmus does an excellent job, much better than I could do, of explaining this. I encourage you to read his post.

Comments, suggestions, etc. are very, very welcome. And the code is available in various stages on my git server.


Wednesday, September 8, 2010
