Sunday, August 22, 2010

Serializing Huffman Trees

If you've been following, then you know I'm playing with Huffman trees and attempting follow Heinrich Apfelmus' morse code example.

Previously, I built a Huffman Tree data type and some functions for encoding and decoding messages using that structure. This installment is rather short, covering a way to serialize and deserialize a HuffTree structure. This serves two purposes. First, it's a way to encode the Huffman tree so that the recipient of a message can decode it. I make no claims as to its efficiency or anything like that. It works, and is really more of an exercise for myself. The second reason is in my pursuit of mimicking Apfelmus' work, I want to effectively obscure the Huffman tree structure, making it more or less vanish into the call structure.

Compiling a HuffTree and Interpreting the Code



The technique here is pretty straightforward. We will traverse the tree, using a post-order traversal, generating a code to be later used for rebuilding the tree. The code is interpreted as a stack based language. In the language, any symbol, other than '_' (underscore) encodes a push operation of that character onto the stack. An underscore means pop two items off the stack, create a Branch node with them and push the new node onto the stack. As these operations are repeated, assuming a well-formed code, the result will be one node left on the stack containing the reconstructed tree.

Both functions are very simple:


-- here we emit an "_" to signal a branch. This "_" will be used to do
-- a pop/merge/push when we interpret the code later
compile :: HuffTree -> String
compile (Leaf sy _) = [sy]
compile (Branch _ lt rt) = compile lt ++ compile rt ++ "_"


interpret :: [Char] -> HuffTree
interpret = head . foldl exec []
where
exec :: [HuffTree] -> Char -> [HuffTree]
exec (rt:lt:xs) '_' = (Branch 0 lt rt) : xs
exec xs c = (Leaf c 0) : xs



For compiling, simple pattern matching in a recursive tree traversal does the trick. For interpreting, a foldl with a constructor function is all that's needed. Note that the interpret function is stolen almost directly from Apfelmus' work, except that the Branch and Leaf operations are swapped.

These functions work just fine:


*Huffman> let tree = buildDecTree "hello world"
*Huffman> tree
Branch
Branch
Branch
Leaf r
Leaf w
Branch
Leaf e
Leaf h
Branch
Leaf l
Branch
Branch
Leaf
Leaf d
Leaf o

*Huffman> compile tree
"rw_eh__l d_o___"
*Huffman> interpret "rw_eh__l d_o___"
Branch
Branch
Branch
Leaf r
Leaf w
Branch
Leaf e
Leaf h
Branch
Leaf l
Branch
Branch
Leaf
Leaf d
Leaf o



In this compile-interpret sequence, one thing is lost -- the weight of the nodes. This is acceptable because the weights are only used for building the initial tree based upon the character frequencies of the encoded message. Once that step is done, the weights are simply unnecessary.

It would be nice to be able to compare trees for equality, if for no other purpose than to get the machine to tell us that interpret and compile are inverses (modulo the weights). So our HuffTree type needs to be an instance of Eq with an (==) function that is meaningful for our purposes. In this case we want structural equality with the values in the leaf nodes being compared directly. Weights are ignored.


instance Eq HuffTree where
(==) (Branch _ l1 r1) (Branch _ l2 r2) = l1 == l2 && r1 == r2
(==) (Leaf s1 _) (Leaf s2 _) = s1 == s2
(==) _ _ = False


The goal here is trees that are equal should encode and decode in the same way. So the shape of the tree is important. That's what this equality function does.


*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree "hello world")
True
*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree $ reverse "hello world")
True


And we can see that this works. Even for the reversed message, because the tree built from this message has the same structure.

That's it for this installment -- a short one. Next time I'll start messing with the deforestation/fusion stuff.

No comments:

Post a Comment