Sunday, August 22, 2010

Serializing Huffman Trees

If you've been following, then you know I'm playing with Huffman trees and attempting follow Heinrich Apfelmus' morse code example.

Previously, I built a Huffman Tree data type and some functions for encoding and decoding messages using that structure. This installment is rather short, covering a way to serialize and deserialize a HuffTree structure. This serves two purposes. First, it's a way to encode the Huffman tree so that the recipient of a message can decode it. I make no claims as to its efficiency or anything like that. It works, and is really more of an exercise for myself. The second reason is in my pursuit of mimicking Apfelmus' work, I want to effectively obscure the Huffman tree structure, making it more or less vanish into the call structure.

Compiling a HuffTree and Interpreting the Code

The technique here is pretty straightforward. We will traverse the tree, using a post-order traversal, generating a code to be later used for rebuilding the tree. The code is interpreted as a stack based language. In the language, any symbol, other than '_' (underscore) encodes a push operation of that character onto the stack. An underscore means pop two items off the stack, create a Branch node with them and push the new node onto the stack. As these operations are repeated, assuming a well-formed code, the result will be one node left on the stack containing the reconstructed tree.

Both functions are very simple:

-- here we emit an "_" to signal a branch. This "_" will be used to do
-- a pop/merge/push when we interpret the code later
compile :: HuffTree -> String
compile (Leaf sy _) = [sy]
compile (Branch _ lt rt) = compile lt ++ compile rt ++ "_"

interpret :: [Char] -> HuffTree
interpret = head . foldl exec []
exec :: [HuffTree] -> Char -> [HuffTree]
exec (rt:lt:xs) '_' = (Branch 0 lt rt) : xs
exec xs c = (Leaf c 0) : xs

For compiling, simple pattern matching in a recursive tree traversal does the trick. For interpreting, a foldl with a constructor function is all that's needed. Note that the interpret function is stolen almost directly from Apfelmus' work, except that the Branch and Leaf operations are swapped.

These functions work just fine:

*Huffman> let tree = buildDecTree "hello world"
*Huffman> tree
Leaf r
Leaf w
Leaf e
Leaf h
Leaf l
Leaf d
Leaf o

*Huffman> compile tree
"rw_eh__l d_o___"
*Huffman> interpret "rw_eh__l d_o___"
Leaf r
Leaf w
Leaf e
Leaf h
Leaf l
Leaf d
Leaf o

In this compile-interpret sequence, one thing is lost -- the weight of the nodes. This is acceptable because the weights are only used for building the initial tree based upon the character frequencies of the encoded message. Once that step is done, the weights are simply unnecessary.

It would be nice to be able to compare trees for equality, if for no other purpose than to get the machine to tell us that interpret and compile are inverses (modulo the weights). So our HuffTree type needs to be an instance of Eq with an (==) function that is meaningful for our purposes. In this case we want structural equality with the values in the leaf nodes being compared directly. Weights are ignored.

instance Eq HuffTree where
(==) (Branch _ l1 r1) (Branch _ l2 r2) = l1 == l2 && r1 == r2
(==) (Leaf s1 _) (Leaf s2 _) = s1 == s2
(==) _ _ = False

The goal here is trees that are equal should encode and decode in the same way. So the shape of the tree is important. That's what this equality function does.

*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree "hello world")
*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree $ reverse "hello world")

And we can see that this works. Even for the reversed message, because the tree built from this message has the same structure.

That's it for this installment -- a short one. Next time I'll start messing with the deforestation/fusion stuff.

Tuesday, August 17, 2010

Huffman Coding in Haskell

To follow up on my post from last week, here is the beginnings of some fun with Huffman Coding in Haskell. Remember that this was spurred by Heinrich Apfelmus' article using fusion and other fun things with Morse Code.

Huffman Coding

There is nothing revolutionary in here, just a reasonably straight-up Huffman coding algorithm that converts a message into a string of 1's and 0's and then can convert them back to the original message. I borrowed heavily from this blog posting and if you look at the history of my source code you'll see that it was a HUGE help! The code snippets below are just that, snippets. Look to the complete source code to get a working example.

Anyway, the basics are as follows.... Given a message, we count the frequency of each character and build up a dictionary of characters and their respective frequencies.

-- the list of character
type FreqList = [(Char, Int)]

-- function to count character frequency, storing the results in a map.
charFreq :: String -> FreqList
charFreq s = Data.Map.toList $ foldl charFreq' Data.Map.empty s

charFreq' :: Map Char Int -> Char -> Map Char Int
charFreq' m c = Data.Map.insertWith (+) c 1 m

These frequencies are later used as weights for building the Huffman Tree. A Huffman tree is a binary tree built so that higher frequency characters are more shallow leaves and lower frequency characters are deeper leaves. This means that high frequency characters are represented with fewer bits than low frequency characters. This gives an overall shorter stream of bits for encoding the message.

So, we need a data type to represent the Huffman Tree:

-- data structure for huffman trees
data HuffTree = Branch { wt :: Int, l :: HuffTree, r :: HuffTree }
| Leaf { symbol :: Char, wt :: Int }
deriving (Eq)

instance Ord HuffTree where
compare = compare `on` wt

mappend :: HuffTree -> HuffTree -> HuffTree
mappend x y = Branch (wt x + wt y) x y

We've made HuffTree an instance of Ord so that it can be sorted on its wt (weight) field. This is useful for building the tree -- we can keep the nodes in the order we want easily. And the function mappend is handy for "adding" HuffTrees together.

The process for building a Huffman Tree from a list of characters and weights is pretty straightforward. We create a list of HuffTrees, using just Leaf constructors and storing their weights. This list is sorted by weight so that the Leaf nodes that represent the characters with the lowest frequency or weight are first. These HuffTree elements are pulled off the list, combined with the `mappend` function and then sorted back into the list. This process continues until there is only one HuffTree left in the list. This is the Huffman coding tree for our message.

-- produce a decoding tree from a string
buildDecTree :: String -> HuffTree
buildDecTree = build . sort . map (uncurry Leaf) . charFreq
build :: [ HuffTree ] -> HuffTree
build [] = error "Empty Leaf list"
build (t:[]) = t
build (x:y:ts) = build $ Data.List.insert (mappend x y) ts

As usual, Haskell make this sort of operation ridiculously simple and intuitive.

with an instance of Show for HuffTree, we can see the results:

*Huffman> buildDecTree "hello world"
Branch 11
Branch 4
Branch 2
Leaf r 1
Leaf w 1
Branch 2
Leaf e 1
Leaf h 1
Branch 7
Leaf l 3
Branch 4
Branch 2
Leaf 1
Leaf d 1
Leaf o 2

To encode a character using the generated tree, we record the path taken to get to the character emitting a "0" for left branches taken and a "1" for right branches. In the above tree, where the first branch listed is the left branch, the path to the letter "e" is left -> right -> left. Thus the code for "e" is "010", only three bits to represent an 8-bit character. To encode an entire message, we repeat the process for each letter. But traversing the tree for each letter is neither efficient nor elegant. So we flatten the tree into a dictionary of characters mapped to their encodings. To do this, we traverse the tree once, recursively recording and emitting dictionary entries.

-- produce an encoding dictionary from the decoding tree
buildEncDict :: HuffTree -> [(Char, String)]
buildEncDict = buildEncDict' ""

buildEncDict' :: String -> HuffTree -> [(Char, String)]
buildEncDict' s (Leaf sy _) = [(sy,s)]
buildEncDict' s (Branch _ lt rt) = buildEncDict' (s ++ "0") lt
++ buildEncDict' (s ++ "1") rt

Now to encode a character, we simply to a lookup in the dictionary. In this case, it's just a list of character-code pairs but it could easily be, for a larger character set, a search tree keyed by the character storing code values. As an aside, this would be an interesting exercise, to traverse the Huffman tree transforming it into a binary search tree. Regardless, encoding is very straightforward:

-- here is an encoding function to turn a String into a Code
encode :: String -> Code
encode s = encode' (buildEncDict $ buildDecTree s) s

encode' :: [(Char, String)] -> String -> Code
encode' _ [] = []
encode' d (s:ss) = (fromJust $ Prelude.lookup s d) ++ encode' d ss

And finally, to decode a message, you traverse the HuffmanTree directly, turning 0's into left branches and 1's into right branches in the tree traversal. When the traversal reaches a Leaf node, then a character has been decoded.

-- and a decode function to turn a code and a tree into a string
decode :: HuffTree -> Code -> String
decode t code = decode' t code
decode' (Branch _ lt rt) (c:cs) =
case c of
'0' -> decode' lt cs
'1' -> decode' rt cs
_ -> [] -- "otherwise" produces a warning about unused variable
decode' (Leaf s _) cs = s : decode' t cs -- we need the whole tree again
decode' _ [] = []

Note the use of a helper function because to decode each letter, you have to start from the root of the tree. There are likely better ways to do this...

That's the basics of Huffman coding in Haskell, and it works just fine:

*Huffman> let tree = buildDecTree "hello world"
*Huffman> encode "hello world"
*Huffman> decode tree $ encode "hello world"
"hello world"

But there is a problem. To decode the message, you need the coding tree for that message. I'm sure there are a variety of techniques to transmit the tree as part of the message. One of the things Apfelmus did that intrigued me was develop a stack based language and interpreter to represent the morse code tree as a string of characters. I think this would make a fine way to encode the tree into the front of the message. We should be able to extract a string of characters from the Huffman tree for a particular message and use that string of characters to reconstruct the tree on the other side. Now this will add some significant overhead to the output for a given message, but I'm not necessarily concerned with space efficiency here as much as learning about these kinds of transformations.


I previously mentioned some differences between Apfelmus' implementation of the morse code decoder and a similar Huffman decoder. Having gone through the process of building this code, I"ve come to better recognize those differences.

First, as I mentioned before, in the Huffman tree, the leaves are important and the result is an encoding of the routes through the tree to obtain the desired characters. We use the Leaf nodes to know when a particular character has been decoded because the code is an undelineated string of bits. In the morse code solution, the "letters" are delineated. It is clear when a particular letter is complete. So the decoding works slightly differently and the tree is really a trie, where the result is whatever character you land on when the input stream for a particular character ends. I'm not sure what the implications will be for the application of the fusion technique, but we'll see.

When decoding, you don't strictly need the weights anymore, they are only used to build the tree. So transmitting the decoding tree is simpler than it seems at first glance. Again we'll see what comes of that as I progress through this.

Next installment should be a "compiler" to turn a HuffTree into a string of characters that can be used to reconstruct the tree again using an "interpreter". Should be fun!

Sunday, August 8, 2010

A Haskell approach to Huffman Coding

I just ran across Heinrich Apfelmus' discussion of morse code and I think it's pretty cool. I like the progression from a straightforward initial approach to a very clever result. I was really struck by the similarities of parsing morse code with a tree and Huffman Coding. My goal this week is to try to apply Apfelmus' technique to Huffman coding.

There are some important distinctions between the morse code tree and Huffman Coding. First, in Huffman coding, the leaves are important. The symbols are stored in the leaves of the tree instead of in the nodes as in the morse code solution. Additionally, the Huffman tree structure itself encodes the frequency of the symbols in the source, a consideration that is not required in the morse code. This changes the interpreter required for constructing the tree. Also, the Huffman coding tree should be constructed specifically for the data set it is being used to encode. For a reasonably consistent data set, like large amounts of English text, the frequency of symbols for the language in general can certainly be used with reasonable results. But, for encoding other kinds of data, ideally the source should be used to generate frequencies of symbols unique to that source. I'm probably going to ignore this and just use some reasonable standard frequencies instead.

So, that's my plan for the week, we'll see how it goes.