A lamb, duh!: Huffman Coding in Haskell

To follow up on my post from last week, here is the beginnings of some fun with Huffman Coding in Haskell. Remember that this was spurred by Heinrich Apfelmus' article using fusion and other fun things with Morse Code.

Huffman Coding

There is nothing revolutionary in here, just a reasonably straight-up Huffman coding algorithm that converts a message into a string of 1's and 0's and then can convert them back to the original message. I borrowed heavily from this blog posting and if you look at the history of my source code you'll see that it was a HUGE help! The code snippets below are just that, snippets. Look to the complete source code to get a working example.

Anyway, the basics are as follows.... Given a message, we count the frequency of each character and build up a dictionary of characters and their respective frequencies.


-- the list of character
frequencies
type FreqList = [(Char, Int)]

-- function to count character frequency, storing the results in a map.
charFreq :: String -> FreqList
charFreq s = Data.Map.toList $ foldl charFreq' Data.Map.empty s

charFreq' :: Map Char Int -> Char -> Map Char Int
charFreq' m c = Data.Map.insertWith (+) c 1 m

These frequencies are later used as weights for building the Huffman Tree. A Huffman tree is a binary tree built so that higher frequency characters are more shallow leaves and lower frequency characters are deeper leaves. This means that high frequency characters are represented with fewer bits than low frequency characters. This gives an overall shorter stream of bits for encoding the message.

So, we need a data type to represent the Huffman Tree:


-- data structure for huffman trees                                                                                                                                                                        
data HuffTree = Branch { wt :: Int, l :: HuffTree, r :: HuffTree }
       | Leaf { symbol :: Char, wt :: Int }
             deriving (Eq)

instance Ord HuffTree where
        compare = compare `on` wt

mappend :: HuffTree -> HuffTree -> HuffTree
mappend x y = Branch (wt x + wt y) x y

We've made HuffTree an instance of Ord so that it can be sorted on its wt (weight) field. This is useful for building the tree -- we can keep the nodes in the order we want easily. And the function mappend is handy for "adding" HuffTrees together.

The process for building a Huffman Tree from a list of characters and weights is pretty straightforward. We create a list of HuffTrees, using just Leaf constructors and storing their weights. This list is sorted by weight so that the Leaf nodes that represent the characters with the lowest frequency or weight are first. These HuffTree elements are pulled off the list, combined with the `mappend` function and then sorted back into the list. This process continues until there is only one HuffTree left in the list. This is the Huffman coding tree for our message.


-- produce a decoding tree from a string                                                                                                                                                                   
buildDecTree :: String -> HuffTree
buildDecTree = build . sort . map (uncurry Leaf) . charFreq
            where
            build :: [ HuffTree ] -> HuffTree
            build [] = error "Empty Leaf list"
            build (t:[]) = t
            build (x:y:ts) = build $ Data.List.insert (mappend x y) ts

As usual, Haskell make this sort of operation ridiculously simple and intuitive.

with an instance of Show for HuffTree, we can see the results:


*Huffman> buildDecTree "hello world"
Branch 11
 Branch 4
   Branch 2
     Leaf r 1
     Leaf w 1
   Branch 2
     Leaf e 1
     Leaf h 1
 Branch 7
   Leaf l 3
   Branch 4
     Branch 2
       Leaf   1
       Leaf d 1
     Leaf o 2

To encode a character using the generated tree, we record the path taken to get to the character emitting a "0" for left branches taken and a "1" for right branches. In the above tree, where the first branch listed is the left branch, the path to the letter "e" is left -> right -> left. Thus the code for "e" is "010", only three bits to represent an 8-bit character. To encode an entire message, we repeat the process for each letter. But traversing the tree for each letter is neither efficient nor elegant. So we flatten the tree into a dictionary of characters mapped to their encodings. To do this, we traverse the tree once, recursively recording and emitting dictionary entries.


-- produce an encoding dictionary from the decoding tree                                                                                                                                                   
buildEncDict :: HuffTree -> [(Char, String)]
buildEncDict = buildEncDict' ""

buildEncDict' :: String -> HuffTree -> [(Char, String)]
buildEncDict' s (Leaf sy _) = [(sy,s)]
buildEncDict' s (Branch _ lt rt) = buildEncDict' (s ++ "0") lt
                                  ++ buildEncDict' (s ++ "1") rt

Now to encode a character, we simply to a lookup in the dictionary. In this case, it's just a list of character-code pairs but it could easily be, for a larger character set, a search tree keyed by the character storing code values. As an aside, this would be an interesting exercise, to traverse the Huffman tree transforming it into a binary search tree. Regardless, encoding is very straightforward:


-- here is an encoding function to turn a String into a Code                                                                                                                                               
encode :: String -> Code
encode s = encode' (buildEncDict $ buildDecTree s) s

encode' :: [(Char, String)] -> String -> Code
encode' _ [] = []
encode' d (s:ss) = (fromJust $ Prelude.lookup s d) ++ encode' d ss

And finally, to decode a message, you traverse the HuffmanTree directly, turning 0's into left branches and 1's into right branches in the tree traversal. When the traversal reaches a Leaf node, then a character has been decoded.


-- and a decode function to turn a code and a tree into a string                                                                                                                                           
decode :: HuffTree -> Code -> String
decode t code = decode' t code
      where
      decode' (Branch _ lt rt) (c:cs) =
        case c of
        '0' -> decode' lt cs
        '1' -> decode' rt cs
        _ -> [] -- "otherwise" produces a warning about unused variable                                                                                                                             
      decode' (Leaf s _) cs = s : decode' t cs -- we need the whole tree again                                                                                                                            
      decode' _ [] = []

Note the use of a helper function because to decode each letter, you have to start from the root of the tree. There are likely better ways to do this...

That's the basics of Huffman coding in Haskell, and it works just fine:


*Huffman> let tree = buildDecTree "hello world"
*Huffman> encode "hello world"
"01101010101111100001111000101101"
*Huffman> decode tree $ encode "hello world"
"hello world"

But there is a problem. To decode the message, you need the coding tree for that message. I'm sure there are a variety of techniques to transmit the tree as part of the message. One of the things Apfelmus did that intrigued me was develop a stack based language and interpreter to represent the morse code tree as a string of characters. I think this would make a fine way to encode the tree into the front of the message. We should be able to extract a string of characters from the Huffman tree for a particular message and use that string of characters to reconstruct the tree on the other side. Now this will add some significant overhead to the output for a given message, but I'm not necessarily concerned with space efficiency here as much as learning about these kinds of transformations.

Observations

I previously mentioned some differences between Apfelmus' implementation of the morse code decoder and a similar Huffman decoder. Having gone through the process of building this code, I"ve come to better recognize those differences.

First, as I mentioned before, in the Huffman tree, the leaves are important and the result is an encoding of the routes through the tree to obtain the desired characters. We use the Leaf nodes to know when a particular character has been decoded because the code is an undelineated string of bits. In the morse code solution, the "letters" are delineated. It is clear when a particular letter is complete. So the decoding works slightly differently and the tree is really a trie, where the result is whatever character you land on when the input stream for a particular character ends. I'm not sure what the implications will be for the application of the fusion technique, but we'll see.

When decoding, you don't strictly need the weights anymore, they are only used to build the tree. So transmitting the decoding tree is simpler than it seems at first glance. Again we'll see what comes of that as I progress through this.

Next installment should be a "compiler" to turn a HuffTree into a string of characters that can be used to reconstruct the tree again using an "interpreter". Should be fun!

A lamb, duh!

Tuesday, August 17, 2010

Huffman Coding in Haskell

Huffman Coding

Observations

2 comments: