Wednesday, September 15, 2010

\$100 Words

My eleven-year-old daughter came home with a homework problem. The task is to find \$100 words: a \$100 word is a word whose letters, when assigned values, sum to 100. Each one she finds is worth "\$100" in their little game at school. Okay, cool! But the values are assigned like this:


  • a = 1
  • b = 3
  • ...
  • m = 25
  • n = 26
  • o = 24
  • ...
  • z = 2



This is not exactly the most intuitive way to number the alphabet, and it certainly makes it hard to evaluate a given word, as the typical mapping we're used to (a=1, b=2, ...) doesn't help. (There is a pattern, though: the front half of the alphabet, a through m, gets the odd values 1 through 25, while the back half, z down through n, gets the even values 2 through 26.) I gave up after a few minutes and told her she should learn to program to solve the problem.

She wasn't too thrilled with that idea, but here's my simple Haskell solution:


import Data.Char (isUpper)

values :: [(Char, Int)]
values = zip (interleave ['a'..'m'] ['z','y'..'n']) [1..26]
  where
    interleave :: [a] -> [a] -> [a]
    interleave (x:xs) (y:ys) = x : y : interleave xs ys
    interleave _ _ = []

charValue :: Char -> Int
charValue c = case lookup c values of
                Nothing -> 0
                Just v  -> v

wordValue :: String -> Int
wordValue = foldl (\s c -> s + charValue c) 0

goodWord :: String -> Bool
goodWord word@(x:_) = (not $ isUpper x) && wordValue word == 100
goodWord []         = False

main :: IO ()
main = interact (unlines . filter goodWord . lines)


Just feed it a list of words, like the handy /etc/dictionaries-common/words, and away it goes...


$ ./DollarWords < /etc/dictionaries-common/words
abductor's
abductors
abidings
...

$ ./DollarWords < /etc/dictionaries-common/words | wc -l
652
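
As a quick sanity check (my own hand calculation, not part of the original output), "abductors" really does sum to \$100 under the scheme above:

a + b + d + u + c + t + o + r + s
  = 1 + 3 + 7 + 12 + 5 + 14 + 24 + 18 + 16
  = 100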



So, if I were to unleash this on her poor unsuspecting teachers, her team would rocket ahead with \$652,000! But I guess that wouldn't be fair, since she didn't write the program...

I'd love to see other versions of solutions for this problem.

Wednesday, September 8, 2010

Decoding Huffman codes without the tree

Okay, so, last time I demonstrated how to serialize a Huffman decoding tree into a simple stack-based language for rebuilding the tree. This was pretty interesting in its own right, in my opinion, but it was only a step down the road to the material in this installment: how to decode the Huffman code without the Huffman decoding tree. That is, the intervening data structure, the Huffman decoding tree, is completely eliminated, replaced with a call graph that does the decoding instead.

Like everything so far, a lot is owed to Heinrich Apfelmus' morse code example, from which this work is directly taken.

This technique of removing the intermediate data structure is called deforestation. In this case, when decoding a string of encoded characters, the Huffman decoding tree is built and then traversed to find a decoded letter. The construction of the tree is a recursive application of the tree constructors starting at the root of the tree. This construction actually happens backwards, starting with leaf nodes which are assembled into branch nodes repeatedly until there is only one branch node, the root, left. In a lazy language like Haskell, these nodes may not be evaluated, but instead left as a tree of thunks, unevaluated function calls, where the function calls are calls to the node constructors. Then the decoding is a recursive traversal of that tree looking for a leaf node to decode the letter.

The construction of the tree looks essentially like this:


Branch (Branch (Leaf 'a') (Leaf 'b')) (Branch (Leaf 'c') (Branch (Leaf 'd') (Leaf 'e')))


(with some stuff left out for clarity).

Apfelmus replaces the constructor calls with functions that tell what to do when a particular character is read from the input stream. And this is where one of the fundamental differences between the morse code tree and the Huffman coding tree comes to light. The morse code tree is essentially a trie -- the resulting character returned is based on where you are in the tree when you run out of input (morse code is broken into "words" for each letter). Each node in the morse code tree has a letter associated with it. Each dash or dot moves you further into the tree and to a different letter (allowing, I suppose, a form of partial result, though it has no meaning). When the symbols in a particular morse code "word" are used up, whatever node you're looking at is the character that has been decoded.

To implement this, Apfelmus uses the following (here I use 1's and 0's instead of dashes and dots):


branch c x y = \code -> case code of
    '0':ds -> x ds
    '1':ds -> y ds
    []     -> c

leaf = undefined


The leaf can be undefined because well-formed morse code will never reach the leaves of the tree. And if there is no input left (the last case in the branch function), then you're done: emit the character. I won't go into the details of how these functions are used; you can read his post yourself.

In the Huffman tree, this is not the case. First, the incoming code is not broken into "words". It is all one stream. Secondly, the decoding happens at the leaf nodes so the leaf function has to actually do something. Finally, in the decoding process, we have to keep track of what hasn't been decoded so far because we'll need it for the next letter.

The result looks like this:


type Code = String          -- an encoded message
type Result = (Char, Code)  -- a decoded character plus the leftover code

branch :: (Code -> Result) -> (Code -> Result) -> (Code -> Result)
branch x y = \code -> case code of
    '0':ds -> x ds
    '1':ds -> y ds

leaf :: Code -> (Code -> Result)
leaf [c] = \code -> (c, code)


Note the difference here. The branch function never returns a decoded character. That work is pushed into the leaf function, which also returns the unprocessed remaining portion of the code.

Now to put it together. In the last installment we had an interpret function that used HuffTree constructors to build a data structure. In this case, we replace those constructors with our two functions:


-- Program is the serialized tree from the previous installment (a String)
interpret' :: Program -> Code -> Result
interpret' = head . foldl exec []
  where
    exec (rt:lt:xs) '_' = branch lt rt : xs
    exec xs         c   = leaf [c] : xs


Instead of building a data structure, we're building a call graph -- a graph of function calls that traces the decoding of a code into a character and a remaining code (the Result type). Each call to interpret' with a Program and a Code argument produces one decoded letter and the rest of the code, like this:


*Huffman> let prog = compile $ buildDecTree "hello world"
*Huffman> prog
"rw_eh__l d_o___"
*Huffman> let code = encode "hello world"
*Huffman> code
"01101010101111100001111000101101"
*Huffman> interpret' prog code
('h',"01010101111100001111000101101")


Subsequent calls to interpret' continue the process:


*Huffman> interpret' prog (snd it)
('e',"10101111100001111000101101")
*Huffman> interpret' prog (snd it)
('l',"101111100001111000101101")


So, to wrap it all up, we recursively build up the decoded message until we've run out of code:


decode' :: Program -> Code -> Message
decode' prog code = runInterp "" code
  where
    runInterp :: Message -> Code -> Message
    runInterp s [] = s
    runInterp s cd = uncurry (\c rest -> runInterp (s ++ [c]) rest) $ interpret' prog cd


And we can see that it works:


*Huffman> decode' prog code
"hello world"


Ta da!

Okay, now come the caveats.

First, this was begun as an exercise for myself, just to see if I could do it. It was pretty fun, and I'm happy with the results. But, being done just for me (despite my sharing it with you), it's probably got all sorts of problems...

Second, this type of deforestation, as far as I can tell, is not necessarily a gain. For example, depending on the semantics of the language it's implemented in, you may end up just building a whole pile of thunks on the stack that don't actually do you any good. In other words, it may all be for naught... building the intermediate data structure may be just as good or better, and is certainly more clear and understandable.

Third, I really ran through the end here. This is at least partially because I wanted to get done, but also because Apfelmus does an excellent job, much better than I could do, of explaining this. I encourage you to read his post.

Comments, suggestions etc are very very welcome. And the code is available in various stages on my git server.

Sunday, August 22, 2010

Serializing Huffman Trees

If you've been following, then you know I'm playing with Huffman trees and attempting to follow Heinrich Apfelmus' morse code example.

Previously, I built a Huffman Tree data type and some functions for encoding and decoding messages using that structure. This installment is rather short, covering a way to serialize and deserialize a HuffTree structure. This serves two purposes. First, it's a way to encode the Huffman tree so that the recipient of a message can decode it. I make no claims as to its efficiency or anything like that; it works, and is really more of an exercise for myself. The second reason is that, in my pursuit of mimicking Apfelmus' work, I want to effectively obscure the Huffman tree structure, making it more or less vanish into the call structure.

Compiling a HuffTree and Interpreting the Code



The technique here is pretty straightforward. We traverse the tree in post-order, generating a code to be used later for rebuilding the tree. The code is interpreted as a stack-based language. In the language, any symbol other than '_' (underscore) encodes a push of that character onto the stack. An underscore means pop two items off the stack, create a Branch node from them, and push the new node onto the stack. As these operations are repeated, assuming a well-formed code, the result is a single node left on the stack containing the reconstructed tree.

Both functions are very simple:


-- here we emit an "_" to signal a branch. This "_" will be used to do
-- a pop/merge/push when we interpret the code later
compile :: HuffTree -> String
compile (Leaf sy _) = [sy]
compile (Branch _ lt rt) = compile lt ++ compile rt ++ "_"


interpret :: [Char] -> HuffTree
interpret = head . foldl exec []
where
exec :: [HuffTree] -> Char -> [HuffTree]
exec (rt:lt:xs) '_' = (Branch 0 lt rt) : xs
exec xs c = (Leaf c 0) : xs
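
To make the stack discipline concrete, here's a hand trace (my own illustration, not from the original post) of interpret consuming the fragment "rw_":

-- input  stack afterwards (top of stack on the left)
-- 'r'    [Leaf 'r' 0]
-- 'w'    [Leaf 'w' 0, Leaf 'r' 0]
-- '_'    [Branch 0 (Leaf 'r' 0) (Leaf 'w' 0)]

Note that the pop order matters: the first item popped becomes the right child, which is why exec's pattern is (rt:lt:xs).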



For compiling, simple pattern matching in a recursive tree traversal does the trick. For interpreting, a foldl with a constructor function is all that's needed. Note that the interpret function is stolen almost directly from Apfelmus' work, except that the Branch and Leaf operations are swapped.

These functions work just fine:


*Huffman> let tree = buildDecTree "hello world"
*Huffman> tree
Branch
  Branch
    Branch
      Leaf r
      Leaf w
    Branch
      Leaf e
      Leaf h
  Branch
    Leaf l
    Branch
      Branch
        Leaf
        Leaf d
      Leaf o

*Huffman> compile tree
"rw_eh__l d_o___"
*Huffman> interpret "rw_eh__l d_o___"
Branch
  Branch
    Branch
      Leaf r
      Leaf w
    Branch
      Leaf e
      Leaf h
  Branch
    Leaf l
    Branch
      Branch
        Leaf
        Leaf d
      Leaf o


In this compile-interpret sequence, one thing is lost -- the weight of the nodes. This is acceptable because the weights are only used for building the initial tree based upon the character frequencies of the encoded message. Once that step is done, the weights are simply unnecessary.

It would be nice to be able to compare trees for equality, if for no other purpose than to have the machine tell us that interpret and compile are inverses (modulo the weights). So our HuffTree type needs to be an instance of Eq, with an (==) function that is meaningful for our purposes. In this case we want structural equality, with the values in the leaf nodes compared directly. Weights are ignored.


instance Eq HuffTree where
  (==) (Branch _ l1 r1) (Branch _ l2 r2) = l1 == l2 && r1 == r2
  (==) (Leaf s1 _)      (Leaf s2 _)      = s1 == s2
  (==) _ _                               = False


The goal here is that trees that are equal should encode and decode in the same way, so the shape of the tree is what matters. That's what this equality function captures.


*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree "hello world")
True
*Huffman> buildDecTree "hello world" == (interpret . compile $ buildDecTree $ reverse "hello world")
True


And we can see that this works, even for the reversed message, because the tree built from the reversed message has the same structure.

That's it for this installment -- a short one. Next time I'll start messing with the deforestation/fusion stuff.

Tuesday, August 17, 2010

Huffman Coding in Haskell

To follow up on my post from last week, here is the beginning of some fun with Huffman Coding in Haskell. Remember that this was spurred by Heinrich Apfelmus' article using fusion and other fun things with Morse Code.

Huffman Coding


There is nothing revolutionary in here, just a reasonably straight-up Huffman coding algorithm that converts a message into a string of 1's and 0's and can then convert it back to the original message. I borrowed heavily from this blog posting, and if you look at the history of my source code you'll see that it was a HUGE help! The code snippets below are just that, snippets. Look to the complete source code to get a working example.

Anyway, the basics are as follows.... Given a message, we count the frequency of each character and build up a dictionary of characters and their respective frequencies.


import Data.Map (Map)
import qualified Data.Map

-- the list of character frequencies
type FreqList = [(Char, Int)]

-- function to count character frequency, storing the results in a map.
charFreq :: String -> FreqList
charFreq s = Data.Map.toList $ foldl charFreq' Data.Map.empty s

charFreq' :: Map Char Int -> Char -> Map Char Int
charFreq' m c = Data.Map.insertWith (+) c 1 m
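
For example (output reconstructed by hand here; Data.Map.toList returns pairs in ascending key order):

*Huffman> charFreq "hello world"
[(' ',1),('d',1),('e',1),('h',1),('l',3),('o',2),('r',1),('w',1)]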


These frequencies are later used as weights for building the Huffman tree. A Huffman tree is a binary tree built so that higher-frequency characters are shallower leaves and lower-frequency characters are deeper leaves. This means that high-frequency characters are represented with fewer bits than low-frequency characters, which gives an overall shorter stream of bits encoding the message.

So, we need a data type to represent the Huffman Tree:


import Data.Function (on)

-- data structure for huffman trees
data HuffTree = Branch { wt :: Int, l :: HuffTree, r :: HuffTree }
              | Leaf { symbol :: Char, wt :: Int }
                deriving (Eq)

instance Ord HuffTree where
    compare = compare `on` wt

mappend :: HuffTree -> HuffTree -> HuffTree
mappend x y = Branch (wt x + wt y) x y


We've made HuffTree an instance of Ord so that it can be sorted on its wt (weight) field. This is useful for building the tree -- we can easily keep the nodes in the order we want. And the function mappend is handy for "adding" HuffTrees together.
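
For instance (display reconstructed by hand here, assuming the Show instance used below):

*Huffman> mappend (Leaf 'a' 1) (Leaf 'b' 2)
Branch 3
  Leaf a 1
  Leaf b 2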

The process for building a Huffman tree from a list of characters and weights is pretty straightforward. We create a list of HuffTrees, using just Leaf constructors and storing their weights. This list is sorted by weight so that the Leaf nodes representing the lowest-frequency characters come first. The first two elements are pulled off the list, combined with the mappend function, and the result is sorted back into the list. This process continues until there is only one HuffTree left in the list: the Huffman coding tree for our message.


-- produce a decoding tree from a string
buildDecTree :: String -> HuffTree
buildDecTree = build . sort . map (uncurry Leaf) . charFreq
  where
    build :: [HuffTree] -> HuffTree
    build []       = error "Empty Leaf list"
    build [t]      = t
    build (x:y:ts) = build $ Data.List.insert (mappend x y) ts


As usual, Haskell makes this sort of operation ridiculously simple and intuitive.

With an instance of Show for HuffTree, we can see the results:


*Huffman> buildDecTree "hello world"
Branch 11
  Branch 4
    Branch 2
      Leaf r 1
      Leaf w 1
    Branch 2
      Leaf e 1
      Leaf h 1
  Branch 7
    Leaf l 3
    Branch 4
      Branch 2
        Leaf   1
        Leaf d 1
      Leaf o 2


To encode a character using the generated tree, we record the path taken to get to the character, emitting a "0" for left branches taken and a "1" for right branches. In the above tree, where the first branch listed is the left branch, the path to the letter "e" is left -> right -> left. Thus the code for "e" is "010", only three bits to represent an 8-bit character. To encode an entire message, we repeat the process for each letter.

But traversing the tree for each letter is neither efficient nor elegant. So we flatten the tree into a dictionary of characters mapped to their encodings. To do this, we traverse the tree once, recursively recording and emitting dictionary entries.


-- produce an encoding dictionary from the decoding tree
buildEncDict :: HuffTree -> [(Char, String)]
buildEncDict = buildEncDict' ""

buildEncDict' :: String -> HuffTree -> [(Char, String)]
buildEncDict' s (Leaf sy _)      = [(sy, s)]
buildEncDict' s (Branch _ lt rt) = buildEncDict' (s ++ "0") lt
                                ++ buildEncDict' (s ++ "1") rt
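
For the "hello world" tree above, the dictionary comes out like this (output reconstructed by hand):

*Huffman> buildEncDict $ buildDecTree "hello world"
[('r',"000"),('w',"001"),('e',"010"),('h',"011"),('l',"10"),(' ',"1100"),('d',"1101"),('o',"111")]

Note that 'e' maps to "010", matching the left -> right -> left path traced above.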


Now to encode a character, we simply do a lookup in the dictionary. In this case it's just a list of character-code pairs, but for a larger character set it could easily be a search tree keyed by the character, storing code values. As an aside, it would be an interesting exercise to traverse the Huffman tree, transforming it into a binary search tree. Regardless, encoding is very straightforward:


-- here is an encoding function to turn a String into a Code
-- (Code is a synonym for String; fromJust comes from Data.Maybe)
encode :: String -> Code
encode s = encode' (buildEncDict $ buildDecTree s) s

encode' :: [(Char, String)] -> String -> Code
encode' _ []     = []
encode' d (s:ss) = (fromJust $ Prelude.lookup s d) ++ encode' d ss


And finally, to decode a message, we traverse the HuffTree directly, turning 0's into left branches and 1's into right branches. When the traversal reaches a Leaf node, a character has been decoded.


-- and a decode function to turn a code and a tree into a string
decode :: HuffTree -> Code -> String
decode t code = decode' t code
  where
    decode' (Branch _ lt rt) (c:cs) =
        case c of
          '0' -> decode' lt cs
          '1' -> decode' rt cs
          _   -> []  -- "otherwise" produces a warning about unused variable
    decode' (Leaf s _) cs = s : decode' t cs  -- we need the whole tree again
    decode' _ []          = []


Note the use of a helper function: to decode each letter, you have to start from the root of the tree. There are likely better ways to do this...

That's the basics of Huffman coding in Haskell, and it works just fine:


*Huffman> let tree = buildDecTree "hello world"
*Huffman> encode "hello world"
"01101010101111100001111000101101"
*Huffman> decode tree $ encode "hello world"
"hello world"


But there is a problem. To decode the message, you need the coding tree for that message. I'm sure there are a variety of techniques to transmit the tree as part of the message. One of the things Apfelmus did that intrigued me was develop a stack based language and interpreter to represent the morse code tree as a string of characters. I think this would make a fine way to encode the tree into the front of the message. We should be able to extract a string of characters from the Huffman tree for a particular message and use that string of characters to reconstruct the tree on the other side. Now this will add some significant overhead to the output for a given message, but I'm not necessarily concerned with space efficiency here as much as learning about these kinds of transformations.
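
As a rough sketch of the framing I have in mind (hypothetical -- transmit, receive, and the '|' separator are my own inventions here, leaning on the compile/interpret pair that gets built in the next installment, and a real scheme would have to keep the separator from colliding with message symbols):

-- prepend the compiled tree to the encoded message, with a separator
transmit :: String -> String
transmit msg = compile (buildDecTree msg) ++ "|" ++ encode msg

-- the receiver splits at the separator, rebuilds the tree, and decodes
receive :: String -> String
receive s = let (prog, _:code) = break (== '|') s
            in decode (interpret prog) code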

Observations


I previously mentioned some differences between Apfelmus' implementation of the morse code decoder and a similar Huffman decoder. Having gone through the process of building this code, I've come to better recognize those differences.

First, as I mentioned before, in the Huffman tree the leaves are important, and the result is an encoding of the routes through the tree to the desired characters. We use the Leaf nodes to know when a particular character has been decoded, because the code is an undelineated string of bits. In the morse code solution, the "letters" are delineated; it is clear when a particular letter is complete. So the decoding works slightly differently, and the tree is really a trie, where the result is whatever character you land on when the input stream for a particular letter ends. I'm not sure what the implications will be for the application of the fusion technique, but we'll see.

When decoding, you don't strictly need the weights anymore; they are only used to build the tree. So transmitting the decoding tree is simpler than it seems at first glance. Again, we'll see what comes of that as I progress through this.

Next installment should be a "compiler" to turn a HuffTree into a string of characters that can be used to reconstruct the tree again using an "interpreter". Should be fun!

Sunday, August 8, 2010

A Haskell approach to Huffman Coding

I just ran across Heinrich Apfelmus' discussion of morse code and I think it's pretty cool. I like the progression from a straightforward initial approach to a very clever result. I was really struck by the similarities of parsing morse code with a tree and Huffman Coding. My goal this week is to try to apply Apfelmus' technique to Huffman coding.

There are some important distinctions between the morse code tree and Huffman Coding. First, in Huffman coding, the leaves are important: the symbols are stored in the leaves of the tree instead of in the nodes, as in the morse code solution. Additionally, the Huffman tree structure itself encodes the frequency of the symbols in the source, a consideration that is not required for morse code. This changes the interpreter required for constructing the tree. Also, the Huffman coding tree should be constructed specifically for the data set it is being used to encode. For a reasonably consistent data set, like large amounts of English text, the frequency of symbols in the language in general can certainly be used with reasonable results. But for encoding other kinds of data, the source should ideally be used to generate frequencies of symbols unique to that source. I'm probably going to ignore this and just use some reasonable standard frequencies instead.

So, that's my plan for the week, we'll see how it goes.

Monday, March 22, 2010

You know you're a geek when...

You know you're a geek when you catch yourself trying to explain to your 12-year-old kid how the pig-latin transformation is not an isomorphism, but is a function (at least it seems so at first glance). And then things diverge into whether one is discussing typographical or aural pig-latin.

The anecdotal evidence is pretty clear: the English word "pay" becomes "ay-pay" while the English word "ape" becomes "ape-ay"*. Typographically they are quite distinct, but aurally they can be quite similar, even identical, depending on one's speech patterns. So spoken pig-latin is not an isomorphism, relying on contextual clues to allow easy translation. But typographical pig-latin should be very straightforward, even mechanical, to translate. I'm not even going to go into the occasional pig-latinized French we have to bust out around here when the kids are eavesdropping.
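
For the typographical version, a toy sketch (my own, handling only a single leading consonant, which is enough for "pay" and "ape") might look like:

-- a minimal sketch of typographical pig-latin; only handles a single
-- leading consonant
pigLatin :: String -> String
pigLatin word@(c:cs)
  | c `elem` "aeiou" = word ++ "-ay"             -- "ape" -> "ape-ay"
  | otherwise        = cs ++ "-" ++ [c] ++ "ay"  -- "pay" -> "ay-pay"
pigLatin [] = []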

Anyway, that's just a little food for thought ;)


* Remember that words which begin with a vowel sound merely have the "ay" syllable appended.

Saturday, January 30, 2010

Grad School! and other stuff

I've been letting this languish for a variety of reasons, not least because I'm just really busy. Carrying 18 credits at school while trying to prep a house for sale is not for the faint of heart. So, here is a bit of an update on what I'm not quite getting accomplished.

Grad School -- I just got my first grad school acceptance letter. That makes it official that I'm headed to grad school in the fall. Yay! But at the same time that makes everything else much more critical.

House, Business -- I'm trying to sell my businesses and our house. Both of these are fairly huge tasks in their own right. I've got some interest in the bar that may pan out to some kind of sale soon, but who knows. The house is mostly an effort in trash hauling. It's amazing how much crap can accumulate in 10 years! We've got to get about 2/3 of our stuff *out* of the house to make it look acceptable to a buyer. Then we have to somehow *live* without 2/3 of our stuff. Luckily, about half of that 2/3 is probably trash of some kind or another. Can you imagine? 1/3 of our stuff is essentially trash! This is going to be good for us in the long run.

School -- I'm carrying a big load at the moment (and next quarter too) in order to get done this spring. It's going to work out, but it's a *lot* of work. Thankfully, some of it's fun. My Senior Project is a great little research project examining the latency of a video system in Linux using GStreamer and a real-time kernel. Pretty cool.

Side stuff -- the reading list gets longer all the time, especially with impending grad school admission. I've got to catch up on important stuff I feel I'm missing from my undergrad education. Specifically, I want to read up on programming language semantics and compiler design, and play around with some OS stuff. I may have to write the typical crappy little OS project that never gets beyond the bootloader. The projection pursuit project is actually pretty much done. I have one little problem to fix before I can call it a finished prototype, but I'm not sure when I'll get to that.

So, that's the update for now. Cheers everyone!