Sometimes problem solving requires just a lot of thought. And I don't mean brute force thought, but the gentle, patient thought that comes from relaxing about a problem and just letting your brain work on it for a while. I find I come up with some of my best ideas in the shower. Yeah, I know... But, when something has really been bothering me -- remaining unsolved despite my best efforts -- the solutions typically come at night. Something wakes me up and then I'm up and my mind won't stop and then the solution comes.
Such it was last night. I haven't had a lot of time to put into this effort lately, and what effort I've made has been frustrating. If you look at the original Octave source code you'll see that there are a lot of parameters controlling the algorithm. We've got the raw data itself, the size of the gradient probe, the proportion of the gradient to use, the number of iterations. It's a lot of stuff to cart around. But, I'll get to that in a moment.
First, I had spent a lot of time writing some matrix and vector operators. This was a good exercise, though I've ultimately thrown all that code out. One crucial part of this algorithm is the use of the right singular vectors from the singular value decomposition. I certainly didn't want to write an svd function for this exercise. Google to the rescue! I found hmatrix, and a solution to all my matrix handling problems. So, using hmatrix caused me to dig back through the bits of code I'd already written. At the core of the projection pursuit algorithm is the calculation of the gradient. This solution uses an estimation of the gradient by probing in all dimensions and measuring the change in the kurtosis.
So, let's define some helper functions:
{-- calculate modified kurtosis --}
kurtosis :: Vector Double -> Double
kurtosis y = mean(y) ** 4 - 3
{-- calculate the mean of a vector --}
mean :: Vector Double -> Double
mean y = go 0 0 $ toList y
where
go :: Double -> Int -> [Double] -> Double
go s l [] = s / fromIntegral l
go s l (x:xs) = go (s+x) (l+1) xs
{- create a list of vectors to probe h distance in all directions -}
hs :: Double -> Int -> [Vector Double]
hs h r = toRows . diag . fromList . take r $ repeat h
Using these, we can calculate the estimated gradient:
{-- calculate the gradient of a function given a bunch of starting information
f - the function to compute gradient of
h - the probe distance... the smaller the better
z - the raw data set needed for future projections
w - a starting vector/point at which to calculate the gradient --}
gradient :: (Vector Double -> Double) -> Double -> Matrix Double -> Vector Double -> Vector Double
gradient f h z w = probe ws
where
ws = map ((+) w) . hs h $ rows z
k = f $ w <> z
probe :: [Vector Double] -> Vector Double
probe = fromList . probe'
probe' :: [Vector Double] -> [Double]
probe' [] = []
probe' (x:xs) = (deltaK x) / h : (probe' $ xs)
deltaK :: Vector Double -> Double
deltaK x = k - f (x <> z)
This is really pretty straightforward. We calculate a starting kurtosis value, generate a bunch a vectors that shift h-amount in all dimensions, calculate the kurtosis for each of those and concatenate them into a new vector for the gradient.
The real breakthrough for me came when I was struggling with how ugly it was going to be to pass around all this junk. Just ugly and I can't stand it. So last night, while I was lying awake fretting about this, it finally hit me. I don't have to pass around all this stuff. I can just partially apply the gradient function to all the algorithm control parameters, and the raw data, and then just pass that around. So at some point I'll do this:
{-- build a gradient function for *this* problem --}
grad :: Vector Double -> Vector Double
grad = gradient kurtosis h z
Now I can just pass around grad and it will carry around all the baggage for me. I know. This is a really simple part of haskell, and it's something that is done all the time. I've certainly done it many times. But for some reason it eluded me for a really long time. The real problem was one of not thinking about this function properly. Yes, all that data is needed to calculate the gradient. But I am not calculating the gradient of all of this stuff as variables. The function, the raw data and the h distance are all fixed points within the context of the problem. That is, there is only one true variable in this application of the gradient function and that is the current position, w. All along I had been thinking about h and f and z as variables and thus I had to keep passing them around. And they are variables, within the context of the whole algorithm, but at the level of the gradient, they aren't. Huh. It took me along time to bend my brain around to seeing this the right way.
Anyway, there it is. Now that school is back in, I get stolen moments on the bus to work on this, so hopefully the rest will come soon.