Sunday, December 22, 2024

LASSO-ISTA-FISTA

This is a follow-up to the previous post, going into the details of the optimization algorithms (in the math sense, i.e. trying to find a minimum of a function by gradient descent) LASSO-ISTA-FISTA that we've modified for realSEUDO. As a reminder, FISTA is a momentum gradient descent on top of ISTA, which is the logic for selecting the step size of the simple gradient descent, on top of LASSO, which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The strength of this bias is adjusted by the parameter lambda.

LASSO minimizes a function like 

||f(x) - y||^2 + lambda*|x|

Here x is a vector (you can think of it as coordinates in an n-dimensional space); in our case of analyzing the brain cell videos it's the vector of activation levels for all the cells and all of SEUDO's gaussian kernels. y is also a vector (of a different size) that represents the brightness of all the pixels in an image (an image can also be thought of as a matrix, but here it doesn't matter, we just string out all the rows into a vector). f(x) is a vector-to-vector function that maps the activation levels to the image in pixels, in our case multiplying the pixel values in the dictionary image of each cell by its activation level and then adding up the brightness of each pixel from all the sources. So f(x) - y is the error between the actual image and our match, and also a vector. Then the operation ||_|| converts the error vector to a single number by taking its euclidean length (AKA norm2) in n-dimensional space (with the n here being the size of vector y). Taking its square lets us skip the square root of the euclidean computation, since both with and without the square root the minimum is at the same value of x. The bias is the lambda*|x| part. |x| is the norm1 of the vector, computed as the sum of absolute values of all the elements, and lambda adjusts the strength of its effect.
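To make the notation concrete, here is a minimal sketch in Python with NumPy, assuming the simplest case where f is linear, f(x) = A @ x, with the columns of A holding the dictionary images strung out as vectors; the names A, y, x, lam and the toy numbers are just for illustration, not from the realSEUDO code.

```python
import numpy as np

def lasso_objective(A, x, y, lam):
    """||f(x) - y||^2 + lambda*|x| for a linear f(x) = A @ x.

    A   -- matrix whose columns are the dictionary images (pixels x sources)
    x   -- vector of activation levels, one per source
    y   -- vector of observed pixel brightness
    lam -- strength of the norm1 bias
    """
    err = A @ x - y                              # f(x) - y, the per-pixel error
    return err @ err + lam * np.sum(np.abs(x))   # squared norm2 + lambda * norm1

# A toy example: 2 "cells" of 4 pixels each, one observed image.
A = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [0.0, 1.0],
              [0.0, 1.0]])
y = np.array([0.9, 1.1, 1.0, 1.0])
x = np.array([1.0, 1.0])
print(lasso_objective(A, x, y, lam=0.1))
```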

In case you wonder what these functions look like, here is a simple example. It's a smooth function with creases added along the axes by the bias function. The effect of the bias is that the gradient in every dimension gets lambda added to it on the positive side and subtracted on the negative side. And since the minimum is at zero gradient, this causes the dimensions with shallow gradients around 0 (which means that they don't matter much) to be pushed to 0.
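As a tiny numeric illustration of that effect (a toy sketch, not realSEUDO code): the gradient of the biased function is the main gradient plus lambda*sign(x), so in a dimension whose main gradient near 0 is smaller than lambda, the bias term dominates and keeps driving that coordinate towards 0.

```python
import numpy as np

def biased_gradient(grad_main, x, lam):
    """Gradient of (main function + lambda*|x|): lambda gets added on the
    positive side of each axis and subtracted on the negative side."""
    return grad_main + lam * np.sign(x)

# A dimension with a shallow main gradient (0.02) vs. lambda = 0.1:
# the bias wins there, and the descent keeps moving that coordinate down to 0.
print(biased_gradient(np.array([0.02, 2.0]), np.array([0.5, 0.5]), lam=0.1))
# -> [0.12 2.1]
```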

The idea of LASSO is to interleave the descent steps: one according to the gradient of the main function, another according to the gradient of the bias function, which still requires computing the gradient of the main function at the new point. This not only doubles the number of the heaviest steps (gradient computations) but also tends to dither around the minimum, taking steps in different directions according to the tug-of-war of the two gradients. Well, this is stupid, so why not just add the two gradients together to start with? This worked very well, but turned out to be not new: someone beat us to it not that many years ago.

By the way, in our case we're only looking for non-negative values, so we just cut off the negative side, looking at only one n-dimensional quadrant, and then the gradient of the bias is simply lambda in each dimension. This makes the computation a tiny bit cheaper, and also a bit easier to reason about, since the creases get cut off and the function becomes smooth in the entire acceptable range.
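Here is what one such combined step could look like under the non-negativity constraint, as a sketch with the same linear f(x) = A @ x assumption as above (the gradient of ||A@x - y||^2 is 2*A.T@(A@x - y)); the step size and the toy data are arbitrary, and this is an illustration of the idea rather than the realSEUDO implementation.

```python
import numpy as np

def combined_step(A, x, y, lam, step):
    """One descent step using the main gradient and the bias gradient
    added together, restricted to non-negative x."""
    grad_main = 2.0 * A.T @ (A @ x - y)   # gradient of ||A@x - y||^2
    grad_bias = lam * np.ones_like(x)     # in the non-negative quadrant, just lambda
    x_new = x - step * (grad_main + grad_bias)
    return np.maximum(x_new, 0.0)         # cut off the negative side

A = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [0.0, 1.0],
              [0.0, 1.0]])
y = np.array([0.9, 1.1, 1.0, 1.0])
x = np.array([0.5, 0.5])
for _ in range(100):
    x = combined_step(A, x, y, lam=0.1, step=0.1)
print(x)
```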

The goal of ISTA is to select the gradient step that is as big as possible but doesn't overshoot the minimum. The safe step size depends on the maximum steepness, i.e. the maximum second derivative of the function that can be encountered while stepping towards the minimum (which basically means between the starting point and the minimum, unless we overshoot), known as the Lipschitz constant L; the step size is its reciprocal, 1/L. This limit comes basically from Newton's method: we want to get to the point where the first derivative is zero, so how far can we step without overstepping that point? 1/L. In fact, if the second derivative is constant, this step would bring us exactly to the minimum.
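A one-dimensional sanity check of that last point: for f(x) = a*x^2 the second derivative is the constant L = 2a, and a single step of size 1/L from any starting point lands exactly at the minimum.

```python
# f(x) = a*x^2, f'(x) = 2*a*x, f''(x) = 2*a = L (constant).
a = 3.0
L = 2.0 * a
x = 5.0                     # arbitrary starting point
grad = 2.0 * a * x
x_next = x - (1.0 / L) * grad
print(x_next)               # 0.0 -- exactly the minimum, in one step
```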

The momentum descent in FISTA makes a good estimation of the basic step less important, but it's still plenty important, since big steps grow the momentum much faster. TFOCS (the Matlab library for gradient descent) does a periodic sampling of the current steepness by looking at the change of the gradient during a step. But TFOCS is a general library, could we do better for our specific kind of function? We've looked into the static estimation of the magnitude of the second derivative, and it worked worse than TFOCS. But among the ways of estimation, one that worked better was to take a separate second derivative in each dimension (i.e. for dimension m, the sum of d²f/(dx_i dx_m) over all i) and use it separately for that dimension. This is something that only an engineer would do and a mathematician wouldn't: try something and see if it works. And this something is "compute and use a separate L in every dimension", so that in stepping the gradient gets distorted according to these partial values of L, slowing down the stepping in the dimensions with a steep second derivative and speeding it up in the dimensions that are nearly flat. Now, as I've come back to it and looked at it again analytically, my first reaction was "WTF, this can't make things better!", but it did. I even went back to the code and tried again, and it still did.
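In code, the per-dimension scaling could look like the sketch below; the static estimate here is my reading of the description above as the row sums of the Hessian of ||A@x - y||^2 (which for a linear f is 2*A.T@A), with absolute values taken to keep the estimate positive, so treat it as an assumption rather than the exact realSEUDO formula.

```python
import numpy as np

def per_dimension_L(A):
    """Static per-dimension estimate of the second derivative for ||A@x - y||^2:
    for dimension m, the sum over i of the mixed second derivatives, i.e. a row
    sum of the Hessian 2*A.T@A (absolute values taken to keep it positive)."""
    H = 2.0 * A.T @ A
    return np.sum(np.abs(H), axis=1)

def per_dimension_step(x, grad, L_dim):
    """Descent step with a separate 1/L per dimension: the gradient gets scaled
    down in the steep dimensions and scaled up in the nearly flat ones."""
    return x - grad / L_dim

A = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [0.0, 1.0],
              [0.0, 1.0]])
print(per_dimension_L(A))   # one L per dimension
```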

The thing is, once you get more than two dimensions, the negative terms with x_i*x_j kick in and the function stops being consistently convex in all dimensions, the gradient doesn't point exactly towards the minimum any more, and the second derivative along the direction of the gradient stops being constant. So the static estimation worked worse than the sampling done by TFOCS, and we switched to the sampling too. But then how about applying the idea of a separate L by dimension to the sampling? As it turns out, it still worked well, although not always. In all the early examples we tried, it consistently worked better than TFOCS, but then we've encountered more examples where it worked a little worse too. It all depends, and maybe there is a way to pick and choose which way to use for a specific case.
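A sketch of what the per-dimension sampling could look like (again an illustration of the idea, not the TFOCS or realSEUDO code): estimate each dimension's L from the change of that dimension's gradient over the last step, and fall back to a shared value in the dimensions that barely moved.

```python
import numpy as np

def sample_L_per_dim(x_prev, x_cur, grad_prev, grad_cur, L_fallback, eps=1e-12):
    """Per-dimension estimate of the second derivative, sampled from the change
    of the gradient over the last step: L_m ~ |change of grad_m| / |change of x_m|."""
    dx = x_cur - x_prev
    dg = grad_cur - grad_prev
    L = np.full_like(x_cur, L_fallback)
    moved = np.abs(dx) > eps                      # dimensions that moved enough to measure
    L[moved] = np.abs(dg[moved]) / np.abs(dx[moved])
    return np.maximum(L, eps)                     # keep strictly positive to use as 1/L
```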

And then the idea that we can treat the dimensions separately came in handy for solving the overshooting problem with the momentum descent, by stopping the momentum in the dimensions that either have the gradient change sign or go out of the acceptable range. This worked very, very well. And it didn't leave much of FISTA as it was: FISTA is essentially a momentum descent with brakes, and its gradually growing brakes got replaced with the sudden stopping by separate dimensions.
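Roughly, the resulting "momentum descent with brakes" could be sketched as below, under the same linear-f assumption and with made-up parameter names (beta, step, iters); the momentum gets zeroed per dimension whenever the gradient flips sign there or the value would leave the non-negative range.

```python
import numpy as np

def momentum_descent(A, y, lam, step, beta=0.9, iters=200):
    """Momentum descent on ||A@x - y||^2 + lam*|x| over non-negative x,
    with the momentum suddenly stopped per dimension instead of a global brake."""
    n = A.shape[1]
    x = np.zeros(n)
    v = np.zeros(n)                               # momentum
    grad_prev = np.zeros(n)
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ x - y) + lam      # bias gradient is just lam here
        flipped = grad * grad_prev < 0.0          # gradient changed sign in this dimension
        v[flipped] = 0.0                          # stop the momentum there
        v = beta * v - step * grad
        x_new = x + v
        out_of_range = x_new < 0.0                # would leave the acceptable range
        v[out_of_range] = 0.0
        x = np.maximum(x_new, 0.0)
        grad_prev = grad
    return x

A = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [0.0, 1.0],
              [0.0, 1.0]])
y = np.array([0.9, 1.1, 1.0, 1.0])
print(momentum_descent(A, y, lam=0.1, step=0.05))
```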

All these ideas can be applied to any momentum descent, including TFOCS and other libraries. Whether they would work well depends on the particular functions being minimized, but that's OK, that's why TFOCS already has multiple options to be used in different situations. Although the fact that the sudden stopping seems to work decently well even with functions as jagged as those in NNs suggests that it's likely to work well in many if not all cases.
