Wednesday, September 28, 2022

optimization 3 - non-linear programming and gradient descent

Then go the "non-linear programming problems" (one more source of the acronym NLP, and with an even more meaningless use of the word "programming"). Here the non-linear function itself can have multiple minimums and maximums, plus there still is the boundary. The boundary, by the way, can now have non-linear inequalities too, and is the more difficult part of the problem. So the problem adds the part of finding the function extremums and deciding if they are in the right direction and inside the boundaries, while the part of finding the lowest (or highest) point on the boundary still remains similar to the linear programming, only is more difficult now. Now there is no easy assumption that we're always going in the same direction downhill along the boundary, so the solution is to find all the interesting points, compare them, and find the global extremum(s). And people have figured out how to find these interesting points on the boundaries, it's called the "KKT conditions" (KKT are the initials of the three authors). It's essentially looking for where the function is going down (for minimization) when nearing the boundary, and its gradient is exactly perpendicular to the boundary. The "exactly perpendicular" gives the lowest point. Although the corners where multiple boundaries intersect represent the whole range of gradients that lies between the gradients of boundary functions intersecting at this point, and the function matching any value from this range indicates a local minimum. But this all requires analytical computations of gradients and second gradients (AKA Hessians), and then analytically solving systems of inequalities on them, and none of this is easy to do.

So here enter the methods like the gradient descent. The idea there is that if the function has no local minimums, we can find the direction in which the function grows the fastest (i.e. the gradient) and go in the opposite direction to find the global minimum (if there are multiple local minimums, we would find one of them, not necessarily the best one).  

Here an interesting question is, how far should we go on each step? If we go too far, we'll overshoot the minimum. If we go not far enough, we'll have to make a lot of steps, computing the gradient on every step, which is slow.  And here is the most interesting part: even if we take the perfect step that gets us to the minimum in the direction of the gradient (which is possible to compute analytically for the quadratic functions using the Newton's method), that generally won't be the global minimum. As it turns out, the gradients in general don't give the direction to the global minimum. It gets us to some "trough" that leads towards the minimum but usually this trough is curved. And moreover, the local minimum in the direction of gradient (i.e. if the original function represents a 3-D shape, cutting in the direction of the gradient would give us a 2-D vertical slice) won't even be at the "bottom of the trough". So then if we take the second perfect step after the first perfect step, it will go at a right angle to the first one (there is a reason and a proof to this) and overshoot the "bottom of the trough" again. So the path becomes a zigzag, with smaller and smaller steps as we near the minimum (and never quite reach it).

How about we take the direction from two steps together and continue there? As it turns out, this still won't necessarily get us to the global minimum either, if the trough curves as it usually does. But we'll end up with a longer step that will be followed by a shorter perpendicular step, ad so on. We can try to correct by adding up again, which will keep the steps in the continuation direction longer again. This is the idea behind the "momentum gradient descent", where we keep the "momentum of inertia" from the previous steps, sort of imitating a ball rolling down a hilly landscape. It still will be making shorter and shorter meaningful steps as we get closer to the minimum (as the ball slows down when it nears the low spot), but at least it auto-adapts by growing the step size as necessary at the start, building up the longer steps out of the too-short fixed ones, so we can skip the complex computations of the Newton method and just err on the short side when selecting the step size. The momentum methods still work better with the basic step size that is closer to optimum (very short basic steps cause it to "circle the drain" around the optimum a lot), but still the difference is not as tremendous as with the plain gradient descent.

No comments:

Post a Comment