Wednesday, September 28, 2022

optimization 3 - non-linear programming and gradient descent

Then come the "non-linear programming" problems (one more source of the acronym NLP, and with an even more meaningless use of the word "programming"). Here the non-linear function itself can have multiple minimums and maximums, plus there still is the boundary. The boundary, by the way, can now be made of non-linear inequalities too, and is the more difficult part of the problem. So the problem adds the part of finding the function's extremums and deciding whether they are of the right kind and inside the boundaries, while the part of finding the lowest (or highest) point on the boundary still remains similar to linear programming, only more difficult now. There is no easy assumption any more that we're always going in the same direction downhill along the boundary, so the solution is to find all the interesting points, compare them, and pick the global extremum(s). And people have figured out how to find these interesting points on the boundaries, via what is called the "KKT conditions" (KKT being the initials of the three authors: Karush, Kuhn, and Tucker). Essentially, they look for the points where the function is going down (for minimization) when nearing the boundary, and its gradient is exactly perpendicular to the boundary. The "exactly perpendicular" gives the lowest point. At the corners where multiple boundaries intersect, there is a whole range of directions lying between the gradients of the boundary functions intersecting at this point, and the function's gradient falling anywhere into this range indicates a local minimum. But this all requires analytical computation of gradients and second derivatives (collected into the Hessians), and then analytically solving systems of inequalities on them, and none of this is easy to do.
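To make this a bit more concrete, here is a minimal sketch of grinding through the KKT conditions symbolically with sympy, on a toy example of my own (one function, one inequality constraint g <= 0):

  import sympy as sp

  x, y, mu = sp.symbols('x y mu', real=True)
  f = (x - 2)**2 + (y - 2)**2   # function to minimize
  g = x + y - 2                 # constraint: g <= 0

  # KKT for one constraint: stationarity grad(f) + mu*grad(g) = 0 and
  # complementary slackness mu*g = 0; then check dual feasibility mu >= 0
  # and primal feasibility g <= 0 on each candidate point
  L = f + mu * g
  eqs = [sp.diff(L, x), sp.diff(L, y), mu * g]
  for s in sp.solve(eqs, [x, y, mu], dict=True):
      if s[mu] >= 0 and g.subs(s) <= 0:
          print(s, ' f =', f.subs(s))   # prints x=1, y=1, mu=2, f=2

The unconstrained minimum (2, 2) fails the feasibility check, and the surviving point (1, 1) is exactly where the gradient of the function is perpendicular to the boundary line x + y = 2.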

So here enter the methods like gradient descent. The idea there is to find the direction in which the function grows the fastest (i.e. the gradient) and go in the opposite direction; if the function has no local minimums other than the global one, this leads us to the global minimum (if there are multiple local minimums, we would find one of them, not necessarily the best one).
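In code, the basic version is just a short loop. A minimal sketch (the example function and the fixed step size here are made up by me):

  import numpy as np

  def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10000):
      # keep stepping against the gradient until it (nearly) vanishes
      x = np.asarray(x0, dtype=float)
      for _ in range(max_iter):
          g = grad(x)
          if np.linalg.norm(g) < tol:
              break
          x = x - step * g
      return x

  # example: minimize f(x, y) = (x - 3)^2 + 2 * (y + 1)^2
  grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
  print(gradient_descent(grad_f, [0.0, 0.0]))   # converges to about (3, -1)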

Here an interesting question is, how far should we go on each step? If we go too far, we'll overshoot the minimum. If we don't go far enough, we'll have to make a lot of steps, computing the gradient on every step, which is slow. And here is the most interesting part: even if we take the perfect step that gets us to the minimum in the direction of the gradient (which is possible to compute analytically for quadratic functions using Newton's method), that generally won't be the global minimum. As it turns out, the gradients in general don't give the direction to the global minimum. The step gets us to some "trough" that leads towards the minimum, but usually this trough is curved. And moreover, the local minimum in the direction of the gradient (i.e. if the original function represents a 3-D shape, cutting it in the direction of the gradient would give us a 2-D vertical slice) won't even be at the "bottom of the trough". So if we then take the second perfect step after the first perfect step, it will go at a right angle to the first one (there is a reason and a proof for this) and overshoot the "bottom of the trough" again. So the path becomes a zigzag, with smaller and smaller steps as we near the minimum (and never quite reach it).
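The zigzag is easy to see numerically. Here is a small demo (the matrix is made up) of exact line search on the quadratic bowl f(x) = 0.5 * x^T * A * x, where the perfect step length along the negative gradient has a closed form; the dot products of consecutive gradients come out as (numerically) zero, confirming the right angles:

  import numpy as np

  A = np.array([[1.0, 0.0],
                [0.0, 10.0]])   # an elongated bowl: f(x) = 0.5 * x^T A x
  x = np.array([10.0, 1.0])

  prev_g = None
  for i in range(6):
      g = A @ x                         # gradient of the quadratic
      alpha = (g @ g) / (g @ (A @ g))   # the exact minimizer along -g
      x = x - alpha * g
      if prev_g is not None:
          # consecutive gradients (and so consecutive steps) are perpendicular
          print(i, 'dot of consecutive gradients:', round(prev_g @ g, 12))
      prev_g = g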

How about we take the combined direction of two steps together and continue there? As it turns out, this still won't necessarily get us to the global minimum either, if the trough curves, as it usually does. But we'll end up with a longer step that will be followed by a shorter perpendicular step, and so on. We can try to correct by adding up again, which will keep the steps in the continuation direction longer. This is the idea behind the "momentum gradient descent", where we keep the "momentum of inertia" from the previous steps, sort of imitating a ball rolling down a hilly landscape. It will still be making shorter and shorter meaningful steps as we get closer to the minimum (as the ball slows down when it nears the low spot), but at least it auto-adapts by growing the step size as necessary at the start, building up the longer steps out of the too-short fixed ones, so we can skip the complex computations of the Newton method and just err on the short side when selecting the basic step size. The momentum methods still work better with a basic step size that is closer to the optimum (very short basic steps cause them to "circle the drain" around the optimum a lot), but the difference is not as tremendous as with the plain gradient descent.
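A minimal sketch of that idea (the classic "heavy ball" flavor; the step size and the decay constant here are arbitrary):

  import numpy as np

  def momentum_descent(grad, x0, step=0.01, beta=0.9, max_iter=2000):
      # the velocity accumulates the past gradients with a decay factor beta,
      # imitating the inertia of a rolling ball
      x = np.asarray(x0, dtype=float)
      v = np.zeros_like(x)
      for _ in range(max_iter):
          v = beta * v - step * grad(x)   # old inertia plus the new push
          x = x + v
      return x

  grad_f = lambda z: np.array([2 * (z[0] - 3), 4 * (z[1] + 1)])
  print(momentum_descent(grad_f, [0.0, 0.0]))  # converges to about (3, -1)

Note that the per-iteration step is 10 times shorter than in the plain example above, yet the built-up velocity compensates for it.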

optimization 2 - solving the linear programming

The classic way of solving the linear programming problems is the Simplex Algorithm. It basically starts with an arbitrary corner (a tricky part is to find any starting corner) and then goes ("pivots") from there to the next neighboring corner that is lower, and so on, until it finds the lowest corner. Only it's all done as computations on vectors and matrices. It also turns out that once we get more than 2 dimensions, there can be multiple corners at the same non-optimal level, while the whole shape still is convex. So it may take the algorithm some time to sort through these equal corners before finding one that has a connection to a corner going further down, or it may even get stuck going in circles between the same corners (but it will know that it's not at an optimum yet). For some reason they didn't talk about the obvious solution for going in circles: remember which corners have already been tried at the current "plateau" and don't try them a second time, try the different ones (and yes, you can see which corners are part of the plateau, they will have the key metric equal to 0).
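Not that anyone has to implement the pivoting by hand these days; for instance, scipy's linprog (which nowadays uses the HiGHS solvers under the hood) does it all. A small sketch with made-up numbers, maximizing 3*x_1 + 5*x_2 under three linear conditions:

  from scipy.optimize import linprog

  # linprog minimizes, so negate the coefficients to maximize 3*x1 + 5*x2
  c = [-3, -5]
  A_ub = [[1, 0],    #     x1        <= 4
          [0, 2],    #          2*x2 <= 12
          [3, 2]]    # 3*x1 + 2*x2   <= 18
  b_ub = [4, 12, 18]
  res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
  print(res.x, -res.fun)   # the optimal corner (2, 6) with the value 36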

And as it turns out, each LP problem has a related "dual" problem that it can be mechanically converted to (and it's symmetric: the second conversion, the "dual of the dual", gives back the original "primal" problem). Mathematically, the dual problem is produced essentially by transposing the matrix that represents the set of inequalities, "turning it sideways", so that each variable in the primal problem is converted to an inequality in the dual one, and each inequality in the primal problem to a variable in the dual one. Logically, the relation between the problems is like this: a typical (um, "standard form" if I remember right) LP problem can also be represented as a graph of pipelines (or trade routes) of certain capacity, with the question of how to maximize the flow between two given points by using all the possible intermediate paths to the maximum of capacity - how much do we send through each pipe? The dual problem for it would be: if we want to cut the connection between these two points completely by blocking the pipes with the least possible total capacity, which pipes do we block?
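In matrix notation the pairing looks like this (in one common convention; the exact shape depends on the form the primal is written in):

  primal:  minimize c^T * x  subject to  A * x >= b,   x >= 0
  dual:    maximize b^T * y  subject to  A^T * y <= c, y >= 0

Each row of A (a primal inequality) turns into a dual variable y_i, and each column of A (a primal variable) turns into a dual inequality.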

Both sides of the problem are about finding the bottleneck, the primal about using it to the max, the dual about blocking it, so finding the bottleneck in one way gives you the solution for the other way too. And it turns out that the dual version may be easier to solve: the major difficulty for the Simplex Algorithm is finding any valid corner as a starting point, and the conversion from primal to dual provides this starting point for the dual problem. Hence the Dual Simplex Algorithm.

Saturday, September 24, 2022

optimization 1 - linear programming

This year my daughter was taking a college class on optimization, so I've had a look too. No, it's not about software optimization, it's about optimization as a math discipline. And I've learned a good deal of new things that I want to share (and also write down as a summary for myself, before I forget too much of it). The reason why you'd want to care about it is that it has an application to neural networks: the gradient descent algorithm used for their training is one of the optimization methods, and there are interesting consequences to think about. I'll get to that part eventually (it will take multiple posts), but first things first.

They've started with the "linear programming" problems. I've been thinking "oh, I know this stuff, it's all recursion + memoization". But no, it turns out that they use this name for something completely different. In programming we use this name for the discrete problems where we need to find a best-fitting combination of discrete units. Not much programming there, but at least some, in all that recursive trying. What they do in optimization is call "linear programming" the analog (continuous) problems where all the expressions are linear.

So you have a number of variables - call them x, y, z, ..., or x_1, x_2, x_3, ..., or a vector x (this subject is rife with vectors and matrices). And you have a function on them that you want to maximize or minimize. A linear function (that's the source of "linear" in "linear programming") would be:

  c_1 * x_1 + c_2 * x_2 + c_3 * x_3 + ...

Well, the solution would be easy: to minimize it, send all the values of x to minus infinity (or if some c < 0, then those x to plus infinity), right? But the problems also come with conditions on x, given as inequalities, which in the "linear programming" are also linear expressions. A simple example would be

  x_1 >= 0

or a more complex example

  3 * x_1 - 5 * x_2 <= 5

So instead of going to infinity, we'll hit the boundaries of conditions, and the problem is pretty much about which boundaries will be most restrictive.
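A tiny concrete instance (the numbers are mine, purely for illustration):

  minimize    x_1 + x_2
  subject to  x_1 >= 1
              x_2 >= 2

Both cost coefficients are positive, so both variables get pushed down until they hit their boundaries, and the minimum sits at the corner (x_1, x_2) = (1, 2), with the value 3.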

In visual terms, the conditions form a polyhedral shape (which might be enclosed or open on some side). In a 2-dimensional form (with only x_1 and x_2) we can represent it as a 2-dimensional fence, and then the function will be some inclined plane. If we build a model of this out of plywood and let a ball roll on the plywood sheet that represents the inclined plane of the function, the ball would stop against the fence at the lowest point and solve the minimization problem (BTW, the minimization and maximization problems can be turned into each other by simply changing the signs on everything, including turning around the "less" and "greater" signs on the boundaries). There would always be a corner (or maybe a whole edge) that is at the lowest level. Or if the fence happens to be open at the lowest point, the ball would roll out to "minus infinity" - that's called an "unbounded problem". It could also be that the conditions contradict each other, and when we try to build the fence, we can't do it - that's a "problem with no solution" (an "infeasible" one). But since each condition represents an infinitely long straight line, the fenced area cut off by all these lines is guaranteed to be convex (a "convex polyhedron"), with no strangely-shaped nooks and crannies.

In the full multi-dimensional form the conditions represent hyperplanes in that multi-dimensional space that cut and split the space and form this multi-dimensional "fence". This fence would still be a convex polyhedron, and the solution would be at the point of the fence that is the lowest along the function's dimension (the function adds one more dimension, just like the inclined plywood plane did in the 2-dimensional case).

For solving this, they have two slightly different "standard" forms of presenting these equations: one is called the "standard form" and the other the "canonical form". I guess two different people approached these problems differently, so now there are forms for both approaches, but the problems can be converted easily between these forms.
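If I've got the naming right (textbooks disagree on which name goes with which, so treat the labels as an assumption), the two look roughly like this:

  standard form:   minimize c^T * x  subject to  A * x = b,   x >= 0
  canonical form:  minimize c^T * x  subject to  A * x >= b,  x >= 0

The conversion between them is mechanical: an inequality turns into an equality by adding a non-negative "slack" variable, and an equality turns into a pair of opposite inequalities.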