In the last post I talked about how the computation with the V-shaped ("corner") activation function sometimes diverges badly. It turns out that the activation function as such is not the reason.
Another change I had made at the same time was to remove the limit on the weight values (or practically speaking, to change it from +-1 to +-1e10). That was the reason for the divergence. Set the limit back to 1, and the divergence disappears. On the surface, that limit of 1 was arbitrary, so if we want to model arbitrary functions, it made sense to remove it (and in particular, to do a XOR in a single layer, I needed the weights equal to 2). But unfortunately there is a connection between the weights (i.e. multipliers) and the rate of descent that can be used safely. Here is how it works.
There is a way to evaluate what rate of descent is safe; it's used in the ISTA algorithm and is based on the Lipschitz constant L. The safe rate of descent (one that doesn't cause us to overshoot the minimum and diverge) is 1/L. L is defined as follows:
L = max( || grad_f(x) - grad_f(y) || / || x - y || )
What is what in this expression:
x and y are two arbitrary points in the multidimensional space of weights being optimized. I.e. each one of them is a vector (i.e. an array in programming terms) of all the weights.
grad_f() is the gradient (i.e. the multi-dimensional derivative) of the function being optimized, computed at the argument values x and y respectively. Note that even though the function produces a single value, the gradient is an array of the same size as the function's argument (because it contains a partial derivative for each input dimension).
The double-vertical-line "parentheses" mean the norm operation, more exactly the second norm. It's something that I either hadn't seen or didn't remember from college, but the norm operation is widely used in optimization. Its meaning is the length of the vector in the physical sense. There is this difference between math (and programming) and physics: in math (and programming) the length of a vector is the number of dimensions in it, while in physics it's the length of the arrow that goes from the origin to the point whose coordinates in the multidimensional space are given by the values of the vector. So if you wanted to talk about this "physical length" in math, you'd get confused between the two lengths, and thus they use a different word, "norm", for it.
There is also more than one kind of norm. In general they are written with an index, like ||x||_2, but if the index is not specified, 2 is assumed. The first norm, which can also be written as |x|, represents the distance as it's walked "in steps" along the coordinate axes. Yes, it's no accident that it's written in the same way as the absolute value: for a single dimension, the absolute value is the distance we walk from the origin to the value on a single axis, so norm 1 is the generalization of the absolute value to a vector. For another example, if we have a vector [4, -3], then we'd walk 4 steps along one dimension and 3 steps along another dimension, and norm 1 will be 7. It can be written as:
|x| = sum( abs(x[i]) )
The second norm uses squares and represents the distance in Euclidean space:
||x|| = sqrt( sum( x[i]^2 ) )
So for a vector [4, -3], norm 2 will be sqrt(4^2 + (-3)^2) = 5. Apparently there are higher norms too but I didn't look at them.
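To make the two norms concrete, here is a small Python sketch (with NumPy assumed to be available) that computes both for the example vector [4, -3]:

import numpy as np

v = np.array([4.0, -3.0])

# Norm 1: the sum of absolute values, the distance "walked in steps".
norm1 = np.sum(np.abs(v))        # 7.0
# Norm 2: the Euclidean length of the arrow from the origin.
norm2 = np.sqrt(np.sum(v ** 2))  # 5.0

# NumPy has both built in; ord=2 is the default.
assert norm1 == np.linalg.norm(v, ord=1)
assert norm2 == np.linalg.norm(v)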
Returning to the formula for L, with all the parts parsed, its meaning is: "if we do a step from any point x to any other point y, what will be the maximum change of the gradient between these points, relative to the length of the step taken?" This is important for gradient descent because there we take the steps proportional to the gradient. So we make a step that is based on the gradient, and after the step we get another gradient. And we don't want to overstep the minimum, which means we don't want to go beyond the point where the gradient is 0. So if L is large, meaning that a small step can cause a large change of the gradient, we should be taking only small steps; and if L is small, meaning that nothing much changes until a large step, we can take large steps. Hence the safe rate of descent is 1/L.
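Here is a one-dimensional illustration of this logic in Python (a toy example of my own, with arbitrary numbers): for f(x) = A*x^2 the gradient is 2*A*x, so L = 2*A. The rate 1/L lands exactly on the minimum, while a rate above 2/L overshoots by more than it corrects:

A = 1000.0  # f(x) = A * x^2, so grad_f(x) = 2*A*x and L = 2*A

def descend(rate, x=1.0, steps=10):
    # Plain gradient descent on f(x) = A * x^2.
    for _ in range(steps):
        x -= rate * 2.0 * A * x
    return x

L = 2.0 * A
print(descend(1.0 / L))        # 0.0: the safe rate lands on the minimum
print(descend(1.1 * 2.0 / L))  # about 6.2 and growing: past 2/L it diverges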
For the quadratic polynomial functions, L can be computed as a constant. The expression for L divides the difference in gradient by the difference in x, which is essentially the second derivative, and that is constant for the quadratic polynomial functions. But it's a derivative computed from the worst possible point in the worst possible direction, so this constant may underestimate the allowed steps. Consider a simple function
f = 1000 * x[1]^2 + x[2]^2
Here the dimension x[1] with its high coefficient of 1000 is what will drive the constant L to a high value (here L = 2000, twice the coefficient). This value of 1/L will be so good that it will really get us to the optimal value of x[1] in a single step, essentially applying Newton's method! (This is possible for the simple functions that don't have dependencies between variables.) But then we'll have to take on the order of a thousand steps of this size in the dimension x[2] to get near its optimum. Not great, but at least safe. There is a Matlab package for optimization called TFOCS, and what it does is measure the actual L from the recent steps: we've changed x[] by so much, and the gradient has changed by so much, so we can divide one by the other and assume that this is a reasonable estimation of L for the part of the function where we're traveling. And hopefully we're traveling towards the optimum, where the gradients become smaller, so we can assume that L should be getting smaller too. So in this example, after TFOCS gets the dimension x[1] right, it will detect that in the following steps L becomes smaller, and it can take larger steps.
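Here is a sketch of that idea in Python, on the function above (a toy re-creation of the principle, not the actual TFOCS code):

import numpy as np

def grad(x):
    # Gradient of f = 1000*x[0]^2 + x[1]^2 (0-based indexing in the code).
    return np.array([2000.0 * x[0], 2.0 * x[1]])

x = np.array([1.0, 1.0])
g = grad(x)
rate = 1.0 / 2000.0  # start from the worst-case 1/L

for step in range(3):
    x_new = x - rate * g
    g_new = grad(x_new)
    # Estimate the local L from the actual last step, then use 1/L as the rate.
    L_est = np.linalg.norm(g_new - g) / np.linalg.norm(x_new - x)
    rate = 1.0 / L_est
    x, g = x_new, g_new
    print(step, x, L_est)

After the first step solves the steep dimension, the measured L drops from 2000 to 2, the rate grows from 1/2000 to 1/2, and the shallow dimension gets finished in one more step.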
We can do this for the more general functions too, where the second derivative is not constant but depends on X, with a similar assumption that we're traveling towards the areas with smaller gradients, but this is kind of difficult in our case.
We can write out the loss function to minimize in the simplest case (for one neuron of the last layer) as
f = (x[1] * x[2] + x[3])^2
Here x[1] is one input, x[2] is one weight, and x[3] represents the whole rest of the neuron bunched together: the sum of the other products of inputs and weights, minus the true result from the training case. Since I'd like to avoid dragging around the indexes, I'll rename the variables: x[1] will become x, x[2] will become y, and x[3] will become b:
f = (x*y + b)^2
Then the dimensions of its gradient will be:
df/dx = 2*y^2*x + 2*b*y = 2y(xy + b)
df/dy = 2x(xy + b) [derived similarly]
df/db = 2*x*y + 2*b = 2(xy + b)
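These derivatives are easy to sanity-check numerically with finite differences; a quick Python sketch (the point values are arbitrary):

import numpy as np

def f(x, y, b):
    return (x * y + b) ** 2

def grad_f(x, y, b):
    # The gradient derived above.
    c = x * y + b
    return np.array([2 * y * c, 2 * x * c, 2 * c])

# Compare against central finite differences at an arbitrary point.
p = np.array([0.7, -0.4, 0.3])
eps = 1e-6
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    numeric = (f(*(p + d)) - f(*(p - d))) / (2 * eps)
    print(numeric, grad_f(*p)[i])  # the two columns should match closely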
Yes, b counts here as a variable too, cumulative of its internal parts. Let's do the initial computations on the square of L; that way we can use the squares of the norms and not mess with the roots for now. Here let's use the indexes to identify the points, so (x[0], y[0], b[0]) will be one point and (x[1], y[1], b[1]) will be the other point. Then the squared norm 2 of the difference between the points will be:
|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2 + (y[1] - y[0])^2 + (b[1] - b[0])^2
And the squared norm 2 of the difference between the gradients will be:
|| grad_f[1] - grad_f[0] ||^2 =
= (df/dx[1] - df/dx[0])^2 + (df/dy[1] - df/dy[0])^2 + (df/db[1] - df/db[0])^2 =
= (2 y[1] (x[1]y[1] + b[1]) - 2 y[0] (x[0]y[0] + b[0]) )^2 +
+ (2 x[1] (x[1]y[1] + b[1]) - 2 x[0] (x[0]y[0] + b[0]) )^2 +
+ (2 (x[1]y[1] + b[1]) - 2 (x[0]y[0] + b[0]) )^2
Finding or even estimating the whole maximum is so far too complicated for me, but as the first approximation let's consider what happens if the point moves in one direction only. For example, it moves by x while y[1]=y[0] and b[1]=b[0]. Then the expressions become simpler (y and b can be written without indexes, since they stay the same):
|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2
|| grad_f[1] - grad_f[0] ||^2 =
= (2 y^2 (x[1] - x[0]))^2 +
+ (2 (x[1]^2 - x[0]^2) y + 2 (x[1] - x[0]) b )^2 +
+ ( 2 (x[1] - x[0]) y )^2 =
= (2 y^2 (x[1] - x[0]))^2 +
+ (2 (x[1]^2 - x[0]^2) y )^2 + 8 y b (x[1]^2 - x[0]^2) (x[1] - x[0]) + ( 2 (x[1] - x[0]) b )^2 +
+ ( 2 (x[1] - x[0]) y )^2
When we divide the squared norm of the gradient difference by the squared norm of the point difference, using the factoring (x[1]^2 - x[0]^2) = (x[1] + x[0])(x[1] - x[0]), we get:
L^2 = 4 y^4 + 4 y^2 (x[1] + x[0])^2 +
 + 8 y b (x[1] + x[0]) +
 + 4 b^2 + 4 y^2
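This algebra is mechanical enough to double-check with a symbolic tool; a quick sketch using Python's sympy (assuming it's available), taking the gradient difference for the movement along x and dividing by the step:

import sympy as sp

x0, x1, y, b = sp.symbols('x0 x1 y b')

# Squared norm of the gradient difference when only x moves.
num = ((2 * y**2 * (x1 - x0))**2
       + (2 * (x1**2 - x0**2) * y + 2 * (x1 - x0) * b)**2
       + (2 * (x1 - x0) * y)**2)
# Divide by the squared norm of the point difference and simplify.
L2 = sp.expand(sp.cancel(num / (x1 - x0)**2))
print(L2)
# The printed polynomial matches the formula above:
# 4*y**4 + 4*y**2*(x0 + x1)**2 + 8*b*y*(x0 + x1) + 4*b**2 + 4*y**2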
And there is a similar derivation for the other dimensions. This expression works out not so bad for |x| <= 1 and |y| <= 1. In this situation we're guaranteed |x[1] + x[0]| <= 2, and all of the 2nd and 4th powers of y are <= 1, so the result is limited to
L^2 <= 4 + 16 + 16 abs(b) + 4 b^2 + 4
and so
L <= 2 * sqrt(b^2 + 4 abs(b) + 6)
That's only one dimension for the calculation of L, by x, but the other dimension by y is similar. For the dimension b, all three components of the gradient change proportionally to (b[1] - b[0]), with the multipliers 2y, 2x, and 2, so the formula reduces to an even shorter one:
L^2 = 4 x^2 + 4 y^2 + 4 <= 12
L <= 2 * sqrt(3)
So if we combine the dimensions by adding up their squared L values (I'm not sure if this is the right thing to do but I guess it's close enough) the full L will be
L^2 <= 2 * (4 b^2 + 16 abs(b) + 24) + 12
L <= 2 * sqrt(2 b^2 + 8 abs(b) + 15)
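This bound is easy to probe numerically: sample random pairs of points with |x| <= 1 and |y| <= 1, measure the actual ratio from the definition of L, and compare with the formula (a sanity sketch with an arbitrarily assumed range of b, not a proof):

import numpy as np

def grad_f(p):
    x, y, b = p
    c = x * y + b
    return np.array([2 * y * c, 2 * x * c, 2 * c])

rng = np.random.default_rng(0)
B = 2.0  # the assumed range of |b|, arbitrary for this experiment
worst = 0.0
for _ in range(100000):
    p0 = rng.uniform([-1.0, -1.0, -B], [1.0, 1.0, B])
    p1 = rng.uniform([-1.0, -1.0, -B], [1.0, 1.0, B])
    ratio = np.linalg.norm(grad_f(p1) - grad_f(p0)) / np.linalg.norm(p1 - p0)
    worst = max(worst, ratio)

bound = 2.0 * np.sqrt(2.0 * B**2 + 8.0 * B + 15.0)
print(worst, bound)  # the measured maximum stays under the bound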
This explains why in my small examples with very few neurons the training rate of 0.1 per round (divided by the number of training cases in a round) generally works well: with |b| also staying within 1, the bound above gives L <= 2 * sqrt(25) = 10, i.e. a safe rate of 0.1.
But if the weights are allowed to be over 1, then all of a sudden we get squares of large values, and L grows quite large, and 1/L becomes very small.
So I've limited the weights to +-1, and lo and behold, the examples that used to diverge badly now converge very nicely, and produce a decent approximation of a parabola with an error value quite close to a manually-built approximation.
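In code the limiting itself is a one-liner; a minimal sketch of a descent step with clipping (simplified, not the real training loop):

import numpy as np

def sgd_step(weights, grads, rate=0.1, limit=1.0):
    # One gradient-descent step with the weight values clipped to
    # [-limit, limit], which by the estimate above keeps L bounded.
    return np.clip(weights - rate * grads, -limit, limit)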
I guess this means that much of the training speed can be gained with the right dynamic estimation of L. And I get the feeling that when we propagate the gradients backwards through the layers and multiply them by the coefficients smaller than 1, not only do the gradients diminish but the L value should also be shrinking, and so the learning rate of 1/L should be growing. But working this out will take more time and experimentation.