Monday, October 3, 2022

optimization 4 - backpropagation for the last layer

As you might have guessed already from the discussion of gradient descent, there is a direct relation between non-linear optimization and backpropagation. Previously I've pointed to https://www.researchgate.net/publication/266396438_A_Gentle_Introduction_to_Backpropagation as a good intro reading. Having the optimization background and looking at the neural network training problem as an optimization problem explains where all this logic comes from. Here is the explanation:

First, about the variable names. Traditionally, the linear regression stage in a neuron is described as:

output = sum(x[i] * w[i])

where I'll use the programming-like square-bracket indexes instead of the math-like subscripts; x is a vector of inputs and w is a vector of weights. By the way, what we call in programming the indexes or elements of a vector, in mathematics they call dimensions, sometimes not differentiating between the index and the element, and using the word "dimension" for both. I'll use this word sometimes too.

The thing though is that when doing the training, in this naming x is a constant and w is the variable that we're trying to optimize to achieve the training. Mathematically, it doesn't matter which letter is used for what, but I personally am used to x being the variable, so seeing it as a constant confuses me and makes me commit more errors. So I'll use a different assignment of the names:

output = sum(v[i] * x[i])

where v is a vector of inputs and x is a vector of weights.
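
To have something concrete to poke at, here is a minimal sketch of this linear stage in C++ (the function name and the types here are made up for illustration, they're not from FloatNeuralNet):

    #include <vector>

    // The linear regression stage of one neuron: v is the vector of
    // inputs, x is the vector of weights.
    double linearStage(const std::vector<double> &v, const std::vector<double> &x)
    {
        double output = 0.;
        for (size_t i = 0; i < v.size(); i++)
            output += v[i] * x[i];
        return output;
    }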

And let's ignore for the moment the non-linear activation function that follows the linear regression at the output side of the neuron. My experience shows that you can't use the ReLU function at the output of the last layer anyway, because the negative side won't backpropagate properly, and the Gentle Introduction also says that the activation function is not used after the last layer (and before this post is over, you'll see why). Thus let's start the consideration from the last layer that doesn't have this extra complexity.

We want to train the neuron to minimize the squared error on all the examples. I.e., if we have a set of N training examples, such that the j-th example is (v[j], b[j]), where v[j] is the vector of inputs and b[j] is the expected output, then we want to minimize

min(f(x)),
f(x) = sum{1 <= j <= N}( (output(v[j], x) - b[j])^2 )

and if we expand the output() function and assume that the neuron has K inputs, that expression becomes:

f(x) = sum{1 <= j <= N}( (sum{1 <= i <= K}(v[j][i]*x[i]) - b[j])^2 )
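
Or as a C++ sketch, reusing linearStage() from the made-up example above:

    // The squared error over the whole training set, following the
    // formula above: v[j] is the input vector of the j-th example,
    // b[j] is its expected output, x is the weights.
    double totalError(const std::vector<std::vector<double>> &v,
        const std::vector<double> &b, const std::vector<double> &x)
    {
        double f = 0.;
        for (size_t j = 0; j < v.size(); j++) {
            double err = linearStage(v[j], x) - b[j];
            f += err * err;
        }
        return f;
    }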

If we ignore any boundaries, minimization requires just finding the values of x where the function is minimal, and since the function is quadratic, its shape will be a multidimensional paraboloid, with exactly one minimum that is also the global minimum (assuming that the training data doesn't make the system degenerate). At this minimum the gradient (i.e. the multi-dimensional derivative) will be equal to 0, so it's even possible to find the minimum analytically: each dimension of the gradient will be linear (because the original function is quadratic, and a derivative loses one power), so setting it to 0 will produce a system of linear equations that can be solved.

Each dimension k of the gradient (the proper name for it is the "partial derivative of f() with respect to x[k]") will be:

df/dx[k] = sum{1 <= j <= N}( 2 * v[j][k] * (sum{1 <= i <= K}(v[j][i]*x[i]) - b[j] )) 

where j is the index of the training example, i is the index of the input of the neuron, and k is the index of the gradient's dimension (which has the same range 1 <= k <= K as the index of inputs).
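
In the same made-up C++ sketch style, the whole gradient vector over the training set would be computed like this:

    // The gradient of totalError() by the weights x, one dimension
    // per weight, following the formula above.
    std::vector<double> fullGradient(const std::vector<std::vector<double>> &v,
        const std::vector<double> &b, const std::vector<double> &x)
    {
        std::vector<double> grad(x.size(), 0.);
        for (size_t j = 0; j < v.size(); j++) {
            // the inner sum minus b[j]
            double err = linearStage(v[j], x) - b[j];
            for (size_t k = 0; k < x.size(); k++)
                grad[k] += 2. * v[j][k] * err;
        }
        return grad;
    }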

An exact solution of a system of linear equations might not be so bad complexity-wise for a small number of inputs, but it gets much worse for a large number of inputs. In that case it can be approximated by a gradient descent method, and with the momentum methods it can be not so bad, say 100 steps of descent, each of quadratic time complexity, to get a decent approximation. The trouble though is that this gives the perfect solution for only one neuron of the last layer, and then we'll do the backpropagation to the previous layers, and the inputs v will shift, and we'll need to do the same thing again. And also while we compute it, we'll need to remember the outputs of all the layers for all the examples to do the backpropagation. So when looking at the whole training problem, the complexity balloons very fast.
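
For illustration, here is a naive sketch of that exact solution: build the linear system by setting each dimension of the gradient to 0 (the common factor 2 drops out), then solve it by Gaussian elimination. It has no pivoting and no handling of degenerate systems, so take it as a sketch of the idea rather than usable code:

    // Solve for the weights x that zero out the gradient, i.e. for each k:
    // sum{i}( sum{j}(v[j][k]*v[j][i]) * x[i] ) = sum{j}(v[j][k]*b[j])
    std::vector<double> solveExactly(const std::vector<std::vector<double>> &v,
        const std::vector<double> &b)
    {
        size_t n = v.size(), kk = v[0].size();
        // each row of the matrix, with the right-hand side appended in column kk
        std::vector<std::vector<double>> a(kk, std::vector<double>(kk + 1, 0.));
        for (size_t r = 0; r < kk; r++) {
            for (size_t c = 0; c < kk; c++)
                for (size_t j = 0; j < n; j++)
                    a[r][c] += v[j][r] * v[j][c];
            for (size_t j = 0; j < n; j++)
                a[r][kk] += v[j][r] * b[j];
        }
        // forward elimination
        for (size_t r = 0; r < kk; r++)
            for (size_t r2 = r + 1; r2 < kk; r2++) {
                double m = a[r2][r] / a[r][r];
                for (size_t c = r; c <= kk; c++)
                    a[r2][c] -= m * a[r][c];
            }
        // back-substitution
        std::vector<double> x(kk);
        for (size_t r = kk; r-- > 0; ) {
            double s = a[r][kk];
            for (size_t c = r + 1; c < kk; c++)
                s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }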

Instead people have come up with an inexact approximation. They said, let's do one training case at a time; it will give us some gradient step, the next case will give some other gradient step, and after all the cases are done, we'll get somewhere reasonably close to where we'd have been with the perfect solution. And a small difference won't matter much because the inputs v will be shifting on the next step anyway. And we'll need to keep much less state at a time, just the outputs of all the layers for a single case, so every which way it's much cheaper. Because of these arbitrary wanderings in the middle of each training pass, they call it the stochastic gradient descent.

So OK, since we're looking at only one training case now, let's drop the index j, and the functions become:

f(x) = (sum{1 <= i <= K}(v[i]*x[i]) - b)^2 

df/dx[k] = 2 * v[k] * (sum{1 <= i <= K}(v[i]*x[i]) - b)

I personally find concrete examples easier to understand and compute, so here is how I came up with the formulas above: I started with an example of a neuron with 3 inputs (i.e. K=3). Then the specific formula will be:

f(x) = (v[1]*x[1] + v[2]*x[2] + v[3]*x[3] - b)^2

To compute a dimension of the gradient, we'd need to expand this square, but we can note that in taking the derivative each term that doesn't contain x[k] will disappear, so we only need to bother with the terms that contain at least one x[k]. So for example for k=1 we can use the function approximation

f(x) = v[1]^2*x[1]^2 + 2*v[2]*x[2]*v[1]*x[1] + 2*v[3]*x[3]*v[1]*x[1] - 2*b*v[1]*x[1] + (...)

df/dx[1] = 2*v[1]^2*x[1] + 2*v[2]*x[2]*v[1] + 2*v[3]*x[3]*v[1] - 2*b*v[1] =
= 2*v[1] * (v[1]*x[1] + v[2]*x[2] + v[3]*x[3] - b)

And the same symmetrically for the other dimensions of x.
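
To double-check this derivation, here is a small standalone example that compares the analytic derivative with a numeric finite-difference estimate at an arbitrary made-up point (both printed values should come out about equal):

    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<double> v = {1.5, -2., 0.5}, x = {0.3, 0.7, -0.2};
        double b = 1.;
        auto f = [&](const std::vector<double> &xx) {
            double err = v[0]*xx[0] + v[1]*xx[1] + v[2]*xx[2] - b;
            return err * err;
        };
        // 2*v[1] * (v[1]*x[1] + v[2]*x[2] + v[3]*x[3] - b), with the
        // indexes shifted to start from 0 as usual in C++
        double analytic = 2. * v[0] * (v[0]*x[0] + v[1]*x[1] + v[2]*x[2] - b);
        double h = 1e-6;
        std::vector<double> xp = x;
        xp[0] += h; // nudge x[0] a little and see how much f() moves
        printf("analytic %f numeric %f\n", analytic, (f(xp) - f(x)) / h);
        return 0;
    }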

Alright, we've got the gradient. We can do one more simplification. Note that the sum inside it is equal to the output of the linear regression:

output = sum{1 <= i <= K}(v[i]*x[i]) 

So we can use it to compute the gradient:

df/dx[k] = 2 * v[k] * (output - b)
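
Putting it all together, one stochastic descent step for a single training case on the last layer can be sketched like this (a simplified standalone version with made-up names, the real FloatNeuralNet logic shown below is more involved):

    // One stochastic gradient descent step on the weights x for one
    // training case (v, b), with the learning rate "rate".
    void descentStep(const std::vector<double> &v, double b,
        std::vector<double> &x, double rate)
    {
        double output = linearStage(v, x);
        double s = 2. * (output - b); // the part common to all dimensions
        for (size_t k = 0; k < x.size(); k++)
            x[k] -= rate * v[k] * s; // df/dx[k] = v[k] * s
    }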

After the gradient is computed, applying it is trivial. In the FloatNeuralNet code https://sourceforge.net/p/triceps/code/1707/tree/trunk/cpp/nn/FloatNeuralNet.cpp the gradient values are computed in two steps, first:

    s_cur[i] = 2 * (out - outputs[i]); // 2 * (computed output - expected output)

then in the next loop s_cur gets moved into s_next and

    Value gradient = input * s_next[i]; // i.e. v[k] * 2*(output - b), as in the formula
    Value newwt = saturate(weights_[weightpos] - rate * gradient); // the descent step, clamped

Another thing to note is that this computation of gradients is valid only for the simple quadratic formula. If we were to insert an activation function after the linear regression, all the math would change, and I guess computing the gradients with the sigmoid included is too much work.

And the third thing to note is that batching as it's normally done is not equivalent to computing the minimum on the whole training set. Batching combines the training examples, and hence the errors produced on them, linearly. The minimization on the whole set combines the errors quadratically. Although I guess nothing really prevents us from doing a small number of training examples separately and then combining them quadratically. It would be a quadratic batching. But I'm not sure if it's worth the trouble. If nobody does it, it probably isn't.
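
For comparison, the normal linear batching can be sketched like this (again reusing the made-up linearStage()), with the gradients from the cases in a batch simply summed up and averaged before the single weight update:

    // One descent step over a batch of training cases: the per-case
    // gradients combine linearly, then the weights get updated once.
    void batchStep(const std::vector<std::vector<double>> &v,
        const std::vector<double> &b, std::vector<double> &x, double rate)
    {
        std::vector<double> grad(x.size(), 0.);
        for (size_t j = 0; j < v.size(); j++) {
            double s = 2. * (linearStage(v[j], x) - b[j]);
            for (size_t k = 0; k < x.size(); k++)
                grad[k] += v[j][k] * s;
        }
        for (size_t k = 0; k < x.size(); k++)
            x[k] -= rate * grad[k] / v.size(); // average over the batch
    }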

By now this post is growing too large, so I'll talk about the further backpropagation in the next post.
