Monday, October 3, 2022

optimization 5 - backpropagation further backwards

In the previous post we've worked out the last layer; how do we backpropagate from it? Backpropagation fundamentally means asking: how do we modify the preceding layer to change the vector v? That can be done by doing a gradient step on v. So they turn the expression around and say that now v is the variable and x is the constant (yeah, my attempt from the last post to avoid having x as a constant was ultimately in vain, but it got us through a good deal of description). Applying the symmetrical logic we get:

df/dv[k] = 2 * x[k] * (output - b)

Note that here x[k] has the value from before its own gradient step. The estimation of the gradient by v could be improved a little by using the new value of x and recomputing the new value of output. But since the step is small, it probably doesn't matter much, and computing the new output value is extra work.
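
To make this concrete, here is a minimal sketch of that turned-around gradient step (with made-up names, not the actual FloatNeuralNet code): the weights x and the training output b are held constant, and the gradient of the squared error is taken by the inputs v:

    #include <vector>

    // One last-layer neuron: output = sum(x[k] * v[k]), error f = (output - b)^2.
    // A gradient step on the inputs v, with the weights x held constant.
    void gradientStepOnInputs(const std::vector<double> &x, std::vector<double> &v,
                              double b, double rate)
    {
        double output = 0.;
        for (size_t k = 0; k < x.size(); k++)
            output += x[k] * v[k];
        // df/dv[k] = 2 * x[k] * (output - b), using the pre-step values of x
        for (size_t k = 0; k < v.size(); k++)
            v[k] -= rate * 2. * x[k] * (output - b);
    }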

In FloatNeuralNet code this happens in the line

    wsigma += weights_[weightpos] * s_next[i]; // accumulates df/dv across the next-layer neurons

where s_next[i] was previously set to (2 * (out - outputs[i])). The addition (the accumulation with +=) is there because of the next part: each neuron in the inner layers affects multiple neurons in the next layer.

To show what is going on, let's consider a small example: a layer-before-last neuron with 3 inputs that feeds into two neurons of the last layer, in beautiful ASCII art:

                                        +--> *x[1][1]----V
                                        |    *x[1][2]--> sum = f1 -> -b[1]
u[1] -> *y[1]----V                      |    *x[1][3]----^
u[2] -> *y[2]--> sum = s -> act = v[1] -+
u[3] -> *y[3]----^                      |
                                        +--> *x[2][1]----V
                                             *x[2][2]--> sum = f2 -> -b[2]
                                             *x[2][3]----^

Here I've used the variable names u and y for the inputs and weights of the new neuron. After summation it produces the value s, and after the activation function the value v[1]. This value v[1] is then fed as the first input of the last-layer neurons. The other inputs don't matter here, so they are not shown. The weights of the last-layer neurons are still x, but now with an extra index to tell which neuron they belong to. The results of the last-layer neurons are f1 and f2, which are compared with the training outputs b[1] and b[2].

Let's consider for now that the activation function just passes its argument through (such as ReLU for the positive values), and so v[1] = s.

The full error (AKA loss) function f that we minimize will be the sum of the squared errors of the two last-layer neurons:

f = (f1 - b[1])^2 + (f2 - b[2])^2

The partial derivative by v[1] (AKA the first dimension of the gradient of f) will likewise be a sum, with one term per last-layer neuron:

df/dv[1] = d((f1 - b[1])^2)/dv[1] + d((f2 - b[2])^2)/dv[1] =
    = 2 * x[1][1] * (f1 - b[1]) + 2 * x[2][1] * (f2 - b[2])

So that's where the addition comes from: the derivatives from the last-layer neurons are added up to get the derivative of the whole function that we're minimizing.
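
As a sanity check, here is the same accumulation written out with made-up numbers for the example above (flat variables named after the diagram; just an illustration, not the FloatNeuralNet internals):

    #include <cstdio>

    int main()
    {
        // Made-up values for the example in the diagram above
        double x11 = 0.3, x21 = -0.2; // first weights of the two last-layer neurons
        double f1 = 0.7, f2 = 0.1;    // outputs of the two last-layer neurons
        double b1 = 1.0, b2 = 0.0;    // training outputs

        // df/dv[1] accumulated over the two last-layer neurons,
        // just like wsigma accumulates weights_[weightpos] * s_next[i]
        double df_dv1 = 0.;
        df_dv1 += x11 * 2. * (f1 - b1); // contribution of the first neuron: -0.18
        df_dv1 += x21 * 2. * (f2 - b2); // contribution of the second neuron: -0.04

        printf("df/dv[1] = %g\n", df_dv1); // prints -0.22
        return 0;
    }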

In the inner layers there are no squares on the outputs, so the derivatives are much easier to compute. If the activation function is pass-through then:

dv[1]/dy[k] = u[k]

If the activation function is not pass-through, its derivative gets included simply as a factor:

dv[1]/dy[k] = u[k] * (dv[1]/ds)
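
Before getting to the actual FloatNeuralNet lines below, here is the same combination spelled out as a standalone sketch (the names are invented for illustration; der stands for the activation function derivative dv[1]/ds, and df_dv1 for the value of df/dv[1] propagated from the next layer):

    #include <vector>

    // Sketch of the weight update of one inner-layer neuron, not the real
    // FloatNeuralNet code. u: the neuron's inputs, y: its weights, der: the
    // activation derivative dv[1]/ds, df_dv1: df/dv[1] from the next layer.
    void updateInnerNeuron(const std::vector<double> &u, std::vector<double> &y,
                           double der, double df_dv1, double rate)
    {
        for (size_t k = 0; k < y.size(); k++) {
            // df/dy[k] = (df/dv[1]) * (dv[1]/dy[k]) = df_dv1 * der * u[k]
            double gradient = df_dv1 * der * u[k];
            y[k] -= rate * gradient;
        }
    }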

In FloatNeuralNet code this translates to the lines:

    wsigma += weights_[weightpos] * s_next[i]; // in an inner loop over the next-layer neurons
    s_cur[prev] = der * wsigma; // der is the activation function's derivative
    // s_cur gets moved to s_next for the next iteration on the preceding layer
    Value gradient = input * s_next[i];
    Value newwt = saturate(weights_[weightpos] - rate * gradient);

And then it continues propagating in the same way through more and more layers.
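
For the overall picture, here is a much simplified sketch of the whole backward pass as described above. It is not the actual FloatNeuralNet implementation: the names are invented, the layers are assumed fully connected, the activation derivatives are assumed to have been saved during the forward pass, the last layer's activation is assumed to be pass-through, and the weight saturation is omitted:

    #include <vector>

    typedef double Value;
    typedef std::vector<Value> Vec;
    typedef std::vector<Vec> Mat;

    // weights has one matrix per layer 1..nlayers (weights[l-1] for layer l),
    // weights[l-1][i][k] being the weight of input k of neuron i of that layer;
    // act[l][i] is the output of neuron i of layer l, act[0] is the network input;
    // der[l][i] is the activation derivative dv/ds of that neuron (der[0] unused);
    // b is the training output, rate is the descent rate.
    void backwardPass(std::vector<Mat> &weights, const std::vector<Vec> &act,
                      const std::vector<Vec> &der, const Vec &b, Value rate)
    {
        size_t nlayers = weights.size(); // act and der have nlayers + 1 entries

        // The last layer starts the propagation with s[i] = 2 * (output - b[i]).
        Vec s_next(act[nlayers].size());
        for (size_t i = 0; i < s_next.size(); i++)
            s_next[i] = 2. * (act[nlayers][i] - b[i]);

        // Walk the layers backwards.
        for (size_t l = nlayers; l >= 1; l--) {
            Mat &wl = weights[l - 1];      // this layer's weights
            const Vec &input = act[l - 1]; // the preceding layer's outputs

            // Compute s_cur for the preceding layer before touching the weights:
            // for each preceding neuron, add up the contributions of all the
            // neurons of this layer, then factor in its activation derivative.
            Vec s_cur(input.size(), 0.);
            if (l > 1) {
                for (size_t prev = 0; prev < input.size(); prev++) {
                    Value wsigma = 0.;
                    for (size_t i = 0; i < wl.size(); i++)
                        wsigma += wl[i][prev] * s_next[i];
                    s_cur[prev] = der[l - 1][prev] * wsigma;
                }
            }

            // The gradient step on this layer's weights (saturation omitted).
            for (size_t i = 0; i < wl.size(); i++)
                for (size_t prev = 0; prev < input.size(); prev++) {
                    Value gradient = input[prev] * s_next[i];
                    wl[i][prev] -= rate * gradient;
                }

            s_next = s_cur; // move on to the preceding layer
        }
    }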

An interesting consequence of all this is that on each layer there are two gradient descents going on: one by adjusting the weights of this layer, another by adjusting the previous layer. So if we really wanted to compute the perfect match for the weights of this layer and did it, then right after that the adjustment to the previous layers would overshoot the minimum towards the other side. So it's actually good that the descent rate value is on the small side, and it gets amplified by the number of layers:

The last layer descends by rate, and passes the gradient to the previous layer;
The next-to-last layer descends by rate, and passes the gradient to the previous layer;
... and so on.

So the effective descent rate is really (rate * n_layers), which means that the shallow neural nets can use a higher rate value than the deep ones. Well, I guess a counteracting force might be that on the way back the gradients might get more and more muddled and closer to 0, so the same rate in the earlier layers will produce a smaller effect. Apparently, that's why in the very deep networks (over 100 layers) they make some layers double-connected, feeding both the next layer and another layer that is maybe 20 or 50 layers later, to bring in the stronger gradients. I wonder if a similar effect can be obtained by using a larger descent rate in the earlier layers.
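
To illustrate this stacking of descents on a toy case (made-up numbers, a chain of two weights with pass-through activations), here is a quick check of how far the output moves when only the last weight does its gradient step versus when both weights step at the same rate:

    #include <cstdio>

    int main()
    {
        // Toy chain: output = w2 * (w1 * u), loss f = (output - b)^2.
        double u = 1.0, w1 = 1.0, w2 = 1.0, b = 2.0, rate = 0.01;

        double out = w2 * (w1 * u);
        double df_dw2 = 2. * (out - b) * (w1 * u); // gradient by the last weight
        double df_dw1 = 2. * (out - b) * (w2 * u); // gradient by the earlier weight

        // Step only the last layer's weight.
        double out_last_only = (w2 - rate * df_dw2) * (w1 * u);
        // Step both layers' weights at the same rate.
        double out_both = (w2 - rate * df_dw2) * ((w1 - rate * df_dw1) * u);

        printf("change with only the last layer stepping: %f\n", out_last_only - out);
        printf("change with both layers stepping:         %f\n", out_both - out);
        return 0;
    }

With both weights stepping, the output moves about twice as far toward b, which matches the rough (rate * n_layers) estimate for this 2-layer chain.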
