Wednesday, October 5, 2022

optimization 6 - neural networks corollaries

Writing up the mathematical derivation of the backpropagation got me thinking about what else can be done with it. So I've decided to write these ideas down before I forget. Not all of them, or maybe even none of them, will pan out; I haven't tried them yet. What bothers me is that some of these ideas should give substantial improvements, and they're not that hard to think of, so if they do give substantial improvements, someone would have already thought of them and they would be used all over the place. But I think they aren't. So either they're wrong and don't give the improvements, or maybe they are used and only I in my ignorance don't know about it, or maybe they really are new and worth trying.

First, I think I've been wrong in thinking that doing a gradient descent step based on all the training data is too much work. If we're going to minimize the error from all the training cases, then the function to minimize will be the sum of errors from all the cases. And if we want to compute a gradient, the gradient of the sum is equal to the sum of the gradients of the individual parts. And we can apply the exact same principle to the whole backpropagation. So doing this compound step won't even require that much memory: just a single accumulator per weight, keeping the running sum of the gradient. We'd backpropagate as usual, and as we compute the gradients, we'd add them to the sums but not change the weights yet. Only after a whole pass of the training is done, we'd use the accumulated gradients to change the weights once.
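Here is a minimal sketch of this accumulate-then-step logic in Python, on a made-up single-layer linear model with a squared error; the backprop() function and the data here are illustrative stand-ins, not the real network from the earlier posts:

```python
import numpy as np

def backprop(w, x, y):
    # Gradient of the squared error (w.x - y)^2 with respect to the weights w.
    return 2.0 * (np.dot(w, x) - y) * x

def full_pass_step(w, cases, rate):
    grad_sum = np.zeros_like(w)            # one accumulator per weight
    for x, y in cases:
        grad_sum += backprop(w, x, y)      # accumulate, don't step yet
    return w - rate * grad_sum             # one step after the whole pass

cases = [(np.array([1.0, 2.0]), 1.0),
         (np.array([2.0, 1.0]), 0.0)]
w = np.array([0.1, -0.1])
for _ in range(100):                       # one step per pass over all cases
    w = full_pass_step(w, cases, rate=0.01)
```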

But why bother with it? Well, there are methods to select a larger gradient step rate that will get us closer to the goal in one step (such as ISTA, which uses the Lipschitz constant of the gradient for this estimation). Just scale it down by the number of layers, to account for each layer trying to do its own equal share of the adjustment. And then we can also use the momentum methods, such as FISTA (it stands for Fast ISTA), to accelerate further, for each neuron. If this works right, it should accelerate the training big time, at least for the ReLU activation function that doesn't mess up the gradients much. So if nobody uses it, there should be something wrong with this idea. It will be interesting to try. Maybe something does get lost when many small steps get replaced by one big cumulative step.
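For reference, here is a sketch of the FISTA momentum applied to plain gradient steps. FISTA proper also has a shrinkage operator for the L1 penalty, which I'm leaving out since there is no L1 term here; the grad() function and the Lipschitz constant L are stand-ins:

```python
import numpy as np

def fista_minimize(w0, grad, L, n_steps):
    # L is the Lipschitz constant of grad; 1/L is the "safe" step rate.
    w_prev = w0.copy()
    y = w0.copy()                          # the extrapolated point
    t = 1.0
    for _ in range(n_steps):
        w = y - grad(y) / L                # plain gradient step from y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w + ((t - 1.0) / t_next) * (w - w_prev)  # momentum extrapolation
        w_prev, t = w, t_next
    return w_prev

# Example: minimize (w - 3)^2; its gradient 2*(w - 3) has Lipschitz constant 2.
w = fista_minimize(np.array([0.0]), lambda w: 2.0 * (w - 3.0), L=2.0, n_steps=50)
```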

Potentially, a larger ISTA-like step rate would work on the small individual steps too, just scale the steps down more, by the number of training cases. However, there is a catch in this approach: if each step takes us closer to the optimum of one training case, then the last training case will have the strongest effect, while the effect from the first training case will be most diluted. Adding them up equally, with each case trying to do its step from the same initial position, evens them out. If the steps are small anyway, this difference doesn't matter, but if we're aiming for the larger steps, maybe it would. On the other hand, if all the cases are pushing the outputs in about the same direction anyway, doing many small separate steps would follow the gradient curve better than one large step (because, remember, we would see the zigzags with the large steps). And if we do many training cases, and repeat them in many passes, there would be no lasting relative advantage between the cases, they would be constantly trading places, the first one again becoming the last one (except on the last pass). So maybe that's why nobody accumulates the gradients. But even then I still think that a larger Lipschitz-based rate can be used, and maybe even the momentum can be carried between the passes.

The cumulative gradient for the last layer can also be used to estimate one more thing: how close we are to the optimum, and so whether it's time to stop. The squared error itself can also be used for this estimation, but the trouble with it is that we don't know how close to zero it can ever potentially get for a given configuration of a neural network. If there are many more training cases than neurons (which is usually the case) then the fit won't be perfect, and we don't know in advance how good it could be. The gradient in the last layer will be gradually dropping (more or less) as we're getting closer to the optimum, so it would be a good indicator of how far we still are. Together, the gradient and the squared error can also be used to detect overfitting, when the error near the optimum becomes too small. Well, maybe the cumulative gradient for the last layer, whose parts are generated not all from the same starting point but over one pass of the moving network, would work just as well.
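The stopping test itself is trivial once the cumulative gradient is available; the threshold here is an arbitrary assumption that would have to be tuned:

```python
import numpy as np

def should_stop(grad_last_layer, threshold=1e-4):
    # Near the optimum the gradient goes to zero even when the error
    # itself can't, so its norm estimates the remaining distance.
    return np.linalg.norm(grad_last_layer) < threshold
```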

Another problem with a possible solution is: what if the random seeding of the weights generates two neurons in the same layer that are very close to each other, or even identical? They will try to stay close, and together will be a waste of space, something I've seen in my very small test examples. The solution might be that we don't have to use the same step rate for all the neurons. If we use a slightly different rate for different neurons in the layer, it would gradually push them apart, and the farther apart they get, the greater should be the difference in the pressure from the gradients, so this should naturally spread them in a more useful way. But each one would still be pushed towards achieving the global optimum, just at a different speed.
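A sketch of what this might look like, with the plus-minus 10% spread being an arbitrary guess:

```python
import numpy as np

def per_neuron_rates(n_neurons, base_rate, spread=0.1):
    # Evenly spaced rate multipliers in [1-spread, 1+spread], one per neuron.
    return base_rate * np.linspace(1.0 - spread, 1.0 + spread, n_neurons)

def update_layer(weights, grads, rates):
    # weights, grads have shape (n_neurons, n_inputs); rates has (n_neurons,).
    return weights - rates[:, np.newaxis] * grads
```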

This got me thinking about yet another problem: what if a neuron with ReLU becomes dead? I.e., none of the inputs produce a positive value, and after the ReLU activation, the neuron always produces 0. One possibility that I didn't notice before is that the changes in the lower-layer neurons might eventually resurrect the dead neuron in a natural way by changing its inputs. The exception is the neurons of layer 0, which will never see their inputs change. So maybe it's worth waiting longer for the higher-layer neurons to become resurrected naturally before resurrecting them forcibly by reinitialization.
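Detecting the dead neurons is easy if the pre-activation values are collected over one pass; a sketch, assuming they sit in a (cases x neurons) matrix:

```python
import numpy as np

def find_dead_neurons(preactivations):
    # preactivations: shape (n_cases, n_neurons), collected over a full pass.
    # A neuron that never went positive on any case is dead under ReLU.
    return np.all(preactivations <= 0.0, axis=0)
```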

And it's been bothering me that I did this reinitialization by generating new random numbers, which is both unpredictable and, as an immediate effect, messes up the quality of the whole network. So maybe a way to avoid the unpredictability is by adding the weights from another neuron at a fixed relative position to this one. They've both been random to start with, so now we're adding together pairs of random numbers, which should still give random numbers. If the technique above for separating the equal neurons works well, it can even be done without disturbing the quality of the network: copy the weights from another neuron of the same layer, and split the weight corresponding to that other neuron in each neuron of the next layer into two parts (either equal, or maybe better not) and put the other part into the weight corresponding to the resurrected neuron. Initially the two will together produce the same input to the next layer as the old neighboring neuron did alone, and then hopefully the backpropagation pressure will be pushing them further apart.
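A sketch of this copy-and-split resurrection; the 0.6/0.4 split is an arbitrary pick of the "maybe better not equal" kind, and everything is modified in place:

```python
import numpy as np

def resurrect_by_split(layer_w, next_w, dead, donor, frac=0.6):
    # layer_w: (n_neurons, n_inputs), weights of this layer
    # next_w:  (n_next, n_neurons), weights of the next layer
    # dead, donor: indices of the dead neuron and its live neighbor
    layer_w[dead] = layer_w[donor]                     # clone the donor's weights
    next_w[:, dead] = next_w[:, donor] * (1.0 - frac)  # split the outgoing
    next_w[:, donor] *= frac                           # weights between the two
    # The next layer still sees the same total input: frac + (1-frac) = 1.
```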

Or yet another solution to the unpredictability might be to simply change the signs of all the weights in a dead neuron: then it will be firing on all the inputs! And to avoid the disruption, set all its connected weights in the next layer to 0.
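The sign-flip version is even shorter (this assumes the bias is folded in among the weights, so negating everything negates the whole pre-activation):

```python
import numpy as np

def resurrect_by_flip(layer_w, next_w, dead):
    layer_w[dead] = -layer_w[dead]   # every pre-activation flips sign: now >= 0
    next_w[:, dead] = 0.0            # no immediate effect on the next layer
```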

Yet another problem seemed to be that the gradient is proportional to either the input value (for the gradient of the weight) or to the weight (for the gradient of the input value), and so if one becomes 0, the other one stops changing. If both ever become 0, they get stuck there forever. But if at least one of them is not 0, it might be able to push the other one out, which would also provide a positive reinforcement to itself, so they would bootstrap up and it really isn't a problem. But maybe this doesn't bootstrap too well, or people would not recommend making the initial random values larger. So maybe a solution there would be to forcibly change a value that is too close to 0 to a larger one for the purpose of the gradient computation. It will overestimate the gradient, but that will either be a good thing, or the next pass (or even step) will restore the equilibrium by pushing the values back to 0. And we can also keep some hysteresis, and pull the gradient artificially up only if it would move the point in the same direction as before.
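A rough sketch of this floor-with-hysteresis idea; the floor value and the whole interface are assumptions, since I haven't worked out where exactly in the backpropagation it would plug in:

```python
import numpy as np

def floored(value, floor=0.01):
    # Pull a near-zero multiplier away from 0, keeping its sign.
    if abs(value) < floor:
        return floor if value >= 0.0 else -floor
    return value

def grad_with_hysteresis(multiplier, error_term, last_step):
    raw = multiplier * error_term               # the honest gradient
    boosted = floored(multiplier) * error_term  # the overestimated one
    # Hysteresis: boost only if the step (minus the gradient) would move
    # the weight in the same direction as its previous step did.
    if last_step != 0.0 and np.sign(-boosted) == np.sign(last_step):
        return boosted
    return raw
```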

A kind of long list of things to try; next I'll need to find the time for it all!
