Saturday, February 18, 2023

arrested bouncing in FloatNeuralNet

It's been a while since I wrote any updates about my experiments with optimization of neural networks. Partially it's because I've been busy with other things, and partially because I tried something a few weeks ago, it didn't work out well, and I wasn't sure what to do with the results. On one hand, since it didn't work out, carrying this complexity into all the future code would be annoying, and even gating this code behind an option would have an unpleasant effect on the performance. On the other hand, I think these experiments were really interesting, and maybe I'd want to do another variation of them in the future, so I didn't want to just discard the code (I've already discarded some code in the past only to re-create it later). So I've been stuck thinking about what to do with it, and I couldn't do the other experiments in the meantime.

I think I have a good solution for this now: the "mini-branches". I've copied the original code, the modified code, and a description into cpp/nn/xperim, one subdirectory per "mini-branch", where it can be found, diffed, and reapplied if needed.

The mini-branch for those last experiments is https://sourceforge.net/p/triceps/code/1803/tree/trunk/cpp/nn/xperim/001802-bounce/ (1802 is the root revision for this mini-branch in SVN on SourceForge).

Both of these experiments center around the momentum descent. To recap the previous state: I've been using a momentum descent loosely based on FISTA, with some extensions. One of these extensions is that whenever the gradient changes sign in some dimension, I kill the momentum in that dimension. Another is that I use these gradient sign changes as an indicator to auto-detect the safe basic gradient descent rate: I keep growing this rate until many (for some definition of "many" that I've been trying to tune with various experiments) dimensions start flipping their gradient signs after each step, then I back it off, then continue growing again.
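
To make this logic concrete, here is a minimal sketch of that per-dimension update. This is not the actual FloatNeuralNet code; the structure, the names, the growth and back-off factors, and the "many" threshold are all made up for illustration.

#include <cstddef>
#include <vector>

struct Dim {
    double weight = 0.;    // the trained weight
    double velocity = 0.;  // accumulated momentum for this dimension
    double lastGrad = 0.;  // gradient seen on the previous step
};

// One training step over all dimensions. nu is the momentum coefficient,
// rate is the basic descent rate that gets auto-adjusted.
void step(std::vector<Dim> &dims, const std::vector<double> &grads,
    double nu, double &rate)
{
    std::size_t flipped = 0;
    for (std::size_t i = 0; i < dims.size(); i++) {
        Dim &d = dims[i];
        double g = grads[i];
        if (g * d.lastGrad < 0.) {
            d.velocity = 0.;  // sign flip: kill the momentum in this dimension
            ++flipped;
        }
        d.velocity = nu * d.velocity - rate * g;
        d.weight += d.velocity;
        d.lastGrad = g;
    }
    // Auto-detection of the safe basic rate: keep growing it until "many"
    // dimensions flip their gradient sign on a step, then back it off.
    if (flipped * 10 > dims.size())  // the "many" threshold here is a pure guess
        rate *= 0.5;
    else
        rate *= 1.02;
}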

The premise of the first experiment started with me noticing that I had unintentionally changed some logic from FISTA. FISTA has an interesting computation for the coefficient "nu" that determines how much of the momentum is used in each step. It starts with no momentum to apply on the first step, then on the second step it skips the momentum too, then it bumps nu to 1, and then gradually reduces it to "apply the brakes" and avoid circling too much around the optimum. But killing the momentum on the gradient sign change serves as an alternative application of the brakes. So I've removed that gradual reduction of nu, and things worked better.
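
As I understand my own change, the resulting schedule of nu looks roughly like this (a hypothetical sketch, not the actual code; the removed braking logic is only hinted at in the comments):

// Momentum coefficient by step number, with the gradual FISTA-style
// reduction ("braking") removed, since the sign-change momentum kill
// now serves as the brakes instead.
double momentumCoeff(int stepNum)
{
    if (stepNum < 2)
        return 0.;  // no momentum on the first two steps
    return 1.;      // then full momentum, no gradual reduction
}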

But I've noticed that I didn't handle the second step right: it wasn't keeping the momentum at 0. My guess has been that this was done to avoid including into the momentum the large first step in a random direction from the random initial point. So I set out to fix it. But it made things worse: the error reduction slowed down. Evidently, the first step is not so random after all. Or maybe my momentum-killing code is very good at eliminating the effects of the truly random first steps, and the remaining ones are useful.

The premise of the second experiment was to detect the situation when some dimension has the sign of its gradient bouncing back and forth on each step without reducing much in absolute value. This is what happens when the model starts tearing itself apart because the descent step is too large: it keeps overshooting the optimum by a larger and larger value. It could also happen because of the momentum, but the momentum already gets reset to zero in my code when the gradient sign changes, so when we detect a gradient dimension changing sign twice in a row while staying above 0.75 of its original absolute value, that has to be because of the large descent rate. So how about making only half of the originally intended step in this dimension? That should land us much closer to the optimum. That's what I've done in this experiment.
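
Roughly, the detection looks like this per dimension (a hypothetical sketch with made-up names, not the actual code; in particular, which gradient value counts as the "original" one for the 0.75 comparison is my guess):

#include <cmath>

struct BounceState {
    double prevGrad = 0.;      // gradient from the previous step
    double prevPrevGrad = 0.;  // gradient from two steps back
};

// Returns the scaling of the step in this dimension: 0.5 when the gradient
// has flipped sign twice in a row while keeping more than 0.75 of its
// absolute value, 1.0 otherwise.
double stepScale(BounceState &st, double grad)
{
    bool flippedNow = grad * st.prevGrad < 0.;
    bool flippedBefore = st.prevGrad * st.prevPrevGrad < 0.;
    bool stillLarge = std::fabs(grad) > 0.75 * std::fabs(st.prevPrevGrad);
    st.prevPrevGrad = st.prevGrad;
    st.prevGrad = grad;
    return (flippedNow && flippedBefore && stillLarge) ? 0.5 : 1.;
}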

And it didn't work out either: the convergence became slower. A potential explanation for this effect is the following. Consider a trough with a ball rolling down it. The trough is not exactly straight, and the ball isn't started exactly straight, so as it rolls down the trough, it keeps wobbling left and right, overshooting the center one way and then the other. As long as the ball doesn't go into resonance and start wobbling more and more (eventually falling out of the trough), this wobble is actually not such a big deal. But when I dampen the wobble, it also dampens the speed of rolling down the trough, and since the momentum also gets killed on every wobble, the whole descent slows down.

So it's something to think about for the future. It probably means that the speed could be improved by not killing the momentum on every change of the gradient sign, but only when the absolute change of the gradient becomes too large. But I'm not sure yet how this "too large" can be defined reliably.

It likely also means that going to individual descent rate coefficients for each dimension (something I've contemplated based on both my own experiments and on reading the papers) might not work so well, because it's likely to cause the same kind of slow-down. We'll see; I'm now unblocked to try more experiments.

not MNIST

I've been reading a little more about what other people do with the MNIST recognition. Apparently, applying transformations to the training set is fairly common, so maybe a simple linear stretching and shrinking to multiply the number of examples would be good enough.
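
For example, a stretch or shrink along one or both axes could look something like the sketch below. This assumes the images are stored as row-major vectors of grayscale values; the nearest-neighbor sampling and the fixed output size are my own choices for illustration, not something taken from the papers.

#include <cmath>
#include <vector>

// Rescale a wd x ht grayscale image by factors sx, sy (e.g. 0.9 or 1.1)
// around its center, using nearest-neighbor sampling and keeping the same
// output size; pixels that map outside the source stay at 0 (background).
std::vector<double> stretch(const std::vector<double> &img, int wd, int ht,
    double sx, double sy)
{
    std::vector<double> out(img.size(), 0.);
    double cx = (wd - 1) / 2., cy = (ht - 1) / 2.;
    for (int y = 0; y < ht; y++) {
        for (int x = 0; x < wd; x++) {
            // Map the output pixel back to its source position.
            int srcx = (int)std::lround(cx + (x - cx) / sx);
            int srcy = (int)std::lround(cy + (y - cy) / sy);
            if (srcx >= 0 && srcx < wd && srcy >= 0 && srcy < ht)
                out[y * wd + x] = img[srcy * wd + srcx];
        }
    }
    return out;
}

Applying this with a few factors such as 0.9 and 1.1 along each axis would multiply every training example several-fold.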

But the most interesting discovery has been that the dataset I've been playing with is not MNIST. Surprise, surprise. :-)

The MNIST data set is much larger, and also has a higher resolution. What I have might be the older NIST dataset (without the "M" that stands for "Modified"), or maybe not even that. For some reason I thought that the MNIST set comes from ZIP code recognition, so that's what I was looking for, a ZIP code dataset, and apparently I thought wrong. An interesting thing about the original NIST dataset is that it had the training set and the test set collected from different demographics (one from schoolchildren, the other from employees of a government agency), so I guess if that's what I've got, it would explain why the sets don't represent each other so well. That was apparently a known major complaint about the older dataset, which got straightened out in the new modified one.