Sergey Babkin on CEP and stuff: optimization 11 - FISTA momentum descent in FloatNeuralNet

Continuing from the previous post, I've had a look at the characters in the test set that have trouble getting decoded. Well, they truly are handwritten, I've had difficulties recognizing some of them as digits at all. But a good deal of them are rather similar variations on the digit "6" that I guess are just not represented well in the training set. For many of these digits the even highest recognized class has the value deep in the negative zone, which means that they're not like anything in training. I think there should be some way to adapt the training set, such as expand it by including the digits shifted left and right by a pixel or two. Need some more thinking on that.

I've looked at the training set and counted how many training cases have their best guess negative (even if it's the correct one), and it came out at about 16.5%. I've tried to shift the training weights to prioritize cases with the low best guesses just as I've done with the wrong guesses, and it didn't help at all, the best I could scrape out was about an extra 0.2%. Maybe it's just because for every 1 positive output there are 9 negative outputs, and on average the values tend to be pulled down.

In the meantime, I've added the momentum descent, based on the FISTA algorithm. I've learned some tricks playing with my version of FISTA that I plan to add later, but so far just a very basic version, with only one trick included: when applying the momentum, compute the saturation by the limits, and adjust the momentum for it. So that if some dimension hits the limit, the momentum won't continue pushing it past the limit again and again but will reset on that dimension.

I've been trying to get the automatic descent rate deduction working properly before implementing the momentum but I gave up on that for now and decided to just try the momentum. It is a bit sensitive to the training rate being too high, maybe because there is a higher chance of driving into a situation where a high rate would give an overshoot. But bump the rate down, and the momentum descent works very, very, very well, doing a much faster progress. The mean square error for the training set went down from the previous-best 0.025 to 0.01. The bad news is that it didn't help with the test set at all: it again goes better up to a limit, and after that as the training set's stats get better, test set's stats get worse. So I think I really need to accept that the test set really truly contains the characters that are not well represented in the training set. And probably the convolution won't fix it either.

Given that the momentum descent works so awesomely, I don't understand why isn't everyone using it. It adds cost per pass but then it reduces more the number of passes needed to achieve the same precision, giving a large net win. Or maybe they do use it and don't tell everyone. Or maybe it doesn't work well for all the neural networks and I've just got lucky with my example. I don't know.

The way FISTA works is basically by remembering the last step and repeating it before computing the gradient at the new position and doing a new step. Then remembering the total step (repeat + new step) for the next step. It also has a little optimization in scaling the remembered last step by a factor a little under 1, and very gradually reducing this factor. This downscaling is intended to reduce the overshoots and "circling the drain" when the position gets close to the optimum. It also limits the maximum momentum, when the downscaling starts shrinking the momentum more than the current gradient accelerates it. Another little thing it does is that it skips the repeat not only on the first step (where it just has nothing to repeat) but also on the second step. For all I can tell, the reason for this skipping of momentum on the second step is that the initial point is usually random, and if the gradient approximation is good, it would put the point after the first step into a decent vicinity of the optimum. Which means that the first step will be large, and also stepping in a random direction that is nowhere close to the direction of the following steps. Thus it's better to not include this first large random step into the momentum that will keep repeating.

The mechanics of enabling the momentum mode in FloatNeuralNet are: set the option

options.momentum_ = true

And now there also are methods for reading the best-result-still-below-0 stats:

getPassAbove() returns the number of cases in the last pass where the correct and best outcome is above 0

getPassNotAbove() returns the number of cases in the last pass where the best outcome is either incorrect or not above 0

Sergey Babkin on CEP and stuff

Thursday, December 22, 2022

optimization 11 - FISTA momentum descent in FloatNeuralNet

No comments:

Post a Comment

Links

About Me

Labels

Blog Archive