Sunday, December 25, 2022

optimization 12 - adaptable methods of neural network training

In the last post I've been wondering why the momentum descent methods aren't widely used for neural network training. And guess what, a quick search showed that they are, as are the auto-scaling methods.

In particular, here is an interesting paper:

https://arxiv.org/pdf/1412.6980.pdf

Adam: A Method for Stochastic Optimization

What they do actually looks more like a momentum method than like selecting the rate of descent (see the sketch of the Adam update after the list below). This paper has references to a few other methods:

  • AdaGrad
  • RMSProp
  • vSGD
  • AdaDelta
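
For reference, here is what the Adam update itself looks like, written out per-weight as a minimal sketch; the hyperparameter names and defaults (rate, beta1, beta2, epsilon) come from the paper, the function and variable names are my own:

#include <cmath>

// One Adam step for a single weight, following the update rule in the paper.
// g is the gradient for this weight, t is the 1-based step count.
// m and v are per-weight state kept between steps, starting at 0.
void adamStep(double &weight, double g, double &m, double &v, long t,
    double rate = 0.001, double beta1 = 0.9, double beta2 = 0.999,
    double epsilon = 1e-8)
{
    m = beta1 * m + (1. - beta1) * g;     // the momentum-like first moment
    v = beta2 * v + (1. - beta2) * g * g; // the second moment that scales the rate
    double mhat = m / (1. - std::pow(beta1, (double)t)); // corrections for the zero start
    double vhat = v / (1. - std::pow(beta2, (double)t));
    weight -= rate * mhat / (std::sqrt(vhat) + epsilon);
}

The first moment m is exactly a momentum accumulator, and the second moment v is what does the per-dimension auto-scaling of the rate.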

And I've found this algorithm through another website that talks about various ML methods:

https://machinelearningmastery.com/adam-optimization-from-scratch/

Thursday, December 22, 2022

optimization 11 - FISTA momentum descent in FloatNeuralNet

Continuing from the previous post, I've had a look at the characters in the test set that have trouble getting decoded. Well, they truly are handwritten; I've had difficulties recognizing some of them as digits at all. But a good deal of them are rather similar variations on the digit "6" that I guess are just not represented well in the training set. For many of these digits even the highest recognized class has a value deep in the negative zone, which means that they're not like anything in the training set. I think there should be some way to adapt the training set, such as expanding it by including the digits shifted left and right by a pixel or two. Need some more thinking on that.

I've looked at the training set and counted how many training cases have their best guess negative (even if it's the correct one), and it came out at about 16.5%. I've tried to shift the training weights to prioritize the cases with the low best guesses, just as I've done with the wrong guesses, and it didn't help at all; the best I could scrape out was about an extra 0.2%. Maybe it's just because for every 1 positive output there are 9 negative outputs, and on average the values tend to be pulled down.

In the meantime, I've added the momentum descent, based on the FISTA algorithm. I've learned some tricks playing with my version of FISTA that I plan to add later, but so far it's just a very basic version, with only one trick included: when applying the momentum, compute the saturation by the weight limits and adjust the momentum for it. So if some dimension hits the limit, the momentum won't keep pushing it past the limit again and again but will reset on that dimension.
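
In sketch form, that trick for a single dimension looks about like this (my own naming, not the actual FloatNeuralNet code):

// A momentum step for one weight, saturated at the limits [-limit, limit].
// If the step gets clipped, the remembered momentum is reset to the part
// that was actually applied, so it won't keep pushing into the limit.
void momentumStepSaturated(double &weight, double &momentum, double limit)
{
    double target = weight + momentum;
    if (target > limit)
        target = limit;
    else if (target < -limit)
        target = -limit;
    momentum = target - weight; // reset the momentum on this dimension
    weight = target;
}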

I've been trying to get the automatic descent rate deduction working properly before implementing the momentum, but I've given up on that for now and decided to just try the momentum. It is a bit sensitive to the training rate being too high, maybe because there is a higher chance of driving into a situation where a high rate would give an overshoot. But bump the rate down, and the momentum descent works very, very, very well, making much faster progress. The mean square error on the training set went down from the previous best of 0.025 to 0.01. The bad news is that it didn't help with the test set at all: it again improves up to a limit, and after that, as the training set's stats get better, the test set's stats get worse. So I think I really need to accept that the test set truly contains characters that are not well represented in the training set. And probably the convolution won't fix it either.

Given that the momentum descent works so awesomely, I don't understand why everyone isn't using it. It adds cost per pass, but it reduces the number of passes needed to achieve the same precision by much more, giving a large net win. Or maybe they do use it and don't tell everyone. Or maybe it doesn't work well for all neural networks and I've just got lucky with my example. I don't know.

The way FISTA works is basically by remembering the last step and repeating it before computing the gradient at the new position and making a new step. It then remembers the total step (repeat + new step) for the next time. It also has a little optimization of scaling the remembered last step by a factor a little under 1, and very gradually reducing this factor. This downscaling is intended to reduce the overshoots and the "circling the drain" when the position gets close to the optimum. It also limits the maximum momentum, once the downscaling starts shrinking the momentum more than the current gradient accelerates it. Another little thing it does is skip the repeat not only on the first step (where it just has nothing to repeat) but also on the second step. As far as I can tell, the reason for skipping the momentum on the second step is that the initial point is usually random, and if the gradient approximation is good, the first step would put the point into a decent vicinity of the optimum. Which means that the first step will be large, and will also go in a random direction that is nowhere close to the direction of the following steps. Thus it's better not to include this first large random step into the momentum that will keep getting repeated.
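
Putting that description together, one step of the scheme looks roughly like this; it's a sketch under my own naming, not the actual FloatNeuralNet code:

#include <vector>

// One step of a FISTA-style momentum descent over a weight vector.
// prevStep is the remembered total step from last time (initially all 0),
// downscale is the factor a little under 1 that damps the momentum,
// stepCount is the 1-based count of the steps made so far.
void fistaStep(std::vector<double> &w, std::vector<double> &prevStep,
    double rate, double downscale, long stepCount,
    void (*computeGradient)(const std::vector<double> &w, std::vector<double> &grad))
{
    size_t n = w.size();
    std::vector<double> repeat(n, 0.);

    // Skip the momentum repeat on the first two steps: the first one has
    // nothing to repeat, and the second would repeat the large random
    // initial step that points nowhere close to the following steps.
    if (stepCount > 2) {
        for (size_t i = 0; i < n; i++) {
            repeat[i] = prevStep[i] * downscale;
            w[i] += repeat[i]; // repeat the damped last step
        }
    }

    // Compute the gradient at the new position and make the new step.
    std::vector<double> grad(n);
    computeGradient(w, grad);
    for (size_t i = 0; i < n; i++) {
        double step = -rate * grad[i];
        w[i] += step;
        prevStep[i] = repeat[i] + step; // remember the total step
    }
}

The saturation trick from above would slot in at the points where the steps get applied to the weights.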

The mechanics of enabling the momentum mode in FloatNeuralNet are simple: set the option

options.momentum_ = true;

And now there are also methods for reading the best-result-still-below-0 stats:

getPassAbove() returns the number of cases in the last pass where the correct and best outcome is above 0

getPassNotAbove() returns the number of cases in the last pass where the best outcome is either incorrect or not above 0
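
So the fraction of cases with the best result below 0 can be computed after a pass like this (assuming a FloatNeuralNet instance nn):

double fracNotAbove = (double)nn.getPassNotAbove()
    / (double)(nn.getPassAbove() + nn.getPassNotAbove());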

Tuesday, December 20, 2022

FloatNeuralNet classifier training mode

Continuing the story from the previous post, I've been experimenting with the MNIST classifier of handwritten digits from ZIP codes some more, and I've tried to improve on it by adding a special mode for gradient computation:

The point of a classifier is that it produces the highest value at the output that corresponds to the correct classification, and the exact values on the outputs don't matter that much. So when training, it makes sense to check whether a particular training case already produces the correct result, and if so, "dampen" it to give the training cases that are still producing incorrect results a greater chance to shift the weights for their benefit. I've done this "dampening" by multiplying the gradients of the correct cases by a value less than 1. I've started with this value set at 1/1000, but that tended to produce rather wild swings where a small number of incorrect cases would pull the weights towards themselves, only to make a large number of formerly correct cases incorrect. I've reduced it to 1/100 and that worked out quite well. I'm not sure if the selection of this multiplier depends on the number of training cases; FWIW, the MNIST set has a little over 7000 training cases.
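
In sketch form, the per-case dampening amounts to something like this (my own naming, not the actual FloatNeuralNet internals):

#include <vector>

// Scale the gradient contribution of one training case: if its highest
// output is already the correct one, dampen it so the wrong cases dominate.
double caseWeight(int correctLabel, const std::vector<double> &outputs,
    double correctMultiplier /* = 0.01 */)
{
    int best = 0;
    for (size_t i = 1; i < outputs.size(); i++)
        if (outputs[i] > outputs[best])
            best = (int)i;
    return (best == correctLabel) ? correctMultiplier : 1.;
}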

And to make sure that the training doesn't get too slow, I've added auto-rescaling of the training rate, so that the potential sum of the gradients would total up the same as without the dampening. This made things a bit more touchy, since there is a greater chance of the gradient getting dominated by only one training case, and a smaller base rate (I've reduced it by a factor of 3) was needed to avoid the meltdown into overshoots. Maybe I'll embed this safety factor into the auto-rescaling at some point; I'm just not sure yet how to pick it for the general case.
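
The rescaling itself amounts to something like this sketch (again my own naming, not the actual code):

// Rescale the training rate so that the total gradient weight of the pass
// stays the same as if no dampening was applied to the correct cases.
double rescaledRate(double baseRate, size_t nCorrect, size_t nIncorrect,
    double correctMultiplier)
{
    double effective = (double)nIncorrect + (double)nCorrect * correctMultiplier;
    return baseRate * (double)(nCorrect + nIncorrect) / effective;
}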

The end result is a bit mixed. It worked very, very well for the training set, bringing the error rate (i.e. the number of cases that produce a wrong value) to zero. And the mean square error came out the same or slightly better than with the plain implementation, meaning that the previously neglected cases not only get attention but also get incorporated better into the common model and don't hurt the others. The mean square error on the test set also went slightly better, the best I've seen so far at 0.193 (an improvement of almost 0.01 over the plain version). But the error rate on the test set, which I hoped to improve, still stayed at around 5.5%, so no real improvement there. I've seen it go down to 5% and then grow again on the following rounds of training, while the mean square error kept getting better.

So as far as I can tell, the test set really does contain some characters that are substantially unlike any of the characters in the training set. And probably the only way to do better is to actually look at them, figure out what is different about them, and then manipulate the training set to introduce variability by generating multiple slightly different versions of each character. I've kind of got interested in this puzzle, so maybe I'll find time to try this. Or maybe I'll try to implement the convolution first, but by now I doubt that it would help much with this specific problem on its own. Maybe a general way to approach this variability would be to do some unsupervised classification of the images of the same digit, to produce the groups of distinct ways to write it. But that looks difficult.

Another simple thing to try would be to make the classifier preference even stronger, requiring not only that the correct output is the highest but also that it beats the next highest output by a certain margin. And maybe even vary the gradient weight in inverse proportion to this margin. But I doubt that it would solve the problem with the stubborn 5.5% of the test set.

Yet another interesting development is that the CORNER activation seems to work quite well. I've added a third layer with it, and it produced better results. My previous impression that the deeper networks train slower was based on a limited number of training passes: the larger models are indeed slower there, but as the pass count goes into multiple thousands, more neurons do help to squeeze out a better quality. Of course, each pass is slower with more neurons, but as far as the progress per pass is concerned, the larger networks do better on the later rounds.

Where the new feature is in the code: 

https://sourceforge.net/p/triceps/code/1779/tree/

Setting the FloatNeuralNet option isClassifier_ to true enables this new classifier training mode. The option correctMultiplier_ defines the dampening gradient multiplier for the cases that are already correct (the default of 0.001 there is a bit too much of a penalty, 0.01 looks better). And since the correct and incorrect cases are counted for re-scaling the training rate appropriately, these counters are now also available externally via the methods:

size_t getPassCorrect() const;
size_t getPassIncorrect() const;

This makes it easy to compute the error rate from the results of a training pass. These counters get computed only if the classifier training mode is enabled, and they show the error rate as it was before the last training pass; it will change after applying the gradient. The classifier mode can be used both with the piecemeal steps of applying the training rate in the train() method and with applyGradient(), but only applyGradient() is able to do the automatic scaling of the training rate to compensate for the shrinkage.
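
For example, to get the error rate after a pass (again assuming an instance nn):

size_t total = nn.getPassCorrect() + nn.getPassIncorrect();
double errorRate = (double)nn.getPassIncorrect() / (double)total;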


Sunday, December 18, 2022

FloatNeuralNet on the MNIST set, and checkpointing

Since I've been experimenting with neural nets on an "unnatural" example, I've been wondering how the things I've tried would work on a more natural one. So I've made a demo for the MNIST example of recognizing the handwritten digits from ZIP codes. It's implemented in the branch https://sourceforge.net/p/triceps/code/1776/tree/ , in the file nn/demo/d_FnnZip.cpp.

The code looks for the MNIST files, which can be downloaded from https://hastie.su.domains/StatLearnSparsity_files/DATA/zipcode.html, placed in the subdirectory "zipnn" of any ancestor directory (so I usually put it next to Triceps's top directory).

The original files are scans with 16x16 dots per digit, but that turned out to be rather slow in processing, so I've made the demo downscale them to 8x8 before processing.
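
The downscaling itself is just a 2x2 averaging; as a sketch of the idea (not the actual demo code):

// Downscale a 16x16 digit image to 8x8 by averaging each 2x2 block.
void downscale16to8(const float src[16][16], float dst[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r][c] = (src[2*r][2*c] + src[2*r][2*c + 1]
                + src[2*r + 1][2*c] + src[2*r + 1][2*c + 1]) / 4.f;
}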

Another thing that was rather disappointing was that I hadn't been saving the state of a trained NN before. That's OK for small and quick training, but with a slower and longer one, the results really need to be preserved. So I've added the FloatNeuralNet methods checkpoint() to dump the state to a file and uncheckpoint() to get it back. This can also be used creatively: do some training with some settings, then load the state into an instance with different settings to see if something would improve.
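
As a sketch of the usage, assuming that both methods take a file name (the exact signatures may differ, and the path here is made up):

nn.checkpoint("zipnn/model.ckpt");    // dump the trained state to a file
// ... later, possibly in an instance created with different settings:
nn2.uncheckpoint("zipnn/model.ckpt"); // restore the state and keep training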

With 8x8, that's 64 inputs, and 10 outputs for the 10 recognized digits; as a baseline I've used 2 inner layers of 64 neurons each: 64->64->64->10. But I've also experimented with more and fewer neurons, and more layers.

So, what are the observed results?

First, the RELU variations work quite decently, probably because the original data is already centered on the [-1, 1] range. 

The larger models are slower to train, not just in execution time per pass, but also in that each pass makes a smaller improvement.

The web page says that it's a hard problem, and that an error rate of 2.5% on the test set is very good. Well, the best I could do so far is about 5%. The problem with this data set is that the test set differs a good deal from the training set. So while the training set gets to a mean square error of 0.07 and an error rate of 0.5% (and keeps improving), the test set gets stuck at a mean square error of 0.2 and an error rate of about 5%. Worse yet, as the error rate on the training set keeps getting a little better, on the test set it gets a little worse. Which means that while the NN makes a close fit for the training data, it doesn't fit the test data. So I've tried to reduce the number of neurons per layer to 32, and the result got a little worse overall, and it still shows this divergence between the training and test data.

The CORNER activation performed not spectacularly but decently. Since it has a "mini-layer" on top of each normal layer, I was able to remove one layer, going to sizes 64->64->10, and it did just a little better than LEAKY_RELU in 64->64->64->10, with each pass computed faster. In fact, the best seen error rate of 5% came with CORNER at about 10K passes and then deteriorated a little to about 5.3% at 20K passes, while the best I've seen with LEAKY_RELU is 5.5%. But those small differences may also be due to randomness. I've used the same starting point in the random number generation, but with a different number of weights it plays out a bit differently. Adding an extra layer with CORNER brings the training mean square error to below 0.05 and the error rate down to below 0.3%, and even the test mean square error goes below 0.2 (which it didn't do in the other combinations), but the test error rate stays at about 5.6%; so it's improving the fit but still overfitting worse than without the extra layer.

With the 32-neuron middle layers, the test error rate gets a bit worse, around 6.6% after 20K passes of training.

I've thought that maybe one source of trouble is that with the inputs having absolute values less than 1, and with all the weights less than 1, it might be hard to produce outputs of +-1. So I've tried to set the expected outputs to +-0.1, and it really made no difference. Increasing the maximum weight limits to +-100 didn't make a whole lot of difference either (but made things slightly slower to train). Interestingly, taking a trained model with limits of +-100 and trying to continue training it with limits of +-1 produces a quite bad result that has a hard time digging out of the suboptimal point.

Applying the changes after each training case vs. at the end of the pass didn't make a noticeable difference either. I'm thinking that applying at the end of the pass should provide an obvious way to parallelize: compute the partitioned set of training cases in multiple threads, each one summing up its own copy of the gradients, and then sum the partial per-thread results together.
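
As a sketch of that idea with plain threads (the names here are hypothetical, just to show the shape of it):

#include <thread>
#include <vector>

// Sum up the gradients of a partitioned training set in parallel, then
// merge the per-thread partial sums for one weight update at end of pass.
std::vector<double> parallelGradient(size_t nWeights, size_t nCases, int nThreads,
    void (*addCaseGradient)(size_t caseIdx, std::vector<double> &sum))
{
    std::vector<std::vector<double>> partial(nThreads,
        std::vector<double>(nWeights, 0.));
    std::vector<std::thread> threads;

    for (int t = 0; t < nThreads; t++) {
        threads.emplace_back([=, &partial]() {
            // each thread handles its own slice of the training cases
            for (size_t i = t; i < nCases; i += nThreads)
                addCaseGradient(i, partial[t]);
        });
    }
    for (auto &th : threads)
        th.join();

    // merge the per-thread sums into the total gradient
    std::vector<double> total(nWeights, 0.);
    for (int t = 0; t < nThreads; t++)
        for (size_t j = 0; j < nWeights; j++)
            total[j] += partial[t][j];
    return total;
}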

Using the "floor" option that makes the sign-flipping of the weights more "structured" (on the first push a weight stops at a small non-0 value of the same sign, on the second push it switches the sign of that small value, and on the following pushes it grows with the switched sign) made the training a little slower.
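
In sketch form, my reading of that logic for a single weight (the "floor" value name is made up):

// Apply a step with the "floor" logic: instead of crossing 0 in one jump,
// the weight first stops at a small value of the same sign, then flips the
// sign of that small value, and only then grows with the switched sign.
// The equality comparisons work because the weight gets set to exactly
// +-floorVal on the stops.
double floorStep(double weight, double step, double floorVal /* small, > 0 */)
{
    double target = weight + step;
    if (weight > floorVal && target < floorVal)
        return floorVal;  // first push: stop at the small positive value
    if (weight == floorVal && target < weight)
        return -floorVal; // second push: switch the sign
    if (weight < -floorVal && target > -floorVal)
        return -floorVal; // mirror image for the negative weights
    if (weight == -floorVal && target > weight)
        return floorVal;
    return target;        // otherwise a plain step
}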

The automatic training rate estimation didn't work well at all and I don't know yet how to fix it.

What things to try next?

One obvious thing to combat overfitting would be to try some batching, and it should make the processing faster too.

Another idea I've got is that for a classifier that chooses only one category (the one with the highest value), it's more important that the right neuron gets the highest output than to hit the target values exactly. So maybe, taking a page from Adaboost's book, the training cases that produce the right highest output should have their gradients reduced (since they are "already right"), to give the training cases that are still wrong a higher relative weight for the next move.

And the next idea would be to try the convolution, which would be another step up in complexity.

P.S. I've tried the batching. It made the error rate worse, and the larger the batch size, the worse it gets. The kind-of-good news is that it worsened the training error rate more than the test error rate, so it definitely works against the overfitting, at least somewhat. The training error rate is still lower than the test one, but with a smaller gap between them.