Sunday, March 6, 2022

first attempts at training FloatNeuralNet

After experimenting with the "intelligent design", I've tried training the same network from a random state to do the same thing: compute the square function. Yes, it's not a typical usage, but I think it makes for an impressive example: automatically approximating an analog function.

And it worked. Not every time, depending on the initial random state, but when it worked, it worked quite decently. It usually worked better with more neurons. My guess is that with more neurons there is a higher chance that some of them would form a good state. Some of them end up kind of dead, but if there were more to start with, it's more likely that a useful number of them will still be good. The ReLU function I've used has this nasty habit of having a zero derivative in the lower half, and if a neuron gets there on all inputs, it becomes dead.

Which brought the question: how do we do a better initialization? Yes, I know that bad training from a bad random initialization is normal, but I wonder, could it be done better? I've found the article https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/. Initialization-wise it boils down to initializing the weights between the neurons in the range +-1/sqrt(num_of_inputs), and initializing the bias to 0.1. There is also an idea of giving the ReLU derivative a small non-zero value on the negative side, but I don't want to touch that yet. The square root probably comes from a variance argument: a neuron sums num_of_inputs weighted inputs, and scaling the weights by 1/sqrt(num_of_inputs) keeps the variance of that sum from growing with the number of inputs. The bias of 0.1 "pulls" the value up, so that even if the weights are mostly or all negative, there is still a chance of the neuron producing a positive output that would pass through ReLU. This did improve things, but only slightly, and I would still pretty regularly get a bad training.
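To make it concrete, here is a minimal sketch of that initialization rule in C++ (the function and variable names are mine for illustration, not the actual FloatNeuralNet code):

#include <cmath>
#include <random>
#include <vector>

// The article's recommendation: weights uniform in
// [-1/sqrt(n_inputs), +1/sqrt(n_inputs)], bias weights fixed at 0.1.
void initLayer(std::vector<double> &weights, std::vector<double> &biases,
    size_t n_inputs, size_t n_neurons, std::mt19937 &rng)
{
    double limit = 1. / std::sqrt((double)n_inputs);
    std::uniform_real_distribution<double> dist(-limit, limit);
    weights.resize(n_inputs * n_neurons);
    for (auto &w : weights)
        w = dist(rng);
    biases.assign(n_neurons, 0.1);
}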

The next thing I've noticed was that things tend to go better if I do the first few rounds of training (where presenting the whole set of input data once is a round) at a higher rate. The recommendation from the article on backpropagation was to use the training rate of 1%. I've found that if I do the first few rounds with the rate of 10%, I get fewer bad trainings. I've eventually settled on the first 50 rounds out of 1000 using the high rate.
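In outline, the schedule looks like this (trainOneRound() is a hypothetical stand-in for one pass over the whole training set, not a real entry point):

const int n_rounds = 1000;
const int n_high = 50; // the first rounds run at the higher rate

for (int round = 0; round < n_rounds; round++) {
    // 10% at the start helps the neurons diverge, then back to the usual 1%
    double rate = (round < n_high) ? 0.1 : 0.01;
    trainOneRound(trainingSet, rate);
}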

BTW, I've been worrying that when a weight gets initialized with a certain sign, it would always stay with the same sign, because of the ReLU having a zero derivative in the lower part. The experiment has shown that this is not the case: the weights do change signs, and not that rarely either. I still need to look at the formulas to understand what I've missed in them.

What next? It would be good to reclaim the dead neurons and put them to good use. Suppose we do a round of training with the whole training set. If a neuron never produces a positive value on the whole training set, it's effectively dead, and can be reinitialized by randomization. So I've done this. It does add some overhead, but most of it happens infrequently, at most once per round; the only overhead on the normal training path is incrementing the usage counters. Observing the effects, the weird part is that it's not like "once you get all the neurons into a live state, they stay alive". Yes, the dead neurons tend to be found near the start of training, but not necessarily on consecutive rounds, and once in a while they pop up out of nowhere on a round like 300 or 500. I've added the reclamation of dead neurons after each round, and this made a big improvement. I was even able to shrink the network to the same size as I used in the "intelligent design" version and get good results most of the time, normally better than from my handmade weights (although not as good as handmade + training). But not all the time.
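The reclamation itself boils down to a check after each round; a sketch with made-up names (activationCount would be the usage counters incremented during the normal pass whenever a neuron produces a positive output):

// At the end of each round: a neuron that never produced a positive
// value on the whole training set is dead, reinitialize it randomly.
for (size_t layer = 0; layer < n_layers; layer++) {
    for (size_t n = 0; n < layerSize(layer); n++) {
        if (activationCount[layer][n] == 0)
            reinitNeuron(layer, n); // same randomization as the initial one
        activationCount[layer][n] = 0; // reset for the next round
    }
}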

Looking at the weights in the unsuccessful cases, the problem seems to be that all the neurons in this small model get initialized too similarly to each other, and then they get pushed by the training in the same direction. This also seems to be the reason why the larger training rate on the early rounds made an improvement: it helps the neurons diverge. What could I do to make the neurons more different? Well, they all got the fixed bias weights of 0.1. How about changing that, since I've now got another way of reclaiming the dead neurons? I've made the bias weights random, and this reduced the frequency of the bad cases but didn't eliminate them.

Another thing commonly seen in the bad trainings is that the weights in level 1 tend to be pushed to a low absolute value. Essentially, the NN tries to approximate the result with almost a constant, and this approximation is not a very good one. This drives the weights in the second layer up, to absolute values greater than 1, to compensate for that constness. What if I don't let the weights get out of the [-1, 1] range, will this push the lower layer to compensate there? It did, the weights there became not so small, but overall the problem of the weights being driven towards 0 is not solved yet in my example. In an article I've read in IEEE Spectrum, they re-train the networks to new input data by reinitializing the neurons with low absolute weights. So maybe the same thing can be done in the normal training.
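The clamping itself is cheap, one extra operation after the usual gradient step (a sketch of what I mean; std::clamp comes from C++17's <algorithm>):

#include <algorithm>

// The usual gradient step, then keep the weight inside [-1, 1],
// pushing the pressure to compensate down into the lower layer.
weight -= rate * gradient;
weight = std::clamp(weight, -1., 1.);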

P.S. Forgot to mention, I've also changed the initialization of weights to be not too close to 0: with the upper limit on the absolute value set by the formula with the square root, I've also added a lower limit on the absolute value at 0.1 of that.
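In terms of the initialization sketch above, the change amounts to drawing the magnitude and the sign separately (again my own illustration, not the exact code):

double limit = 1. / std::sqrt((double)n_inputs);
// magnitude uniform in [0.1 * limit, limit], keeping it away from 0
std::uniform_real_distribution<double> mag(0.1 * limit, limit);
std::bernoulli_distribution sign(0.5);
for (auto &w : weights)
    w = sign(rng) ? mag(rng) : -mag(rng);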

Tuesday, March 1, 2022

developing FloatNeuralNet in Triceps

For a while I've wanted to experiment with the neural networks but haven't got around to it. I did some playing with them in a training course a few years ago, but there it was a black box into which you plug the training data, not the same thing as making your own.

Doing it in Perl, as I've previously done with the Bayesian examples, would probably be slow. Finally I've realized that Triceps is not that much different from TensorFlow: both allow implementing data flow machines, just with different processing elements. So nothing really stops me from adding a neural network processing element to Triceps. And it would be a good platform for experimentation, and at the same time potentially useful as production code.

My first idea was to implement the neural network in short (16-bit or even 8-bit) integers, since apparently people say that this precision is good enough. In reality it's not so simple. It might work for the computations in a pre-trained network, but it doesn't work at all for training. In training, the values can easily get out of the [-1, 1] range, and there is also a need for decent precision: if you update the weights by 1% of the gradient, that quickly gets beyond the precision that can be represented in 16 bits.

On top of this there is the trouble with representation: what value do you take as 1? If INT16_MAX, it's not a power of 2, so after you multiply two numbers, you can't just shift the product to take its top bits as the result (and no, it's not the top 16 bits of an int32, it would be bits 15 to 30 of the int32). Before you do the shift, you have to do another multiplication, by (INT16_MAX + 1)/INT16_MAX; if you don't, the values will keep shrinking with consecutive multiplications. Or if you take 0x4000 to denote 1, the shifting becomes easier but you lose one bit of precision. All in all, it's lots of pain and no gain, so after a small amount of experimentation I've abandoned this idea, at least for now (the vestiges of the code are still there but will likely be removed in the future), and went with the normal floating-point arithmetic. The integer implementation is one of those ideas that make sense until you actually try to do it.
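To illustrate the shrinkage, here is a small standalone demonstration of the multiplication with INT16_MAX taken as 1 (my own example, not code from the repository):

#include <stdint.h>
#include <stdio.h>

// Multiplication with INT16_MAX (32767) representing 1.0: the inputs are
// scaled by 32767 but the shift divides by 32768, so the result shrinks.
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> 15);
}

int main()
{
    int16_t one = INT16_MAX; // 32767 as "1.0"
    int16_t v = one;
    for (int i = 0; i < 5; i++) {
        v = q15_mul(v, one); // "1.0 * 1.0" drifts down: 32766, 32765, ...
        printf("%d\n", v);
    }
    return 0;
}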

The code sits in cpp/nn. It's not connected to the Triceps row plumbing yet, just a bare neural net code that works with raw numbers.

The first part I wrote was the computation with existing weights, so I've needed some weights for testing. I've remembered a story that I've read a few years ago. Its author, a scientist, was reviewing a paper by his colleagues that contained a neural network trained to recognize the dependency between two variables. He quickly realized that had his colleagues tried the traditional methods of curve fitting, they'd have seen that they've trained their neural network to produce a logarithmic curve. So how about I intelligently design a curve? Something simpler than a logarithm, a parabola: y = x². It's pretty easy to do with the ReLU function by producing a segmented line: just pick a few reference points and make a neuron for each point. I've had this idea for a few years, but it's been interesting to finally put it into practice.

I've picked 4 points in the range [0, 1] to produce 3 segments, with the x values 0, 1/3, 2/3, 1. Then I computed the angle coefficients of the segments between these points (they came out to 1/3, 1, and 5/3), and went on to deduce the weights. The network would be 2-level: the first level would have a neuron per segment, and the second level would add them up. The second level's weights are easy: they're all 1, with bias 0. For the first neuron of level 1, the weight of the single input is equal to the first angle coefficient 1/3, and the bias weight is 0. For the second neuron, we have to take the difference between the second and first angle coefficients, because the neurons get added up on level 2, so the weight is 1 - 1/3 = 2/3. And the bias has to be such that this segment "kicks in" only when the value of x grows over the point's coordinate 1/3. Except that the bias gets applied after the input is multiplied by the weight, so it has to be expressed in the weighted scale: bias = weight * x = 2/3 * 1/3 = 2/9. And we have to subtract it, to make the lower values negative and thus cut off by ReLU, so bias = -2/9. For the third neuron we similarly get the weight 5/3 - 1 = 2/3 and bias = -(2/3 * 2/3) = -4/9. The final formula is:

y = ReLU(ReLU(1/3 * x) + ReLU(2/3 * x - 2/9) + ReLU(2/3 * x - 4/9))
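For reference, here is a tiny standalone program that evaluates this formula and prints the comparison table shown below (it's independent of the FloatNeuralNet code):

#include <stdio.h>

static double relu(double x) { return x < 0. ? 0. : x; }

int main()
{
    for (int i = 0; i <= 10; i++) {
        double x = i / 10.;
        double y = relu(relu(1./3. * x)
            + relu(2./3. * x - 2./9.)
            + relu(2./3. * x - 4./9.));
        printf("  %3.1f -> %10.6f (true %10.6f)\n", x, y, x * x);
    }
    return 0;
}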

This is not the best approximation, since it all sits above the true function, but close enough:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.033333 (true   0.010000)
  0.2 ->   0.066667 (true   0.040000)
  0.3 ->   0.100000 (true   0.090000)
  0.4 ->   0.177778 (true   0.160000)
  0.5 ->   0.277778 (true   0.250000)
  0.6 ->   0.377778 (true   0.360000)
  0.7 ->   0.500000 (true   0.490000)
  0.8 ->   0.666667 (true   0.640000)
  0.9 ->   0.833333 (true   0.810000)
  1.0 ->   1.000000 (true   1.000000)

I've computed the error on these values from 0 to 1 with step 0.1, and the mean error is 0.0166667, the root mean squared error 0.019341. Not that bad.

Next I implemented the training by backpropagation, as described in https://www.researchgate.net/publication/266396438_A_Gentle_Introduction_to_Backpropagation. So what happens when we run the training over the intelligently designed weights? They improve. After running 100 rounds with the same values 0 to 1 with step 0.1, the results become:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.013871 (true   0.010000)
  0.2 ->   0.047362 (true   0.040000)
  0.3 ->   0.080853 (true   0.090000)
  0.4 ->   0.156077 (true   0.160000)
  0.5 ->   0.256511 (true   0.250000)
  0.6 ->   0.356945 (true   0.360000)
  0.7 ->   0.484382 (true   0.490000)
  0.8 ->   0.651865 (true   0.640000)
  0.9 ->   0.819347 (true   0.810000)
  1.0 ->   0.986830 (true   1.000000)

The mean error drops to 0.000367414, the root mean squared error to 0.00770555, a nice improvement. Running the training for 1000 rounds makes the mean error very tiny, but the root mean squared error doesn't drop much. What happens is that the training has shifted the line segments down to where they become mostly symmetrical relative to the true curve, so the mean positive and negative errors cancel each other out. But there is only so much precision that you can get with 3 segments, so the root mean squared error can't drop much more. In case you're interested, the formula gets shifted by training to:

ReLU(1.001182 * ReLU(0.334513 * x - 0.015636) + 1.002669 * ReLU(0.667649 * x - 0.225438) + 1.002676 * ReLU(0.668695 * x - 0.441155))

The changes in the weights are very subtle but they reduce the errors. Which also means that, at least in this particular case, 8-bit numbers would not be adequate, and even 16 bits would not be so great; they would work only in the cases that have larger errors to start with.

And yes, there are weights greater than 1! They could be brought back under 1 by scaling down all the weights in their input neurons, but apparently this doesn't happen very well automatically in the backpropagation. It looks like the pressure on the values (i.e. the gradient) in the backpropagation gets distributed between the layers, and all the layers respond to it by moving in the same direction, some getting out of the [-1, 1] range. Reading on the internets, apparently it's a known problem, and the runaway growth of values can become a real issue in the large networks. The adjustment would be easy enough by rebalancing the weights between the layers, but that would probably double the CPU cost of training, if not more. So for the small models, it's probably good enough to just ignore it. Well, except for one consequence: it means that the implementation of the ReLU function must be unbounded at the top, passing through any value greater than 0.

Another interesting property of the ReLU function is that its derivative being 0 for negative values means that the values with the wrong sign won't be affected by training. For example, I've also created the negative side of the parabola with the same weights negated (and, for something completely different, connected them to a separate level 2 neuron), and training the positive side doesn't affect the negative side at all.

Another potential issue would apply to any activation function. The computation of the gradient for the weight of a particular input includes:

gradient = input[i] * sigma_next[j]

Which means that if the input value is 0, the gradient will also be 0, and so the weight won't change. Which in particular means that a neuron that starts with all weights at 0 will produce 0 on any input, and won't ever get out of this state through backpropagation. That's one of the reasons to start with random weights. I've found the document https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/ that talks about various tricks of initialization, and my guess is that their trick of initializing the bias weights to a small positive value is driven by this issue with zeros. But there might be other workarounds, such as messing with the activation function's derivative in the gradient computation to make it never produce 0. This is the beauty of having your own code rather than a third-party library: you can experiment with all the details of it. But I haven't messed with this part yet, there are more basic things to implement first.
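For concreteness, here is how that gradient plays out in the weight update of one layer (a sketch with made-up names, following the formula above):

// sigma_next[j] is the backpropagated error of neuron j in the next
// layer, already multiplied by the derivative of its activation.
// When input[i] is 0, the weight (i, j) gets a zero gradient and stays put.
for (size_t i = 0; i < n_inputs; i++) {
    for (size_t j = 0; j < n_neurons; j++) {
        double gradient = input[i] * sigma_next[j];
        weight[i][j] -= rate * gradient;
    }
}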