Tuesday, March 1, 2022

developing FloatNeuralNet in Triceps

For a while I've wanted to experiment with neural networks but hadn't gotten around to it. I did some playing with them in a training course a few years ago, but there the network was a black box into which you plug the training data, not the same thing as making your own.

Doing it in Perl, as I've previously done with the Bayesian examples, would probably be slow. Finally I've realized that Triceps is not that much different from TensorFlow: both allow implementing data flow machines, just with different processing elements. So nothing really stops me from adding a neural network processing element to Triceps. It would be a good platform for experimentation, and at the same time potentially useful as production code.

My first idea was to implement the neural network in short (16-bit or even 8-bit) integers, since apparently people say that it's good enough precision. In reality it's not so simple. It might work for the computations in a pre-trained network but it doesn't work at all for training. In training the values can easily get out of the [-1, 1] range, and there is also a need for decent precision: if you update the weights by 1% of the gradient, that quickly gets beyond the precision that can be represented in 16 bits. On top of this there is the trouble with representation: what value do you take as 1? If INT16_MAX, it's not a power of 2, so after you multiply two numbers, you can't just shift the product to take its top bits as the result (and no, it's not the top 16 bits of an int32, it would be bits 15 to 30 of the int32). Before you do the shift, you have to do another multiplication, by (INT16_MAX + 1)/INT16_MAX; if you don't, the values will keep shrinking with consecutive multiplications. Or if you take 0x4000 to denote 1, the shifting becomes easier but you lose one bit of precision. All in all, it's lots of pain and no gain, so after a small amount of experimentation I've abandoned this idea at least for now (the vestiges of the code are still there but likely will be removed in the future), and went with the normal floating-point arithmetic. The integer implementation is one of those ideas that make sense until you actually try to do it.
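
To illustrate the representation trouble, here is a sketch of what a 16-bit fixed-point multiplication looks like with INT16_MAX taken as 1 (this is just an illustration of the issue, not the abandoned code itself):

#include <cstdint>

// A sketch of the representation problem, not the abandoned code itself:
// 16-bit fixed-point multiplication with INT16_MAX (32767) taken as 1.
int16_t fixedMul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b; // fits into an int32
    // Shifting by 15 divides by 32768 rather than by 32767, so without an
    // extra multiplication by (INT16_MAX + 1)/INT16_MAX the values keep
    // shrinking a little with every multiplication.
    return (int16_t)(prod >> 15);
}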

The code sits in cpp/nn. It's not connected to the Triceps row plumbing yet, it's just the bare neural net code that works with raw numbers.

The first part I wrote was the computation with existing weights, so I needed some weights for testing. I've remembered a story that I read a few years ago. Its author, a scientist, was reviewing a paper by his colleagues that contained a neural network trained to recognize the dependency between two variables. He quickly realized that had his colleagues tried the traditional methods of curve fitting, they'd have seen that they had trained their neural network to produce a logarithmic curve. So how about I intelligently design a curve? Something simpler than a logarithm, a parabola: y = x^2. It's pretty easy to do with the ReLU function producing a segmented line: just pick a few reference points and make a neuron for each point. I've had this idea for a few years, and it's been interesting to finally put it into practice.

I've picked 4 points in the range [0, 1] to produce 3 segments, with x values 0, 1/3, 2/3, 1. Then I computed the angle coefficients of the segments between these points (they came to 1/3, 1, and 5/3), and went on to deduce the weights. The network would be 2-level: the first level would have a neuron per segment, and the second level would add them up. The second level's weights are easy: they're all 1, with bias 0. For the first neuron of level 1, the weight of the single input is equal to the first angle coefficient 1/3, and the bias weight is 0. For the second neuron, we have to take the difference between the second and first angle coefficients, because they get added up on level 2, so it's 1 - 1/3 = 2/3. The bias has to be such that this segment "kicks in" only when x grows past the point's coordinate 1/3. Except that the bias offsets the weighted input, not x itself, so its magnitude is the weight times the coordinate: 2/3 * 1/3 = 2/9. And it has to be subtracted, to make the lower values negative and thus cut off by ReLU, so the bias is -2/9. For the third neuron we similarly get the weight 5/3 - 1 = 2/3 and the bias -(2/3 * 2/3) = -4/9. The final formula is:

y = ReLU(ReLU(1/3 * x) + ReLU(2/3 * x - 2/9) + ReLU(2/3 * x - 4/9))

This is not the best approximation, since it all sits above the true function, but close enough:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.033333 (true   0.010000)
  0.2 ->   0.066667 (true   0.040000)
  0.3 ->   0.100000 (true   0.090000)
  0.4 ->   0.177778 (true   0.160000)
  0.5 ->   0.277778 (true   0.250000)
  0.6 ->   0.377778 (true   0.360000)
  0.7 ->   0.500000 (true   0.490000)
  0.8 ->   0.666667 (true   0.640000)
  0.9 ->   0.833333 (true   0.810000)
  1.0 ->   1.000000 (true   1.000000)
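
For reference, here is a small standalone sketch of this hand-designed network that produces the values above (it's just an illustration in plain C++, not the FloatNeuralNet interface):

#include <algorithm>
#include <cstdio>

// Standalone sketch of the hand-designed 2-level network approximating
// y = x^2 on [0, 1]; an illustration, not the FloatNeuralNet API.
static double relu(double x) { return std::max(0., x); }

static double approxSquare(double x)
{
    // Level 1: one neuron per segment, weights from the angle coefficients.
    double n1 = relu(1./3. * x);            // segment starting at x = 0
    double n2 = relu(2./3. * x - 2./9.);    // kicks in after x = 1/3
    double n3 = relu(2./3. * x - 4./9.);    // kicks in after x = 2/3
    // Level 2: add them up, all weights 1 and bias 0.
    return relu(n1 + n2 + n3);
}

int main()
{
    for (int i = 0; i <= 10; i++) {
        double x = i / 10.;
        printf("  %.1f -> %10.6f (true %10.6f)\n", x, approxSquare(x), x * x);
    }
    return 0;
}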

I've computed the error on these values from 0 to 1 with step 0.1: the mean error is 0.0166667 and the mean squared error is 0.019341. Not that bad. Next I implemented the training by backpropagation, as described in https://www.researchgate.net/publication/266396438_A_Gentle_Introduction_to_Backpropagation. So what happens when we run the training on the intelligently designed weights? They improve. After running 100 rounds with the same values 0 to 1 with step 0.1, the results become:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.013871 (true   0.010000)
  0.2 ->   0.047362 (true   0.040000)
  0.3 ->   0.080853 (true   0.090000)
  0.4 ->   0.156077 (true   0.160000)
  0.5 ->   0.256511 (true   0.250000)
  0.6 ->   0.356945 (true   0.360000)
  0.7 ->   0.484382 (true   0.490000)
  0.8 ->   0.651865 (true   0.640000)
  0.9 ->   0.819347 (true   0.810000)
  1.0 ->   0.986830 (true   1.000000)

The mean error drops to 0.000367414 and the mean squared error to 0.00770555, a nice improvement. Running the training for 1000 rounds makes the mean error very tiny but the mean squared error doesn't drop much. What happens is that the training shifts the line segments down to where they become mostly symmetrical relative to the true curve, so the positive and negative errors cancel each other out in the mean. But there is only so much precision that you can get with 3 segments, so the mean squared error can't drop much further. In case you're interested, the formula gets shifted by training to:

ReLU(1.001182 * ReLU(0.334513 * x - 0.015636) + 1.002669 * ReLU(0.667649 * x - 0.225438)  + 1.002676 * ReLU(0.668695 * x - 0.441155) )

The changes in weights are very subtle but they reduce the errors. Which also means that at least in this particular case the 8-bit numbers would not be adequate, and even 16 bits would not be so great; they would work only in cases that have larger errors to start with.

And yes, there are weights greater than 1! They could be moved down to 1 by scaling the weights in their input neurons accordingly, but that apparently doesn't happen well automatically in the backpropagation. It looks like the pressure on the values (i.e. the gradient) in the backpropagation gets distributed between the layers, and all the layers respond to it by moving in the same direction, some getting out of the [-1, 1] range. Reading on the internets, apparently it's a known problem, and the runaway growth of values can become a real issue in large networks. The adjustment would be easy enough to do by rebalancing the weights between the layers, but that would probably double the CPU cost of training, if not more. So for the small models it's probably good enough to just ignore it. Well, except for one consequence: it means that the implementation of the ReLU function must be unbounded at the top, passing through any values greater than 0.
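
The rebalancing is possible for ReLU because ReLU(k*v) = k*ReLU(v) for any k > 0, so an oversized weight in the next layer can be folded into the neuron that feeds it (at least in the simple case where that neuron feeds only one output). A rough sketch of the idea, with hypothetical names, not something that's in the code:

#include <vector>

// A rough sketch of rebalancing: fold an out-of-range next-layer weight into
// the ReLU neuron that feeds it, relying on ReLU(k*v) == k*ReLU(v) for k > 0.
// Assumes the neuron feeds only this one output. Hypothetical names.
void rebalance(
    std::vector<double> &neuronWeights, // weights (including bias) of the feeding ReLU neuron
    double &nextWeight)                 // the next layer's weight on that neuron's output
{
    if (nextWeight > 1.) {
        for (auto &w : neuronWeights)
            w *= nextWeight; // fold the excess factor into the previous layer
        nextWeight = 1.;
    }
}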

Another interesting property of the ReLU function is that, its derivative being 0 for negative values, the values with the wrong sign won't be affected by training. For example, I've also created the negative side of the parabola with the same weights negated (and, for something completely different, connected them to a separate level 2 neuron), and training the positive side doesn't affect the negative side at all.
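
This follows directly from how the function and its derivative look in code; a minimal sketch (including the unbounded top mentioned above):

#include <algorithm>

// A minimal sketch of ReLU and its derivative. The function is unbounded at
// the top, passing through any positive value; the derivative is 0 for the
// negative inputs, so they produce no weight updates in training.
inline double reluValue(double x)
{
    return std::max(0., x);
}

inline double reluDerivative(double x)
{
    return (x > 0.) ? 1. : 0.; // the value at exactly 0 is a matter of choice
}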

Another potential issue would apply to any activation function. The computation of the gradient for the weight on a particular input includes:

gradient = input[i] * sigma_next[j]

Which means that if the input value is 0, the gradient will also be 0, and so the weight won't change. Which in particular means that a neuron that starts with all weights at 0 will produce 0 for any input, and will never get out of this state through backpropagation. That's one of the reasons to start with random weights. I've found the document https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/ that talks about various tricks of initialization, and my guess is that their trick of initializing the bias weights to a small positive value is driven by this issue with zeros. But there might be other workarounds, such as messing with the activation function used for the gradient computation to make it never produce 0. This is the beauty of having your own code rather than a third-party library: you can experiment with all the details of it. But I haven't messed with this part yet; there are more basic things to implement first.
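
An initialization along the lines of that trick could look like this (a hypothetical sketch with made-up names, not what the code currently does):

#include <cstdlib>
#include <vector>

// A hypothetical sketch of initialization that avoids the stuck-at-zero
// state: small random weights plus a small positive bias weight, along the
// lines of the trick from the article above. Made-up names, not the actual
// FloatNeuralNet code.
void initNeuron(std::vector<double> &weights) // the last element is the bias weight
{
    for (size_t i = 0; i + 1 < weights.size(); i++) {
        // Small random values, so the neurons don't start out identical
        // and the input weights don't sit at 0.
        weights[i] = 0.2 * ((double)rand() / RAND_MAX - 0.5); // in [-0.1, 0.1]
    }
    // A small positive bias keeps the ReLU initially in its active region,
    // where the derivative is non-zero and the training can proceed.
    weights.back() = 0.1;
}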
