Sergey Babkin on CEP and stuff: optimization 7 - follow-up to neural network corollaries

I was able to do some experimentation, so I have a follow-up to the last post. If you want to refer to the FloatNeuralNet code (especially in the future, when the code will have moved on), the current version is https://sourceforge.net/p/triceps/code/1733/tree/ .

First of all, it was bothering me that the indexing in the multidimensional training arrays of FloatNeuralNet was all done by hand and hard to keep track of, so I've changed it to use a more slice-like abstraction of ValueSubVector and ValueSubMatrix. And I've found a bug that was mixing up the first layer's weights pretty badly: an extra number that was supposed to be added to the value was added to the index instead, so the code read some garbage from memory (but not far enough to get out of the used memory and be detected by valgrind). Yes, a direct result of doing too much stuff manually. It's been hard to detect because it didn't always show much, and with so many repeated computations it's rather hard to keep track of whether the result is correct or not. Fixing this bug removed the really bad results. Still, the results aren't always great: the square error for my test example on random seedings fell into two groups, one at almost 0.09 and one at 0.02-something, while starting with a manually constructed good example gave an error of 0.0077. (As a reminder, my basic test example is not very typical for the world of machine learning: it tries to do an "analog neural network" that computes the square of a number between 0 and 1.)

Next I've tried the reclaiming of unused neurons by just inverting the signs of their weights, and setting to 0 the next-layer weights for the inputs that come from them. With the bug fixed, this worked very well, pulling that weight out of 0 very nicely.
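The reclamation step described above can be sketched roughly as follows. This is only an illustration under assumptions, not the FloatNeuralNet code: the `Matrix` layout, the helper names `activate()`, `isDead()` and `reclaimNeuron()`, the "outputs 0 on every training case" definition of a dead neuron, and the choice to flip the offset along with the other weights are all made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout for illustration only, not the FloatNeuralNet one.
// weights[j] holds the input weights of neuron j, with the offset as the
// last element; nextWeights[k][j] is the weight that neuron k of the next
// layer applies to the output of neuron j.
using Matrix = std::vector<std::vector<double>>;

// ReLU activation of one neuron for one input vector.
double activate(const std::vector<double> &w, const std::vector<double> &input)
{
    assert(w.size() == input.size() + 1);
    double sum = w.back(); // the offset
    for (std::size_t i = 0; i < input.size(); i++)
        sum += w[i] * input[i];
    return sum > 0. ? sum : 0.;
}

// A neuron counts as dead here if it outputs 0 on every training case.
bool isDead(const std::vector<double> &w, const Matrix &cases)
{
    for (const auto &input : cases)
        if (activate(w, input) > 0.)
            return false;
    return true;
}

// Reclaim a neuron: flip the signs of its weights (the offset included,
// which is an assumption) and zero the next-layer weights that read it.
void reclaimNeuron(Matrix &weights, Matrix &nextWeights, std::size_t neuron)
{
    assert(neuron < weights.size());
    for (double &w : weights[neuron])
        w = -w;
    for (auto &row : nextWeights) {
        assert(neuron < row.size());
        row[neuron] = 0.;
    }
}
```

Zeroing the next-layer weights means the freshly revived neuron doesn't disturb the output of the network right away; the following training steps then get to pull those weights away from 0.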

Next I've added a cumulative error and gradient computation to the main code (previously the error was computed by the test code). This introduced the methods for the start and end of the pass that initialize and summarize the stats for the pass, and methods for extracting the stats. As I expected, the errors computed during the pass were mostly larger than the true error computed at the end of the pass, because each training example in the pass sees the weights in a state that has been pulled away from its optimum by all the other training examples. But sometimes it also happens the other way around, and I don't know why.

Computing the gradients and errors in a stable state is easy to do too: just do a pass with rate=0, so that the weights don't get changed. And then I've added a method to apply the computed gradient after the pass. So the whole procedure looks like this:

nn.startTrainingPass();
for (int i = 0; i < num_cases; i++) {
  nn.train(inputs[i], outputs[i], 0.); // rate 0 only computes the gradient
}
nn.endTrainingPass();
nn.applyGradient(rate);

This actually does converge better than doing the gradient steps after each training case, most of the time. The trouble is with the other times: I've encountered an example where each step caused a neuron to become dead, and the training didn't work well at all. What happened is that the random seeding produced two dead neurons, and then after a reclamation, as far as I can tell, they ended up fighting each other: one pulling out of 0 pushed the other one into oblivion. The exact same example didn't exhibit this issue when doing the steps case-by-case. This explains why the case-by-case updates are the classic way: they introduce the jitter that makes the gradient descent more stable in weird situations. BTW, you can mix the two approaches too, and it works: just use a non-0 rate in both train() and applyGradient().
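To make the difference between the update schedules concrete, here is a toy sketch on a one-weight model y = w*x with a squared error. It is not the FloatNeuralNet implementation: the names `Case`, `caseGradient()` and `trainPass()` are invented for the example, and summing (rather than averaging) the per-case gradients over the pass is an assumption.

```cpp
#include <cassert>
#include <vector>

// One training case for the toy model y = w * x.
struct Case {
    double x;
    double y;
};

// Gradient of the squared error (w*x - y)^2 by w for one case.
double caseGradient(double w, const Case &c)
{
    return 2. * (w * c.x - c.y) * c.x;
}

// One training pass that can mix both schedules:
//  - caseRate is applied right after each case (the classic way),
//  - passRate is applied once at the end to the gradient summed over
//    the pass (summing instead of averaging is an assumption).
// With caseRate == 0 the weight stays stable throughout the pass, so
// the summed gradient is the true gradient at that point.
double trainPass(double w, const std::vector<Case> &cases,
    double caseRate, double passRate)
{
    assert(caseRate >= 0. && passRate >= 0.);
    double total = 0.;
    for (const Case &c : cases) {
        double g = caseGradient(w, c);
        total += g;
        w -= caseRate * g; // case-by-case step, a no-op at rate 0
    }
    w -= passRate * total; // whole-pass step, a no-op at rate 0
    return w;
}
```

Calling `trainPass(w, cases, 0., rate)` corresponds to the whole-pass procedure, `trainPass(w, cases, rate, 0.)` to the classic case-by-case one, and two non-0 rates give the mix.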

Well, so far so good, so I've started looking into why the random seedings produce two groups error-wise. I've even encountered a seeding that produced a sub-0.09 error but with the gradient going to 0, so it obviously thought that this was the best fit possible! It took me some time to understand what was going on, and then I realized that all the sub-0.09 seedings produced a straight-line dependency instead of an approximation of a parabola! And the 0.02-something seedings created a line consisting of either 2 segments, or 3 segments (which is as good as it gets with 3 neurons) but with one segment being very short.

As far as I can tell, the reason is that the network is designed to do a linear regression, and successfully does that, fitting a straight line as well as it can. With the ReLU activation function (which is the only one I've tried so far), the nonlinearity is introduced by the "breaking point" that cuts off the values below 0, but there is nothing at all that optimizes this breaking point. Its position is merely a side effect of the multipliers and the offset weight, and those get optimized for the position of the straight line. So this breaking point gets determined purely by the random seeding.

I guess this might not matter for many ML applications, where the straight lines are good enough and there are many more possible combinations of input signs than there are neurons to make even the straight lines. But it explains why some kinds of ML models (if I remember right, the text processing ones) work with a sigmoid activation function but not with ReLU, even though theoretically sigmoids can be generated from ReLU (this is really the point of my tiny test model: can you generate an approximation of a smooth function such as sigmoid with ReLU?). I'm not even sure that sigmoid works that well, since there the line has two (smoothed) "breaking points", which as far as I can tell are not directly optimized either but positioned as a semi-random side effect of other optimization pressures.

Maybe a way to combine the best features of ReLU and sigmoid would be to have a V-shaped activation function where the two halves of the V get split off and fed to different inputs in the next layer, and the position of the center of the V gets directly driven by its own gradient. Yes, this would increase the cost of the NN, but only linearly, not quadratically, as increasing the number of neurons per layer would. But I have no answer yet on how exactly to do this, and whether it's possible at all.
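The dependence of the breaking point on the weights can be shown for the simplest case of a neuron with a single input, as in the first layer of the square-computing test example. For relu(w*x + b) the output switches between 0 and a straight line at x = -b/w, so any training step that moves w or b to fit the line drags the breaking point along. The `Neuron` struct and the function names below are hypothetical, made up for this sketch.

```cpp
#include <cassert>
#include <vector>

// A single-input ReLU neuron computing relu(w * x + b).
// Hypothetical struct for illustration, not a FloatNeuralNet type.
struct Neuron {
    double w; // the multiplier
    double b; // the offset
};

// The input value where w * x + b crosses 0, i.e. where the neuron
// switches between the cut-off part and the straight-line part.
// It is not a parameter of its own, only a function of w and b.
double breakingPoint(const Neuron &n)
{
    assert(n.w != 0.); // with w == 0 the output is constant, no break
    return -n.b / n.w;
}

// Count the neurons whose breaking point falls strictly inside the
// interval (lo, hi); only those add a bend to the curve there.
int countBreaksInside(const std::vector<Neuron> &layer, double lo, double hi)
{
    int count = 0;
    for (const Neuron &n : layer) {
        if (n.w == 0.)
            continue;
        double bp = breakingPoint(n);
        if (bp > lo && bp < hi)
            count++;
    }
    return count;
}
```

A seeding where `countBreaksInside(layer, 0., 1.)` comes out as 0 leaves the network with only one straight segment over the inputs between 0 and 1, unless training happens to move a breaking point into the interval.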

Monday, October 17, 2022
