Sunday, December 25, 2022

optimization 12 - adaptable methods of neural network training

In the last post I was wondering why the momentum descent methods aren't widely used for neural network training. And guess what, a quick search has shown that they are, as are the auto-scaling methods.

In particular, here is an interesting paper:

https://arxiv.org/pdf/1412.6980.pdf

Adam: A Method for Stochastic Optimization

What they do actually looks more like a momentum method than a way of selecting the rate of descent. This paper has references to a few other methods:

  • AdaGrad
  • RMSProp
  • vSGD
  • AdaDelta

And I've found this algorithm through another web site that talks about various ML methods:

https://machinelearningmastery.com/adam-optimization-from-scratch/

Thursday, December 22, 2022

optimization 11 - FISTA momentum descent in FloatNeuralNet

Continuing from the previous post, I've had a look at the characters in the test set that have trouble getting decoded. Well, they truly are handwritten, I've had difficulties recognizing some of them as digits at all. But a good deal of them are rather similar variations on the digit "6" that I guess are just not represented well in the training set. For many of these digits even the highest recognized class has a value deep in the negative zone, which means that they're not like anything in the training set. I think there should be some way to adapt the training set, such as expanding it by including the digits shifted left and right by a pixel or two. Need some more thinking on that.
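
To make the idea a bit more concrete, a shift by one pixel could look like this (just a sketch of a possible augmentation, not something in the code yet; the function and the layout are made up for illustration):

#include <vector>

// Produce a copy of an image shifted right by one pixel, filling the vacated
// column with the background value. The image is assumed to be stored
// row-by-row in a flat vector (e.g. 8x8 or 16x16).
std::vector<float> shiftRight(const std::vector<float> &img, int rows, int cols, float background)
{
    std::vector<float> out(img.size(), background);
    for (int r = 0; r < rows; r++) {
        for (int c = 1; c < cols; c++) {
            out[r * cols + c] = img[r * cols + (c - 1)];
        }
    }
    return out;
}

A shift left, up, or down would be symmetrical, and each shifted copy would get added to the training set with the same expected output.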

I've looked at the training set and counted how many training cases have their best guess negative (even if it's the correct one), and it came out at about 16.5%. I've tried to shift the training weights to prioritize the cases with the low best guesses, just as I've done with the wrong guesses, and it didn't help at all; the best I could scrape out was about an extra 0.2%. Maybe it's just because for every 1 positive output there are 9 negative outputs, and on average the values tend to be pulled down.

In the meantime, I've added the momentum descent, based on the FISTA algorithm. I've learned some tricks playing with my version of FISTA that I plan to add later, but so far just a very basic version, with only one trick included: when applying the momentum, compute the saturation by the limits, and adjust the momentum for it. So that if some dimension hits the limit, the momentum won't continue pushing it past the limit again and again but will reset on that dimension. 

I've been trying to get the automatic descent rate deduction working properly before implementing the momentum, but I gave up on that for now and decided to just try the momentum. It is a bit sensitive to the training rate being too high, maybe because there is a higher chance of driving into a situation where a high rate would give an overshoot. But bump the rate down, and the momentum descent works very, very, very well, making much faster progress. The mean square error for the training set went down from the previous best of 0.025 to 0.01. The bad news is that it didn't help with the test set at all: it again gets better up to a limit, and after that, as the training set's stats get better, the test set's stats get worse. So I think I really need to accept that the test set truly contains characters that are not well represented in the training set. And probably the convolution won't fix it either.

Given that the momentum descent works so awesomely, I don't understand why everyone isn't using it. It adds cost per pass, but it reduces the number of passes needed to achieve the same precision by much more, giving a large net win. Or maybe they do use it and don't tell everyone. Or maybe it doesn't work well for all the neural networks and I've just got lucky with my example. I don't know.

The way FISTA works is basically by remembering the last step and repeating it before computing the gradient at the new position and doing a new step. Then remembering the total step (repeat + new step) for the next step. It also has a little optimization in scaling the remembered last step by a factor a little under 1, and very gradually reducing this factor. This downscaling is intended to reduce the overshoots and "circling the drain" when the position gets close to the optimum. It also limits the maximum momentum, when the downscaling starts shrinking the momentum more than the current gradient accelerates it. Another little thing it does is that it skips the repeat not only on the first step (where it just has nothing to repeat) but also on the second step. For all I can tell, the reason for this skipping of momentum on the second step is that the initial point is usually random, and if the gradient approximation is good, it would put the point after the first step into a decent vicinity of the optimum. Which means that the first step will be large, and also stepping in a random direction that is nowhere close to the direction of the following steps. Thus it's better to not include this first large random step into the momentum that will keep repeating.
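
To illustrate, here is a minimal sketch of such a momentum step for a single weight; it's not the actual FloatNeuralNet code, the names are made up, and for simplicity it takes the gradient as already computed (in the real algorithm the gradient is computed after the remembered step gets repeated):

#include <algorithm>

// One FISTA-style momentum step for a single weight. 'momentum' carries the
// remembered last total step, 'decay' is the factor a little under 1 that
// gradually shrinks the remembered step.
double momentumStep(double weight, double gradient, double rate,
                    double &momentum, double decay,
                    double wmin, double wmax)
{
    double target = weight + decay * momentum; // repeat the remembered step
    target -= rate * gradient;                 // then do the new step

    // Saturate by the weight limits.
    double clamped = std::max(wmin, std::min(wmax, target));

    // Remember the total step actually taken: if a limit was hit, the momentum
    // resets on that dimension instead of pushing past the limit again and again.
    momentum = clamped - weight;
    return clamped;
}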

 The mechanics of enabling the momentum mode in FloatNeuralNet are: set the option

options.momentum_ = true 

And now there also are methods for reading the best-result-still-below-0 stats:

getPassAbove() returns the number of cases in the last pass where the correct and best outcome is above 0

getPassNotAbove() returns the number of cases in the last pass where the best outcome is either incorrect or not above 0
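
For example, the fraction of cases whose best result is still not above 0 can be computed after a pass like this (a usage sketch, assuming nn is the FloatNeuralNet instance):

size_t above = nn.getPassAbove();
size_t notAbove = nn.getPassNotAbove();
double lowFraction = (double)notAbove / (double)(above + notAbove);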

Tuesday, December 20, 2022

FloatNeuralNet classifier training mode

Continuing the story from the previous post, I've been experimenting with the MNIST classifier of handwritten digits from ZIP codes some more, and I've tried to improve on it by adding a special mode for gradient computation:

The point of a classifier is that it produces a result with the highest value at the output that corresponds to the correct classification, and the exact values on the outputs don't matter that much. So when training, it makes sense to check whether a particular training case already produces the correct result, and if so then "dampen" it to give the training cases that are still producing incorrect results a greater chance to shift the weights for their benefit. I've done this "dampening" by multiplying the gradients of the correct cases by a value less than 1. I've started with this value set at 1/1000 but that tended to produce rather wild swings where a small number of incorrect cases would pull the weights towards themselves only to make a large number of formerly correct cases incorrect. I've reduced it to 1/100 and that worked out quite well. I'm not sure if the selection of this multiplier depends on the number of training cases; FWIW, the MNIST set has a little over 7000 training cases.

And to make sure that the training doesn't get too slow, I've added auto-rescaling of the training rate, so that the potential sum of the gradients totals up to the same value as before the dampening. This made it a bit more touchy, since there is a greater chance of the gradient getting dominated by only one training case, and a smaller base rate (I've reduced it by a factor of 3) was needed to avoid the meltdown into overshoots. Maybe I'll embed this safety factor into the auto-rescaling at some point, I'm just not sure yet how to pick it for the general case.
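
A sketch of both parts of this logic (an illustration of the idea, not the actual FloatNeuralNet code; the names are made up):

#include <vector>

// Dampen the gradient contribution of a training case that is already
// classified correctly.
void addCaseGradient(std::vector<double> &gradientSum,
                     const std::vector<double> &caseGradient,
                     bool isCorrect, double correctMultiplier /* e.g. 0.01 */)
{
    double w = isCorrect ? correctMultiplier : 1.;
    for (size_t i = 0; i < gradientSum.size(); i++)
        gradientSum[i] += w * caseGradient[i];
}

// Rescale the training rate so that the potential sum of the gradients
// totals up about the same as without the dampening.
double rescaledRate(double baseRate, size_t nCorrect, size_t nIncorrect,
                    double correctMultiplier)
{
    double total = (double)(nCorrect + nIncorrect);
    double weighted = (double)nCorrect * correctMultiplier + (double)nIncorrect;
    // The base rate itself still has to be picked smaller (by about a factor
    // of 3 in my experiments) to avoid the meltdown into overshoots.
    return baseRate * total / weighted;
}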

The end result is a bit mixed. It worked very, very well for the training set, bringing the error rate (i.e. the number of cases that produce a wrong value) to zero. And the mean square error came out the same or slightly better than with the plain implementation, meaning that the previously neglected cases not only get attention but also get incorporated better into the common model and don't hurt the others. The mean square error on the test set also went slightly better, the best I've seen so far at 0.193 (an improvement of almost 0.01 over the plain version). But the error rate on the test set, which I hoped to improve, still stayed at around 5.5%, not really an improvement there. I've seen it go down to 5% and then grow again on the following rounds of training, while the mean square error became better.

So for all I can tell, the test set really does contain some characters that are substantially unlike any of the characters in the training set. And probably the only way to do better is to actually look at them, figure out what is different about them, and then manipulate the training set to introduce variability by generating multiple slightly different versions of each character. I've kind of gotten interested in this puzzle, so maybe I'll find time to try this. Or maybe I'll try to implement the convolution first, but by now I doubt that it would help much with this specific problem on its own. Maybe a general way to approach this variability would be to do some unsupervised classification of the images of the same digit, to produce the groups of distinct ways to write it. But that looks difficult.

Another simple thing to try would be to make the classifier preference even stronger, requiring that the correct output not only be the highest but also beat the next highest output by a certain margin. And maybe even vary the gradient weight in inverse proportion to this margin. But I doubt that it would solve the problem with the stubborn 5.5% of the test set.

Yet another interesting development is that the CORNER activation seems to work quite well. I've added a third layer with it, and it produced better results. My previous impression that the deeper networks train slower was based on a limited number of training passes: the larger models are slower there, but as the pass count goes into multiple thousands, more neurons do help to squeeze out better quality. Of course, each pass is slower with more neurons, but as far as the progress per pass is concerned, the larger networks do better on the later rounds.

Where the new feature is in the code: 

https://sourceforge.net/p/triceps/code/1779/tree/

Setting the FloatNeuralNet option isClassifier_ to true enables this new classifier training mode. The option correctMultiplier_ defines the penalty gradient multiplier for the cases that are already correct (the default of 0.001 there is a bit too much of a penalty, 0.01 looks better). And since the correct and incorrect cases are counted for re-scaling the training rate appropriately, these counters are now also available externally via methods:

size_t getPassCorrect() const;
size_t getPassIncorrect() const;

This makes it easy to compute the error rate from the results of a training pass. They get computed only if the classifier training mode is enabled, and they show the error rate as it was before the last training pass; it will change after applying the gradient. The classifier mode can be used both with the piecemeal steps of applying the training rate in the train() method and with applyGradient(), but only applyGradient() is able to do the automatic scaling of the training rate to compensate for the shrinkage.
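
For example (assuming the classifier mode is enabled and at least one pass has been run):

size_t correct = nn.getPassCorrect();
size_t incorrect = nn.getPassIncorrect();
double errorRate = (double)incorrect / (double)(correct + incorrect);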

 
 

Sunday, December 18, 2022

FloatNeuralNet on the MNIST set, and checkpointing

Since I've been experimenting with neural nets on an "unnatural" example, I've been wondering how the things I've tried would work on a more natural one. So I've made a demo for the MNIST example of recognizing the handwritten digits from ZIP codes. It's implemented in the branch https://sourceforge.net/p/triceps/code/1776/tree/ , in the file nn/demo/d_FnnZip.cpp.

The code looks for the MNIST files that can be downloaded from https://hastie.su.domains/StatLearnSparsity_files/DATA/zipcode.html placed in any ancestor directory's subdirectory "zipnn" (so I usually put it next to Triceps's top directory). 

The original files are scans with 16x16 dots per digit, but that turned out to be rather slow to process, so I've made the demo downscale them to 8x8 first.

Another thing that was rather disappointing was that I didn't save the state of a trained NN before. It's OK for small and quick training but with a slower and longer one, the results really need to be preserved. So I've added the FloatNeuralNet methods checkpoint() to dump the state to a file and uncheckpoint() to get it back. It can also be used creatively to do some training with some settings and then load it into an instance with different settings to see if something would improve.
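
The intended use is along these lines (a sketch; the file name is made up, and the exact signatures and error handling are not shown here):

// Save the state of the trained network, then restore it later, possibly
// into an instance created with different settings.
nn.checkpoint("zipnn_trained.ckpt");
// ... later, or in another run ...
nn2.uncheckpoint("zipnn_trained.ckpt");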

With 8x8, it's 64 inputs, and 10 outputs for 10 recognized digits, and as a baseline I've used 2 inner layers of 64 neurons each: 64->64->64->10. But I've experimented with more and fewer neurons, and more layers.

So, what are the observed results?

First, the ReLU variations work quite decently, probably because the original data is already centered in the [-1, 1] range.

The larger models are slower to train, not just in execution time per pass, but also in that each pass makes a smaller improvement.

The web page says that it's a hard problem and that an error rate of 2.5% on the test set is very good. Well, the best I could do so far is about 5%. The problem with this data set is that the test set differs a good deal from the training set. So when the training set gets to a mean square error of 0.07 and an error rate of 0.5% (and keeps improving), the test set gets stuck at a mean square error of 0.2 and an error rate of about 5%. Worse yet, as the error rate on the training set keeps getting a little better, on the test set it gets a little worse. Which means that while the NN makes a close fit for the training data, it doesn't fit the test data. So I've tried to reduce the number of neurons per layer to 32, and the result got a little worse overall, and it still shows this divergence between the training and test data.

The CORNER activation showed not spectacularly but decently. Since it has a "mini-layer" on top of each normal layer, I was able to remove one layer, going to sizes 64->64->10, and it did just a little better than LEAKY_RELU in 64->64->64->10, and each pass got computed faster. In fact, the best seen error rate of 5% came with CORNER at about 10K passes and then deteriorated a little to about 5.3% at 20K passes, while the best I've seen with LEAKY_RELU is 5.5%. But those small differences may also be due to randomness. I've used the same starting point in the random number generation, but with a different number of weights it plays out a bit differently. Adding an extra layer with CORNER brings the training mean square error to below 0.05 and the error rate down to below 0.3%, and even the test mean square error goes below 0.2 (which it didn't do in the other combinations), but the test error rate stays at about 5.6%, so it's improving the fit but still overfitting worse than without the extra layer.

With the 32-neuron middle layers, the test error rate gets a bit worse, around 6.6% after 20K passes of training.

I've thought that maybe one of the sources of trouble is that with the inputs having absolute values less than 1, and with all the weights less than 1, it might be hard to produce the outputs of +-1. So I've tried to set the expected outputs to +-0.1, and it really made no difference. Increasing the maximum weight limits to +-100 didn't make a whole lot of a difference either (but made things slightly slower to train). Interestingly, taking a trained model with limits of +-100 and trying to continue training it with limits of +-1 produces a quite bad result that has a hard time digging out of the suboptimal point.

Applying the changes after each training case vs at the end of the pass didn't make a noticeable difference either. I'm thinking that applying at the end of the pass should provide an obvious way for parallelization: compute the partitioned set of training cases in multiple threads, each one summing up its own copy of the gradients, and then sum the partial per-thread results together.
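
A minimal sketch of that parallelization idea with plain std::thread (not something FloatNeuralNet does yet; computeCaseGradient here is a stand-in for the per-case backpropagation):

#include <functional>
#include <thread>
#include <vector>

// Each thread sums the gradients of its share of the training cases into its
// own copy, and the per-thread partial sums get added together at the end.
std::vector<double> sumGradientsParallel(
    int nthreads, int ncases, size_t nweights,
    const std::function<std::vector<double>(int)> &computeCaseGradient)
{
    std::vector<std::vector<double>> partial(nthreads, std::vector<double>(nweights, 0.));
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; t++) {
        threads.emplace_back([t, nthreads, ncases, &partial, &computeCaseGradient]() {
            for (int i = t; i < ncases; i += nthreads) {
                std::vector<double> g = computeCaseGradient(i);
                for (size_t j = 0; j < g.size(); j++)
                    partial[t][j] += g[j];
            }
        });
    }
    for (auto &th : threads)
        th.join();
    for (int t = 1; t < nthreads; t++)
        for (size_t j = 0; j < nweights; j++)
            partial[0][j] += partial[t][j];
    return partial[0]; // apply this once, as with applyGradient()
}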

Using the "floor" option that makes the sign-flipping of the weights more "structured" (on the first push it stops at the same sign but a small non-0 value, on the second push it switches the sign of that small value, and on the following pushes it would grow with the switched sign) made the training a little slower.

The automatic training rate estimation didn't work well at all and I don't know yet how to fix it.

What things to try next?

One obvious thing to combat overfitting would be to try some batching, and it should make the processing faster too.

Another idea I've got is that for a classifier that chooses only one category (the one with highest value), it's more important that the right neuron gets the highest output than to hit the target values exactly. So maybe taking a page from Adaboost's book, the training cases that produce the right highest output should have their gradients reduced (since they are "already right"), to give the training cases that are still wrong a higher relative weight for the next move.

And the next idea yet would be to try the convolution, which would be another step in adding complexity.

P.S. I've tried the batching. It made the error rate worse, and the larger the batch size, the worse it gets. A kind of good news is that it worsened the training error rate more than the test error rate, so it definitely works against the overfitting, at least kind of and somewhat. Well, the training error rate is still lower than the test one, but with a smaller gap between them.

Sunday, November 27, 2022

optimization 10 - Lipschitz constant and the last layer of neural networks

After the empirical computation of the Lipschitz constant for the lower layers of a neural network didn't pan out, as described in the last post, I've tried doing it for the last layer. The last layer minimizes an honest squared function, so it should not have the same issues as the previous layers. And I've done it in code, see https://sourceforge.net/p/triceps/code/1770/tree/

It computes 1/L empirically like the TFOCS library: keeps the weights and gradients from the last pass (including, by the way, the inputs to the last layer, since they are variables in the formula too), finds the differences, computes the norms and eventually 1/L, and compares it with the descent rate used for the pass. If there is room, it gradually grows the rate. If 1/L has shrunk and the last rate was too high, it reduces the rate and redoes the previous step (this is somewhat tricky, as the gradient is computed only in the following step, but I've got the rollback to work). Most of the time one rollback is enough to set things straight. But if not, well, it's just one overshoot that would get straightened out in the following steps, so I've limited the rollback to be done no more than once in any case.
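
The estimation itself boils down to something like this (a simplified sketch of the idea; the real code also has to keep the history and handle the rollback):

#include <cmath>
#include <vector>

// Estimate the safe descent rate 1/L from two consecutive passes,
// TFOCS-style: L is taken as the ratio of the change in gradients to the
// change in weights (both as norm 2).
double estimateSafeRate(const std::vector<double> &weightsPrev,
                        const std::vector<double> &weightsCur,
                        const std::vector<double> &gradPrev,
                        const std::vector<double> &gradCur)
{
    double dw2 = 0., dg2 = 0.;
    for (size_t i = 0; i < weightsCur.size(); i++) {
        double dw = weightsCur[i] - weightsPrev[i];
        double dg = gradCur[i] - gradPrev[i];
        dw2 += dw * dw;
        dg2 += dg * dg;
    }
    double L = std::sqrt(dg2) / std::sqrt(dw2); // assumes the weights did move
    return 1. / L;
}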

The results are rather ambivalent. If the whole model consists of one Corner-activated neuron, it works great, finds the right rate, and converges quickly. The extra overhead is definitely worth the trouble there. Though interestingly, it first finds a solution with a slightly lower squared error, and then as it does more steps, the gradient reduces but the squared error grows a little. My guess is that it's probably caused by the adjustment to the Corner's offset weight that is not tied directly to the error, and what it thinks is the best balance doesn't produce the lowest error. Something to consider in the future.

With more neurons, things go worse. One aspect that definitely works well is preventing the runaway growth of the weights to infinity. I've tried an example that experienced such a growth before, and it didn't diverge any more, even though it produced some weights around 50. But the trouble is that the found descent rate is way too high. It works much better if I limit the rate to about 1/10 of the computed value. It descends rapidly there but then gets stuck circling with about the same error. Reducing the rate to 1/100 makes it converge slower but gets to a lower error. 1/1000 produces an even lower error. Which means that it gets the rate wrong, and just deciding on the descent rate from the last layer is not enough, the lower layers also need to be taken into account. The idea of estimating the error in the previous layer from both the gradient and the descent rate of its successor should work here, but it might involve a good deal of overhead. Needs more thinking.

On ReLU and Leaky ReLU it works very similarly but doesn't resolve the issue with centering the offsets, and the rate overestimation issue is particularly sensitive for the basic ReLU, since there it can easily lead to neurons becoming dead.

So I've thought alright, maybe the overshoots are caused by the weights flipping the sign too easily with the larger rate. So I've added the idea of a "weight floor": defining a small number (let's call it r) as the floor and making the values (-r < weight < r) illegal for the weights. Then if the descent tries to push a weight towards 0, crossing the +-r boundary, we'd stop it at the appropriate r or -r for this pass. If on the next pass the descent keeps pushing the weight further in the same direction, we'd "teleport" it from r to -r, or the other way around from -r to r. This would prevent the large jumps, avoid getting stuck at 0, and give a chance for a new gradient to be computed before going further, so that if it was an overshoot, it would get a chance to go back without causing a change of sign. And if different weights end up moved at different rates, it shouldn't matter much, because they are still moved in the right direction. Overall it works, and the "teleports" do happen occasionally. But it didn't resolve the overshooting issue. I guess the trouble is mostly with the overshoots that don't cause a sign flip.
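
The floor logic itself is simple; here is a sketch of it for one weight, with r being the floor value (an illustration, not the exact code):

// Apply one descent step to a weight with the "floor" at +-r: the pass that
// tries to cross into (-r, r) stops at the boundary on the same side, and only
// if the next pass keeps pushing in the same direction does the weight
// "teleport" to the floor value of the opposite sign.
double stepWithFloor(double weight, double step, double r)
{
    double next = weight - step;
    if (next > -r && next < r) { // the step tries to enter the forbidden zone
        if (weight > r || weight < -r)
            return (weight > 0.) ? r : -r; // first push: park at the floor
        // Already parked at +-r and still being pushed: teleport across.
        return (weight > 0.) ? -r : r;
    }
    return next;
}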

Saturday, November 26, 2022

optimization 9 - Lipschitz constant and earlier layers of neural networks

 I've tried to follow through with the path described in https://babkin-cep.blogspot.com/2022/10/optimization-8-descent-rate-and-v.html, find the Lipschitz constant for the neural network to both try to make the neural network converge faster and to avoid overshooting badly when the weights at the neurons grow larger than 1. In that previous post I've looked at the last layer, so then I've figured that I should look at the previous layers: since they don't have a square, they might be easier. Or maybe not: they feed into all the following layers.

But maybe we can make them easier. We can notice that in the original formula for gradient descent every step is multiplied by the descent rate. So maybe we can just pull that descent rate factor and "put it outside the parentheses". Which means that when we propagate the gradient from the last layer backwards, we can multiply it once by the descent rate, and that would allow us to skip multiplying by it everywhere else (which is also an optimization in the software sense!). The result of this multiplication (using r for the descent rate and v for the activated output of a previous layer's neuron) 

m = r * df/dv

can also be seen as having a "physical meaning" that says "I want to shift the output v of the previous-layer neuron by this much to achieve the next value of v" (here t is the index of the training pass):

v[t+1] = v[t] - m. 

In reality, of course, the output v feeds multiple neurons of the following layer, and these feedbacks get added together. But that can be folded into the same physical interpretation too: if this output feeds into N neurons, we can say that

r = R / N

where R is some "basic descent rate", and so

v[t+1] = v[t] - R * sum{i}(df[i]/dv) / N

That is, the addition turns into averaging with a different rate R. 

But wait, there also are multiple training cases (K of them), the gradients of which also get added up. That's not a problem, because the outputs from the different training cases also get added up.

An interesting corollary from this seems to be that when a neural network has a different number of neurons per layer, the descent rate should be normalized to the number of outputs, and the layers that have fewer outputs can use a higher descent rate. But I suppose there is also a counter-argument that if there are many outputs, the feedback from them is more likely to be contradictory, so the sum is more likely to be closer to 0 rather than all pointing in the same direction, so a higher rate may be useful there. That's probably why nobody bothers with adjusting the descent rate per layer.

So all right, let's build a simple example. Suppose we have a ReLU-like activation function with gradient of 1 and a neuron with 2 inputs and an offset weight

v = u[1] y[1] + u[2] y[2] + b

And since the training cases are added up linearly, we can just start with an example with one training case. Then we could write the function to minimize as

v = u[1] y[1] + u[2] y[2] + b - m

(getting ahead, this expression turns out to be wrong, but I want to get gradually to why it's wrong). The partial derivatives will be

dv/du[i] = y[i]

dv/dy[i] = u[i]

dv/db = 1

and computing at two points and finding the difference in the dimensions of gradients and arguments:

L^2 = ( (y[1, 1] - y[1, 0])^2 + (u[1, 1] - u[1, 0])^2 + (y[2, 1] - y[2, 0])^2 + (u[2, 1] - u[2, 0])^2 + (1-1)^2 ) /
  ( (u[1, 1] - u[1, 0])^2 + (y[1, 1] - y[1, 0])^2 + (u[2, 1] - u[2, 0])^2 + (y[2, 1] - y[2, 0])^2 + (b[1] - b[0])^2 )
  <= 1

because the lower side of the fraction has all the same terms for u and y (just in different order), and an extra positive term for b. So (1/L) >= 1, and propagating the same rate of descent or more should be what we need.

To test this conclusion, I've made up a small example with concrete numbers and did what the TFOCS library does: do some steps and compute the approximation of L by dividing the difference of gradients between the steps by the difference of weights. And... it didn't work out. Not only did it produce L > 1, but I'd go back and adjust the step to be smaller, and the new L would be even larger, and I'd adjust again to an even smaller step, and L would get even larger than that again.

What went wrong? I've got the formula wrong. Looking at the expression

v = u[1] y[1] + u[2] y[2] + b

it obviously has the minimum where all the variables are 0. So yeah, if we use L=1 and step straight to 0, that would minimize it. But we don't want to get to 0, we want it to get to some value (v[t-1] - m). The right formula for minimization should be

abs( u[1] y[1] + u[2] y[2] +b - (v[t-1] - m) )

The absolute value part was missing. And I don't know how to compute L with the absolute value. I guess one way to replace it would be to substitute a square for absolute value:

( u[1] y[1] + u[2] y[2] +b - (v[t-1] - m) )^2

It can probably be done just for the computation of L, even when the neural network itself stays without squares. But I'm not sure. 

Another interesting observation is what happened when my steps in the test have overshot the empirically computed L: they would flip the sign of the weights. Which means that the occasional overshoots are important, without them the signs would stay as they had been initially assigned by the random seeding. Even if we could compute and properly use the right value of L, we might not want to, because we'd lose the sign flips.

Thursday, November 3, 2022

V-shaped activation functions what-ifs

I've noticed that a simulation of the square function with a single neuron plus CORNER (V-shaped) activation function got worse after I've clamped the weights to the range [-1, 1]. It just couldn't produce a sufficient derivative to go to the full slope of the quadratic function, so it defaulted to the same shape as ReLU, going at slope 1 for the whole range of inputs [0, 1]. How about if I increase the allowed range of weights to [-10, 10]? Much better, then it could simulate the bend, producing the total maximum slope of about 1.5. So it looks like being able to go to the higher weights might be useful, just needs to be accompanied by the dynamic computation of L and the training rate from it.

Next, it looked like the right-side coefficient of the V can be fixed at 1, since the slope produced by the neuron itself would also work to produce the same effect. As it turns out, this doesn't work well. The trouble happens when the right side needs to change sign. When the right side of the V adjusts, it changes the sign easily. But without that, changing signs in the weights of the neuron itself doesn't work well, it often gets stuck. So this change doesn't work well. It's kind of interesting that when there is one variable, it changes the sign easily, if there are two, they get stuck, but if there are three, they change signs easily again. Though the real reason is probably that two variables are pushing against each other by using each other's weights as gradients, while the third variable doesn't, it just gets driven from the top. So maybe if one of two variables would be just driven from the top, that would prevent two variables from getting stuck too. Will need to think about that.

Next, what if we do a common Leaky ReLU but adjust the offset using the same formula as with the CORNER? (Leaky ReLU has to be used because with plain ReLU the gradients on the left side would be lost, and the computation for the offset won't work.) This worked about as badly as normal Leaky ReLU, and maybe even worse, so it's definitely not a solution on its own.

So it looks like the CORNER function does have value. And maybe the right-side coefficient can be gotten rid of if the main neuron weights can be made to change sign more actively, perhaps by feeding one of them the gradient from the top, or by applying the same gradient to them in both ways.

Wednesday, November 2, 2022

demo build target

 My experimental code for FloatNeuralNet has grown kind of large, and it runs kind of slow with valgrind, which interferes with the normal tests. So I've added a new entity in the C++ code: "demo", and moved all the slower experimental examples there. They are just like tests, only sit in the "demo" subdirectories, in the files starting with "d_". The target to build them is "mkdemo", to run without valgrind "demo" or "qdemo", with valgrind "vdemo" (this is different from the tests where the target "test" runs with valgrind).

BTW, the target "mktest" didn't run the library build as a dependency. It does now, in the main cpp directory (but not in the subdirectories). And the same applies to "mkdemo".

Sunday, October 30, 2022

optimization 8 - the descent rate and V-shaped activation

 In the last post I've talked about how the computation with the V-shaped ("corner") activation function sometimes diverges badly. It turns out, the activation function as such is not the reason.

Another change I've done at the same time was to remove the limit on the weight values (or practically speaking, changed it from +-1 to +-1e10). That was the reason for divergence. Set the limit back to 1, and the divergence disappears. On the surface, that limit of 1 was arbitrary, so if we want to model arbitrary functions, it made sense to remove this limit (and in particular, to do a XOR in a single layer, I needed the values equal to 2). But unfortunately there is a connection between the weights (i.e. multipliers) and the rate of descent that can be used. Here is how it works.

There is a way to evaluate, what rate of descent is safe, it's used in the ISTA algorithm and uses the Lipschitz constant L. The safe rate of descent (that doesn't cause us to overshoot the minimum and diverge) is 1/L. L is defined as follows:

L = max( || grad_f(x) - grad_f(y) || / || x - y || )

What is what in this expression:

x and y are two arbitrary points in the multidimensional space of weights being optimized. I.e. each one of them is a vector (i.e. an array in programming terms) of all the weights.

grad_f() is the gradient (i.e. the multi-dimensional derivative) of the function being optimized. It gets computed at the argument values x and y. Note that even though the function produces a single value, the gradient is also an array, of the same size as the function argument (because it contains a partial derivative for each input dimension).

The double-vertical-line "parentheses" mean the norm operation, more exactly the second norm. It's something that I either haven't seen or don't remember from college, but the norm operation is widely used in optimization. Its meaning is computing the length of the vector in the physical sense. There is this difference between math (and programming) and physics: in math (and programming) the length of a vector is the number of dimensions in it, in physics it's the length of the arrow that goes from the origin to the point determined by the values of the vector as coordinates in a multidimensional space. So if you want to talk about this "physical length" in math, you'd get confused between the two lengths, and thus they use a different word, "norm", for it.

There is also more than one kind of norm. In general they are written with an index, like ||x||_2, but if the index is not specified, 2 is assumed. The first norm, which could also be written as |x|, represents the distance as it's walked "in steps" along the coordinate axes. Yes, it's no accident that it's written in the same way as the absolute value: for a single dimension, the absolute value is the distance we walk from the origin to the value on a single axis. So norm 1 is the generalization of the absolute value for a vector. For another example, if we have a vector [4, -3], then we'd walk 4 steps along one dimension and 3 steps along another dimension, and norm 1 will be 7. It can be written as:

|x| = sum( abs(x[i]) )

The second norm uses squares and represents the distance in Euclidean space:

||x|| = sqrt( sum( x[i]^2 ) )

So for a vector [4, -3], norm 2 will be sqrt(4^2 + (-3)^2) = 5. Apparently there are higher norms too but I didn't look at them.
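
Spelling the two definitions out in code:

#include <cmath>
#include <vector>

// Norm 1: the sum of absolute values, the distance walked along the axes.
double norm1(const std::vector<double> &x)
{
    double sum = 0.;
    for (double v : x)
        sum += std::fabs(v);
    return sum;
}

// Norm 2: the Euclidean length of the vector.
double norm2(const std::vector<double> &x)
{
    double sum = 0.;
    for (double v : x)
        sum += v * v;
    return std::sqrt(sum);
}

// For the vector {4, -3}: norm1() returns 7, norm2() returns 5.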

Returning to the formula for L, with all the parts parsed, the meaning of it is "if we do a step from any point x to any other point y, what will be the maximum change of gradients between these points, in relation to the length of the step taken?" This is important for the gradient descent because there we take the steps proportional to the gradient. So we make a step that is based on the gradient, and after the step we get another gradient. And we don't want to overstep the minimum, which means we don't want to go beyond the point where the gradient is 0. So if L is large, meaning that a small step can cause a large change in gradient, we should be taking only small steps, and if L is small, meaning that nothing much changes  until a large step, we can take the large steps. Hence the safe rate of descent is 1/L.

For the quadratic polynomial functions, L can be computed as a constant. The expression for L divides the difference in gradient by the difference in x, which is essentially the second derivative, which is constant for the quadratic polynomial functions. But it's a derivative computed from the worst possible point in the worst possible direction, so this constant may underestimate the allowed steps. Consider a simple function

f = 1000 * x[1]^2 + x[2]^2

Here the dimension x[1] with its high coefficient of 1000 is what will drive the constant L to a high value. This value of 1/L will be so good that it will really get us to the optimal value of x[1] in a single step, essentially applying Newton's method! (Possible for the simple functions that don't have dependencies between two variables). But then we'll have to take 1000 steps of this size in the dimension x[2] to get to its optimum. Not great but at least safe. There is a Matlab package for optimization called TFOCS, and what they do is measure the actual L from the recent steps: we've changed x[] by so much, and gradient has changed by so much, we can divide them and assume that it's a reasonable estimation of L for the part of the function where we're traveling. And hopefully we're traveling towards the optimum, where the gradients become smaller, so we can assume that L should be getting smaller too.  So in this example, after TFOCS gets the dimension x[1] right, it will detect that in the following steps L becomes smaller, and it can take the larger steps. 

We can do it for the more general functions too, where the second derivative is not constant but depends on X, with a similar assumption that we're traveling towards the areas with smaller gradients, but they're kind of difficult in our case.

We can write out the loss function to minimize in the simplest case (for one neuron of the last layer) as

f = (x[1] * x[2] + x[3])^2

Here x[1] is one input, x[2] is one weight, and x[3] represents the whole rest of the neuron bunched together: the sum of other products of inputs and weights, and subtraction of the true result from the training case. Since I'd like to avoid dragging around the indexes, I'll rename the variables: x[1] will become x, x[2] will become y, and x[3] will become b:

f = (x*y + b)^2

Then the dimensions of its gradient will be:

df/dx = 2*y^2*x + 2*b*y = 2y(xy + b)

df/dy = [similarly] =2x(xy + b)

df/db = 2*x*y + 2*b = 2(xy + b)

Yes, b counts here as a variable too, cumulative of its internal parts. Let's do the initial computations for the square of L, then we can use the squares of the norms and not mess with roots for now. There let's use the indexes to identify the point, so the dimensions (x[0], y[0], b[0]) will be one point and (x[1], y[1], b[1]) will be the other point. Then the squared norm 2 for the points will be:

|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2 + (y[1] - y[0])^2 + (b[1] - b[0])^2

And the squared norm 2 for the gradients will be:

|| grad_f[1] - grad_f[0] ||^2 =
= (df/dx[1] - df/dx[0])^2 + (df/dy[1] - df/dy[0])^2 + (df/db[1] - df/db[0])^2 =
=  (2 y[1] (x[1]y[1] + b[1]) - 2 y[0] (x[0]y[0] + b[0]) )^2 +
  + (2 x[1] (x[1]y[1] + b[1]) - 2 x[0] (x[0]y[0] + b[0]) )^2 +
  + (2 (x[1]y[1] + b[1]) - 2 (x[0]y[0] + b[0]) )^2

Finding or even estimating the whole maximum is so far too complicated for me but as the first approximation let's consider what happens if the point moves in one direction only. For example, it moves by x and y[1]=y[0] and b[1]=b[0]. Then the expressions become simpler:

|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2

|| grad_f[1] - grad_f[0] ||^2 =
= (2 y^2 (x[1] - x[0]))^2 +
  + (2 (x[1]^2 - x[0]^2) y + 2 (x[1] - x[0]) b )^2 +
  + ( 2 (x[1] - x[0]) y )^2 =
= (2 y^2 (x[1] - x[0]))^2 +
  + (2 (x[1]^2 - x[0]^2) y )^2  + 8 y b  (x[1]^2 - x[0]^2) (x[1] - x[0]) + ( 2 (x[1] - x[0]) b )^2 +
  + ( 2 (x[1] - x[0]) y )^2

When we divide the squared norm of the gradient difference by the squared norm of points difference, we get:

L^2 =4 y^4 + 4 y^2 +
  + 8 y b (x[1]^2 - x[0]^2) / (x[1] - x[0]) +
  + 4 b^2 + 4 y^2

And there is a similar derivation for the other dimensions. This expression works out not so bad for |x| <= 1 and |y| <=1. In this situation we're guaranteed (x[1]^2 - x[0]^2) / (x[1] - x[0]) < 1, and all of the 2nd and 4th powers of y being < 1, so the result is limited to

L^2 <= 4 + 4 + 8 b + 4 b^2 + 4

and so 

L <= 2 * sqrt(b^2 + 2 b + 3)

 That's only one dimension for calculation of L, by x, but the other dimension by y is similar, and for dimension b the formula reduces to an even shorter one

L <= abs(2 * b)

So if we combine the dimensions (I'm not sure if this is the right thing to do but I guess close enough) the full L will be

L <=  2 * sqrt(3 b^2 + 4 b + 6)

This explains why in my small examples with very few neurons the training rate of 0.1 per round (divided by the number of training cases in a round) generally works well.

But if the weights are allowed to be over 1, then all of a sudden we get squares of large values, and L grows quite large, and 1/L becomes very small.

So I've tried to limit the weights to +-1, and lo and behold, the examples that used to be badly diverging, are now converging very nicely, and produce a decent approximation of a parabola with an error value quite close to a manually-built approximation.

I guess this means that much of the training speed can be gained with the right dynamic estimation of L. And I get the feeling that when we propagate the gradients backwards through the layers, and multiply them by the coefficients smaller than 1, this means that not only the gradients diminish but also the L value should be shrinking, and the learning rate of 1/L should be growing. But working this out will take more time and experimentation.

Monday, October 24, 2022

V-shaped NN activation function 2

I've got around to try the non-standard activation function I've described in the last post, I've named it "corner" function.  And there is some bad news, some good news.

First, since I've added support for arbitrary activation functions to FloatNeuralNet, I've tried the "leaky ReLU" too. It's like ReLU but the left side has a small positive gradient instead of 0. On the weird task I've been playing with (simulation of a smooth quadratic function) it didn't work any better than normal ReLU.

Then I've tried my corner function, first with just a single neuron. And the result was mixed. Sometimes it did work decently, sometimes it didn't. It didn't when the breaking point happened to be outside of my range of input values. Just as I suspected, there is no consistent force that adjusts the breaking point's position horizontally. If it gets into the range of the training data, it does something (not great either), but if it isn't in range to start with, it does pretty much nothing, and the function reverts to doing a linear approximation.

So I've tried to solve this empirically. I've thought of measuring the amounts of error on the left and right side of the breaking point. Then if we have more error on one side, this probably means that this side is more non-linear than the other side. And we could approximate it better if we had to deal with fewer sample points on the more non-linear side, so we have to move the breaking point toward that side, until some points shift to the other side and the balance returns. This sort of reminds me of AdaBoost that tries to use the result of a partial solver to balance the results by some criterion, but obviously it's not quite the same thing. 

To give an example in beautiful ASCII art, in a situation like this, where the breaking point is marked with "@", the line produced by the neuron with "-" and the training points with "*":

                   * --
                   --
---- *           --
 *  ---- *     --
      * ---- --   *
           *@  *

shifting the breaking point to the right would allow a better match on the right side without messing up the left side:

                   * -
                    -
---- *             -
 *  ---- *        -
      * ----     -*
           *---*-
               @

 

For the measure of the error we can use the absolute value of sigma (sigma is the error gradient propagating from the higher layer). It's absolute because we don't care if the point is above or below our computed value, that part would be sorted out by optimization of the other weights. And the thing that shifts the breaking point is the bias weight in the neuron. So if the value produced by neuron before activation is positive (i.e. on the right side from the breaking point), we can use the gradient with abs(sigma), if negative (on the left side) then with -abs(sigma).

The question is, how exactly to fold it into the gradient. Just using it as a gradient seems to have too strong an effect. Except when the neuron produces a signal that is always very close to 0, then it's not enough, because sigma becomes very small. Adding a small constant of the same sign solves this particular problem. But the rest of the time it still seems too strong. For that I've tried to take a fraction of the absolute sigma. Then I've tried to mix it with the traditional gradient. Both worked better, but then there is the question of what fraction to take and at what proportion to mix? I don't know yet how to figure it out, and I'm not sure that the empiric values transfer well between different examples. So the other way I've tried was to compute

gradient = output_before_activation * abs(sigma)

It seems to work decently well but it prioritizes the far away points over the near ones, and I'm not sure if this is the right thing to do, it's rather counter-intuitive.
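
To summarize the empirical rule in code (a sketch of the two variants described above, not a final answer):

#include <cmath>

// Empirical gradient for the offset weight of a Corner-activated neuron,
// meant to shift the breaking point towards the side with more error.
// 'preact' is the neuron's output before activation, 'sigma' is the error
// gradient propagated from the next layer.

// Variant 1: only the side of the breaking point matters.
double cornerOffsetGradientSign(double preact, double sigma)
{
    return (preact >= 0. ? 1. : -1.) * std::fabs(sigma);
}

// Variant 2: weigh by the distance from the breaking point (this is the one
// that prioritizes the far-away points).
double cornerOffsetGradientScaled(double preact, double sigma)
{
    return preact * std::fabs(sigma);
}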

With this, an one-neuron model works quite well. The problems start with more neurons - sometimes they do converge quite well, much better than what I've seen with ReLU, and sometimes not. One weird thing that happens is that the model goes converging quite well for a while, and then all of a sudden the error jumps up, then continues converging. Sometimes the training rate turns out to be too high, and the model explodes. Maybe those jumps in the other cases are also results of the training rate being too high at some points, which causes these jumps. Or maybe as the breaking point shifts and a training point ends up on a different side of it, that causes a strong jump. Or maybe both are connected, I don't know yet. Sometimes the training rate turns out to be too low - this generally happens when a neuron is driven to produce near 0, which tends to happen when the random seeding makes the V point upside down, and then as it tries to rights itself, it sometimes ends up with a combination of weights that shrinks the gradients. This is not a unique problem, this is something that can happen with any neural network. By the way, this time the variant that computes the gradient for the whole training pass before adjusting the model tends to be more stable  than one that makes a step after each training case.

So where does it leave me? It looks like the whole thing can benefit from some momentum method to get out of the low-gradient situation, and from the auto-detection of the learning rate that is too high in the high-gradient situations. And maybe to think of some other way to prevent the wedging on 0s. The empiric solution for positioning the breaking point needs to be worked through the math, and maybe that would suggest a better solution too. And maybe the same way of adjusting the bias weight can be applied to the classic ReLU too (only the left side that generates 0 would probably have to use the sigma that didn't get zeroed by that yet).

Thursday, October 20, 2022

V-shaped NN activation function

In my previous post https://babkin-cep.blogspot.com/2022/10/optimization-7-follow-up-to-neural.html I've had the idea that maybe there should be a V-shaped activation function. After some thinking, I think I've got an idea of what this function could be. Let's call it A(v):

A(v) = {
    v < 0 : k1*v + k2;
    v = 0 : k2;
    v > 0: k3*v + k2;
}

It's not necessarily V-shaped as such, it just consists of two linear segments joined at a breaking point. k2 shifts this point vertically. Note that in the neuron formula (using the same variables as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html):

N(v[]) = sum(v[i] * x[i]) + x0

the offset x0 also shifts the result vertically, but does that before activation, effectively shifting the breaking point left or right, while k2 does it after activation and shifts the result up or down.

If the values happen to be k1=0, k2=0, k3=1, then A(v) is equivalent to ReLU. But a major difference here is that k1, k2, k3 also get adjusted by backpropagation. It nicely fits into the backpropagation model: the function is not smooth but it has a derivative defined at every point, and it's also piecewise-linear, so it can even be applied after the last layer of neurons.

Applying the same sample logic for the last layer as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html, if we have 

N(v[1..3]) = v[1]*x[1] + v[2] * x[2] + v[3]*x[3]

then after activation, assuming that N(v[1..3]) < 0, the formula will be:

output = A(N(v[1..3])) = k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2

Then the squared error  will be:

f(v[1..3]) = (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b)^2

and using the same logic as before, the partial derivatives will be:

df/dx[1] = 2 * k1 * v[1] * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
  = 2 * k1 * v[1] * (output - b)

df/dk1 = 2 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3])
    * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
  = 2 * N(v[]) * (output - b)

df/dk2 = 2 * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
  = 2 * (output - b)

And when propagating to the previous layers, the propagation back through the activation function will multiply the sigma value by its derivative:

dA/dv = {
  v < 0 : k1;
  v = 0 : 1;
  v > 0 : k3;
}

and the gradient (AKA derivative) for k1 or k3, depending on which path was taken, will be (N(v) * sigma), and for k2 it will be sigma.
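
Putting the definition and these gradients together, a sketch in code (the names are made up, this is not the FloatNeuralNet interface):

// The "corner" activation: two linear segments joined at v = 0, with the
// joint shifted vertically by k2. The slopes k1, k3 and the shift k2 are
// trainable, adjusted by backpropagation like the other weights. Here v is
// the neuron's output before activation.
struct CornerActivation {
    double k1, k2, k3;

    double value(double v) const {
        return (v < 0. ? k1 * v : k3 * v) + k2;
    }
    double derivative(double v) const { // dA/dv, for propagating sigma back
        if (v < 0.) return k1;
        if (v > 0.) return k3;
        return 1.; // the convention chosen above for the breaking point itself
    }
    // Gradients by the activation's own weights:
    double gradK1orK3(double v, double sigma) const { return v * sigma; }
    double gradK2(double sigma) const { return sigma; }
};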

Another interesting property of the V-shaped function is that it can implement a XOR logical operation on two inputs in one layer instead of the classical two. If we have false defined as -1 and true as 1 then we can define:

N(v[1], v[2]) = v[1] - v[2]

k1 = -1

k2 = -1

k3 = 1
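
A quick check: N(-1, -1) = 0 and N(1, 1) = 0, which activate to k2 = -1 (false); N(1, -1) = 2 activates to k3*2 + k2 = 1 (true); and N(-1, 1) = -2 activates to k1*(-2) + k2 = 1 (true). So the single neuron does compute XOR.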

It could also be argued that this activation function essentially adds an extra layer to every layer of neurons, but it's a special layer, much cheaper than a normal one. I think implementing an equivalent of V-shaped activation with plain ReLU would even require three layers to keep k2 common.

So it looks interesting and promising, although I'm not entirely sure that the breaking point will end up moving in the "intuitively correct" directions. Now I need to try and see. And first, reorganize the training code in FloatNeuralNet to support multiple kinds of activation functions, and factor out the repetitive code in it.

Monday, October 17, 2022

optimization 7 - follow-up to neural network corollaries

 I was able to do some experimentation, so I have a follow-up to the last post. If you want to refer to the FloatNeuralNet code (and especially do it in the future when the code will have progressed forward), the current version is https://sourceforge.net/p/triceps/code/1733/tree/ .

First of all, it was bothering me that the indexing in the multidimensional training arrays of FloatNeuralNet was done all by hand and hard to keep track of, so I've changed it to use a more slice-like abstraction of ValueSubVector and ValueSubMatrix. And I've found a bug that was mixing up the first layer's weights pretty badly: an extra number that was supposed to be added to a value was added to an index, and read some garbage from memory (but not far enough to get out of the used memory and be detected by valgrind). Yes, a direct result of doing too much stuff manually. And it's been hard to detect because it didn't always show much, and with so many repeated computations it's rather hard to keep track of whether the result is correct or not. Fixing this bug removed the really bad results. Still, the results aren't always great: the square error for my test example on random seedings fell into two groups, one at almost-0.09, one at 0.02-something, while starting with a manually constructed good example gave an error of 0.0077. (As a reminder, my basic test example is not very typical for the world of machine learning but tries to do an "analog neural network" that computes the square of a number between 0 and 1.)

Next I've tried the reclaiming of unused neurons by just inverting the signs of weights, and setting the next-layer weights for its inputs to 0. With the bug fixed, this worked very well, pulling that weight out of 0 very nicely.

Next I've added a cumulative error and gradient computation to the main code (previously the error was computed by the test code). This introduced the methods for the start and end of the pass that initialize and summarize the stats for the pass, and methods for extracting the stats. As I have expected, the errors computed during the pass were mostly larger than the true error computed at the end of the pass, because of each training example of the pass receiving the weights in the state that's pulled by all the other training examples away from its optimum. But sometimes it also happens the other way around, I don't know why.

Computing the gradients and errors in a stable state is easy to do too: just do a pass with rate=0, so that the weights don't get changed. And then I've added a method to apply the computed gradient after the pass. So the whole procedure looks like this:

nn.startTrainingPass();
for (int i = 0; i < num_cases; i++) {
    nn.train(inputs[i], outputs[i], 0.); // rate 0 only computes the gradient
}
nn.endTrainingPass();
nn.applyGradient(rate);

This actually does converge better than doing the gradient steps after each training case, most of the time. The trouble is with the other times: I've encountered an example where each step caused a neuron to become dead, and the training didn't work well at all. What happened is that the random seeding produced two dead neurons, and then after a reclamation, for all I can tell they ended up fighting each other: one pulling out of 0 pushed the other one into oblivion. The exact same example didn't exhibit this issue when doing the steps case-by-case. Which explains why the case-by-case updates are the classic way: they introduce the jitter that makes the gradient descent more stable in weird situations. BTW, you can mix the two approaches too, and it works: just use a non-0 rate in both train() and applyGradient().

Well, so far so good, so I've started looking into why the random seedings produce two groups error-wise. I've even encountered a seeding that produced a sub-0.09 error but with the gradient going to 0, so it obviously thought that this was the best fit possible! It took me some time to understand what was going on, and then I've realized that all the sub-0.09 seedings produced a straight-line dependency instead of an approximation of a parabola! And the 0.02-something seedings created a line consisting of either 2 segments, or 3 segments (which is as good as it gets with 3 neurons) but with one segment being very short.

For all I can tell, the reason is that the network is designed to do a linear regression, and successfully does that, fitting a straight line as well as it can. With the ReLU activation function (which is the only one I've tried so far), the nonlinearity is introduced by the "breaking point" that cuts off the values below 0, but there is nothing at all that optimizes this breaking point. Its position is merely a side effect of the multipliers and the offset weight, and those get optimized for the position of the straight line, the breaking point being just a by-product. So this breaking point gets determined purely by the random seeding. I guess this might not matter for many ML applications where the straight lines are good enough and there are many more possible combinations of input signs than there are neurons to make even the straight lines. But it explains why some kinds of ML models (if I remember right, the text processing ones) work with a sigmoid activation function but not with ReLU, even though theoretically sigmoids can be generated from ReLU (this is really the point of my tiny test model: can you generate an approximation of a smooth function such as a sigmoid with ReLU?). I'm not even sure that sigmoid works that well, since there the line has two (smoothed) "breaking points", which for all I can tell are not directly optimized either but positioned as a semi-random side effect of other optimization pressures. Maybe a way to combine the best features of ReLU and sigmoid would be to have a V-shaped activation function where the two halves of the V get split off and fed to different inputs in the next layer, and the position of the center of the V gets directly driven by its own gradient. Yes, this would increase the cost of the NN, but only linearly, not quadratically, as increasing the number of neurons per layer would. But I have no answer yet on how exactly to do this, and whether it's possible at all.

Wednesday, October 5, 2022

optimization 6 - neural networks corollaries

Writing up the mathematical derivation of the backpropagation got me thinking about what else can be done with it. So I've decided to write these ideas down before I forgot. Not all of them, or maybe none of them, might pan out, I haven't tried yet. What bothers me is that some of these ideas should give some substantial improvements, and they're not that hard to think of, so if they do give substantial improvements, someone would have already thought about them and they would be used all over the place. But I think they aren't. So either they're wrong and don't give the improvements, or maybe they are used and only I in my ignorance don't know about it, or maybe they really are new and worth trying.

First, I think I've been wrong thinking that doing a gradient descent step based on all the training data is too much work. If we're going to minimize the error from all the training cases, then the function to minimize will be the sum of errors from all the cases. And if we want to compute a gradient, the gradient of the sum is equal to the sum of gradients of individual parts. And we can apply the exact same principle to the whole backpropagation. So doing this compound step won't even require that much memory: just a single gradient value keeping the sum for each weight. We'd backpropagate as usual, and as we compute the gradients, we will add them to the sums. But not change the weights yet. Only after a whole pass of training is done, we'd use the computed gradients to change the weights once.
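
To make it concrete, here is a minimal sketch of this accumulation (hypothetical code, not FloatNeuralNet's actual API): computeCaseGradient() stands in for the usual backpropagation of one training case, and the only extra state is one running sum per weight.

    #include <vector>

    // Accumulate the per-weight gradients over a whole pass, apply them once.
    void trainOnePass(std::vector<double> &weights,
                      const std::vector<std::vector<double>> &trainingCases,
                      double rate,
                      void (*computeCaseGradient)(const std::vector<double> &,
                                                  const std::vector<double> &,
                                                  std::vector<double> &))
    {
        std::vector<double> sum(weights.size(), 0.);
        std::vector<double> grad(weights.size(), 0.);
        for (const auto &c : trainingCases) {
            computeCaseGradient(weights, c, grad); // backpropagate but don't step yet
            for (size_t i = 0; i < sum.size(); i++)
                sum[i] += grad[i]; // gradient of the sum = sum of the gradients
        }
        for (size_t i = 0; i < weights.size(); i++)
            weights[i] -= rate * sum[i]; // one cumulative step per pass
    }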

But why bother with it? Well, there are methods to select a larger gradient step rate that will get us closer to the goal in one step (such as ISTA with its Lipschitz coefficient that is used for the estimation). Just scale it down by the number of layers, to account for each layer trying to do its own equal share of adjustment. And then we can also use the momentum methods, such as FISTA (it stands for Fast ISTA) to accelerate further, for each neuron. If this works right, this should accelerate the training big time, at least for the ReLU activation function that doesn't mess up the gradients much. So if nobody uses it, there should be something wrong with this idea. It will be interesting to try. Maybe something does get lost when many small steps get replaced by one big cumulative step.
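
For reference, here is roughly what a FISTA-style momentum step looks like, in a generic form not tied to any particular network code (the proximal/limit-saturation part is left out of this sketch):

    #include <cmath>
    #include <vector>

    // x holds the last accepted weights, y the extrapolated point at which
    // the gradient grad was computed, rate is the basic step size,
    // t is the momentum coefficient (starts at 1 and grows).
    void fistaStep(std::vector<double> &x, std::vector<double> &y,
                   const std::vector<double> &grad, double rate, double &t)
    {
        double tNext = (1. + std::sqrt(1. + 4. * t * t)) / 2.;
        double beta = (t - 1.) / tNext;
        for (size_t i = 0; i < x.size(); i++) {
            double xNew = y[i] - rate * grad[i];  // plain gradient step from y
            y[i] = xNew + beta * (xNew - x[i]);   // extrapolate past the new point
            x[i] = xNew;
        }
        t = tNext;
    }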

Potentially, a larger ISTA-like step rate would work on the small individual steps too, just scale the steps down more, by the number of training cases. However there is a catch in this approach: if each step takes us closer to the optimum of one training case, then the last training case will have the strongest effect, while the effect from the first training case will be the most diluted. Adding them up equally, with each trying to do its step from the same initial position, evens them out. If the steps are small anyway, this difference doesn't matter, but if we're aiming for the larger steps, maybe it would. On the other hand, if all the cases are pushing the outputs in about the same direction anyway, doing many small separate steps would follow the gradient curve better than one large step (because, remember, we would see the zigzags with the large steps). And if we do many training cases, and repeat them in many passes, there would not be any relative advantage between the cases, they would be constantly trading places, the first one again becoming the last one (except on the last pass). So maybe that's why nobody accumulates the gradients. But even then I still think that a larger Lipschitz-based rate can be used, and maybe even the momentum can be carried between the passes.

The cumulative gradient for the last layer can also be used to estimate one more thing: how close we are to the optimum, and so whether it's time to stop. The squared error itself can also be used for this estimation, but the trouble with it is that we don't know how close to zero it can ever potentially get for a given configuration of a neural network. If there is much more input data than neurons (which is usually the case) then the fit won't be perfect, and we don't know in advance how good it could be. The gradient in the last layer will be gradually dropping (more or less) as we're getting closer to the optimum, so it would be a good indicator of how far we still are. Together the gradient and the squared error can also be used to detect overfitting, when the error near the optimum becomes too small. Well, maybe the cumulative gradient for the last layer, whose parts are generated not all from the same starting point but over one pass of the moving network, would work just as well.
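
As a sketch of such a stopping check (hypothetical code, and the threshold would have to be tuned by experiment):

    #include <cmath>
    #include <vector>

    // The L2 norm of the accumulated last-layer gradient shrinks as the
    // optimum gets near, so stop once it drops below a threshold.
    bool closeEnough(const std::vector<double> &lastLayerGradSum, double threshold)
    {
        double sq = 0.;
        for (double g : lastLayerGradSum)
            sq += g * g;
        return std::sqrt(sq) < threshold;
    }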

Another problem with a possible solution is what if the random seeding of the weights generates two neurons in the same layer that are very close to each other, or even identical? They will try to stay close, and together will be a waste of space, something I've seen on my very small test examples. The solution might be that we don't have to use the same step rate for all the neurons. If we use a slightly different rate for different neurons in the layer, it would gradually push them apart, and the farther apart they get, the more should be the difference in the pressures from the gradients, so this should naturally spread them in a more useful way. But each one would still be pushed towards achieving the global optimum, just at a different speed.
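
A tiny sketch of how such a per-neuron spread of rates might be produced (the 10% spread here is an arbitrary assumption):

    // Neuron n out of total gets a rate between 0.9 and 1.1 of the base rate,
    // so identically-seeded neurons get pushed apart over many steps.
    double neuronRate(double baseRate, int n, int total)
    {
        double frac = (total > 1) ? double(n) / double(total - 1) : 0.;
        return baseRate * (0.9 + 0.2 * frac);
    }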

This got me thinking about yet another problem: what if a neuron with ReLU becomes dead? I.e. none of the inputs produce a positive value, and after the ReLU activation the neuron always produces 0. One possibility that I didn't notice before is that the changes in the lower-level neurons might eventually resurrect the dead neuron in a natural way by changing its inputs. Except for the neurons of layer 0: those will never see their inputs change. So maybe it's worth waiting longer for the higher-layer neurons to become resurrected naturally before resurrecting them forcibly by reinitialization.

And it's been bothering me that I did this reinitialization by generating new random numbers, which is both unpredictable and, as an immediate effect, messes up the quality of the whole network. So maybe a way to avoid the unpredictability is by adding the weights from another neuron at a fixed relative position to this one. They've been both random to start with, so now we're adding together pairs of random numbers, which should still give random numbers. If the technique above for separating the equal neurons works well, it can even be done without disturbing the quality of the network: copy the values from another neuron of the same layer, and split the weight corresponding to that other neuron in each neuron of the next layer into two parts (either equal, or maybe better not) and put the other part into the weight corresponding to the resurrected neuron. Initially they together will produce the same input to the next layer as the old neighboring neuron did alone, and then hopefully the backpropagation pressure will push them further apart.
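
Here is a sketch of that "copy and split" resurrection, on a hypothetical layout where layerWeights[neuron][input] are this layer's weights and nextLayerWeights[nextNeuron][thisNeuron] are the next layer's weights that consume this layer's outputs (not the actual FloatNeuralNet data layout):

    #include <vector>

    // The dead neuron becomes a copy of a live donor, and every next-layer
    // neuron splits the weight it had for the donor between the two, so the
    // next layer initially sees exactly the same summed input as before.
    void resurrectBySplit(std::vector<std::vector<double>> &layerWeights,
                          std::vector<std::vector<double>> &nextLayerWeights,
                          int deadNeuron, int donorNeuron, double share = 0.5)
    {
        layerWeights[deadNeuron] = layerWeights[donorNeuron];
        for (auto &w : nextLayerWeights) {
            w[deadNeuron] = w[donorNeuron] * share;
            w[donorNeuron] *= (1. - share);
        }
    }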

Or yet another solution to the unpredictability might be to simply change the signs of all the weights in a dead neuron, then it will be firing on all the inputs! And to avoid disruption, set all its connected weights in the next layer to 0.
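
With the same hypothetical layout as in the previous sketch, the sign-flip variant is even simpler:

    #include <vector>

    // Flip the signs of all the dead neuron's weights (including the bias),
    // so it starts firing on the same inputs where it used to be cut off,
    // and zero its outgoing weights to avoid disturbing the next layer.
    void resurrectByFlip(std::vector<std::vector<double>> &layerWeights,
                         std::vector<std::vector<double>> &nextLayerWeights,
                         int deadNeuron)
    {
        for (double &w : layerWeights[deadNeuron])
            w = -w;
        for (auto &w : nextLayerWeights)
            w[deadNeuron] = 0.;
    }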

Yet another problem seemed to be that the gradient is proportional to either the input value (for the gradient of weight), or to the weight (for the input value), and so if one becomes 0, the other one stops changing. If both ever become 0, they get stuck there forever. But if at least one of them is not 0, it might be able to push the other one out, which would also provide a positive reinforcement to itself, so they would bootstrap up and it really isn't a problem. But maybe this doesn't bootstrap too well, or people would not be recommending to make the initial random values larger. So maybe a solution there would be to forcibly change a value that is too close to 0 to a larger one for the purpose of the gradient computation. It will overestimate the gradient, but that will either be a good thing, or the next pass (or even step) will restore equilibrium by pushing the values back to 0. And we can also keep some hysteresis, and pull the gradient artificially up only if it would move the point in the same direction as before.
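
A sketch of that forced bump, used only when computing the gradient (the cutoff of 0.01 is an arbitrary assumption, and the hysteresis part is not shown):

    #include <cmath>

    // If a value that would multiply into the gradient is too close to 0,
    // substitute a larger value of the same sign, so the other factor can
    // still get pushed out of its stuck spot.
    double bumpForGradient(double v, double minAbs = 0.01)
    {
        if (std::fabs(v) >= minAbs)
            return v;
        return (v < 0.) ? -minAbs : minAbs;
    }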

A kind of long list of things to try, next I'll need to find time for it all!

Monday, October 3, 2022

optimization 5 - backpropagation further backwards

In the previous post we've got the last layer, how do we backpropagate from it? Backpropagation fundamentally means, how do we modify the preceding layer to change the vector v? Which can be done by doing a gradient step on v. So they turn the expression around and say that now v is the variable and x is the constant (yeah, my attempt from the last post to avoid having x as a constant was ultimately in vain, but it got us through a good deal of description). Applying the symmetrical logic we get:

df/dv[k] = 2 * x[k] * (output - b)

Note that here x[k] has the value from before it has done its own gradient step. The estimation of the gradient by v for the next step can be improved a little by using the new value of x and recomputing the new value of output. But since the step is small, it probably doesn't matter much, and computing the new output value is extra work.

In FloatNeuralNet code this happens in the line

    wsigma += weights_[weightpos] * s_next[i];

where s_next[i] was previously set to (2 * (out - outputs[i])). The addition is there because of the next part: each neuron in the inner layers affects multiple neurons in the next layer. 

To show what is going on, let's consider a small example with a neuron of a layer-before-last with 3 inputs, that feeds into two neurons of the last layer, in beautiful ASCII art:

                                        +--> *x[1][1]----V
                                        |    *x[1][2]--> sum = f1 -> -b[1]
u[1] -> *y[1]----V                      |    *x[1][3]----^
u[2] -> *y[2]--> sum = s -> act = v[1] -+
u[3] -> *y[3]----^                      |
                                        +--> *x[2][1]----V
                                             *x[2][2]--> sum = f2 -> -b[2]
                                             *x[2][3]----^

Here I've used the variable names u and y for the input and weights of the new neuron. After summation it produces the value s, and after the activation function the value v[1]. This value v[1] is then fed as the first input of the last-layer neurons. The other inputs don't matter here, so they are not shown. The weights of the last-layer neurons are still x, but now with an extra index to tell to which neuron do they belong. The results of the last-layer neurons are f1 and f2, that are compared with the training outputs b[1] and b[2].

Let's consider for now that the activation function just passes its argument through (such as ReLU for the positive values), and so v[1] = s.

The full error (AKA loss) function f that we minimize will be

f = f1 + f2

The partial derivative by v[1] (AKA the first dimension of the gradient of f) will also be a linear combination:

df/dv[1] = df1/dv[1] + df2/dv[1] =
    = 2 * x[1][1] * (output[1] - b[1]) + 2 * x[2][1] * (output[2] - b[2])

So that's where the addition comes from, the derivatives from the last-layer neurons are added up to get the derivative of the whole function that we're minimizing.

In the inner layers there are no squares on the outputs, so the derivatives are much easier to compute. If the activation function is pass-through then:

dv[1]/dy[k] = u[k]

If the activation function is not pass-through, its derivative gets included simply as a factor:

dv[1]/dy[k] = u[k] * (dv[1]/ds)

In FloatNeuralNet code this translates to the lines:

wsigma += weights_[weightpos] * s_next[i]; // in an inner loop
s_cur[prev] = der * wsigma;
// s_cur gets moved to s_next for the next iteration on the preceding layer
Value gradient = input * s_next[i];
Value newwt = saturate(weights_[weightpos] - rate * gradient);

And then it continues propagating in the same way through more and more layers.

An interesting consequence of all this is that on each layer there are two gradient descents going on: once by adjusting the weights of this layer, and once by adjusting the previous layer. So if we really wanted to compute the perfect match for the weights of this layer and did it, then right after that the adjustment to the previous layers would overshoot the minimum towards the other side. So it's actually good that the descent rate value is on the small side, and it gets amplified by the number of layers:

The last layer descends by rate, and passes the gradient to the previous layer;
The next-to-last layer descends by rate, and passes the gradient to the previous layer;
... and so on.

So the effective descent rate is really (rate * n_layers), which means that the shallow neural nets can use a higher rate value than the deep ones. Well, I guess a counteracting force might be that on the way back the gradients might be getting more and more muddled and closer to 0, so the same rate in the earlier layers will produce a smaller effect. Apparently, that's why in the very deep networks (over 100 layers) they make some layers double-connected, feeding both the next layer and another layer that is maybe 20 or 50 layers later, to bring in the stronger gradients. I wonder if a similar effect can be obtained by using a larger descent rate in the earlier layers.

optimization 4 - backpropagation for the last layer

As you might have guessed already from the discussion of gradient descent, there is a direct relation between non-linear optimization and backpropagation. Previously I've pointed to https://www.researchgate.net/publication/266396438_A_Gentle_Introduction_to_Backpropagation as a good intro reading. Having the optimization background and looking at the neural network training problem as an optimization problem can explain where all this logic comes from. Here is this explanation:

First, about the variable names. Traditionally, the linear regression stage in a neuron is described as:

output = sum(x[i] * w[i])

where I'll use the programming-like square-bracket indexes instead of the math-like subscript indexes; x is a vector of inputs and w is a vector of weights. By the way, what we call in programming the indexes or elements of a vector, in mathematics they call dimensions, sometimes not differentiating between the index and element, and using the word "dimension" for both. I'll use this word sometimes too. 

The thing though is that when doing the training, in this naming x is a constant and w is our variable that we're trying to optimize to achieve the training. Mathematically, it doesn't matter which letter is used for what, but I personally am used to x being a variable, so seeing it as a constant confuses me and makes me make more errors. So I'll use a different assignment of the names:

output = sum(v[i] * x[i])

where v is a vector of inputs and x is a vector of weights.

And let's ignore the non-linear activation function after the linear regression at the output side of the neuron for the moment. My experience shows that you can't use the ReLU function at the output of the last layer anyway, because the negative side won't backpropagate properly, and the Gentle Introduction also says that the activation function is not used after the last layer (and before this post is over, you'll see why). Thus let's start the consideration from the last layer that doesn't have this extra complexity.

We want to train the neuron to minimize the squared error on all the examples. I.e. if we have a set of N training examples, so that j-th example is (v[j], b[j]) where v is the vector of inputs and b is the expected output, then we want to minimize

min(f(x)),
f(x) = sum{1 <= j <= N}( (output(v[j], x) - b[j])^2 )

and if we expand the output() function and assume that the neuron has K inputs, that expression becomes:

f(x) = sum{1 <= j <= N}( (sum{1 <= i <= K}(v[j][i]*x[i]) - b[j])^2 )

If we ignore any boundaries, minimization would require just finding the values of x where the function is minimal, and since the function is quadratic and the shape of the curve will be a multidimensional parabola, there will be exactly one minimum that will be the global minimum. At this minimum the gradient (i.e. the multi-dimensional derivative) will be equal to 0, so it's even possible to find the minimum analytically: each dimension of the gradient will be linear (because the original function is quadratic, and a derivative loses one power), so equating it to 0 will produce a system of linear equations that can be solved.

Each dimension k of the gradient (the proper name for it is the "partial derivative of f() by x[k]") will be:

df/dx[k] = sum{1 <= j <= N}( 2 * v[j][k] * (sum{1 <= i <= K}(v[j][i]*x[i]) - b[j] )) 

where j is the index of the  training example, i is the index of the input of the neuron, and k is the index of the gradient's dimension (which has the same range 1 <= k <= K as the index of inputs).

An exact solution of a system of linear equations might be not so bad complexity-wise for a small number of inputs, but can get much worse for a large number of inputs. In that case it can be approximated by a gradient descent method, and with momentum methods it can be not so bad, say 100 steps of descent, each of quadratic time complexity, to get a decent approximation. The trouble though is that this gives the perfect solution for only one neuron of the last layer, and then we'll do the backpropagation to the previous layers, and the inputs v will shift, and we'll need to do the same thing again. And also while we compute it, we'll need to remember the outputs of all the layers for all the examples to do the backpropagation. So when looking at the whole training problem, the complexity balloons very fast.

Instead people have come up with an inexact approximation. They said, let's do one training case at a time, it will give us some gradient step, the next case will give some other gradient step, and after all the cases are done, we'll get somewhere reasonably close to where we'd have been with the perfect solution. And a small difference won't matter much because the inputs v will be shifting on the next step anyway. And we'll need to keep much less state at a time, just the outputs of all the layers for a single case, so every which way it's much cheaper. Because of these arbitrary wanderings in the middle of each training pass, they call it the stochastic gradient descent.

So OK, since we're looking at only one training case now, let's drop the index j, and the functions become:

f(x) = (sum{1 <= i <= K}(v[i]*x[i]) - b)^2 

df/dx[k] = 2 * v[k] * (sum{1 <= i <= K}(v[i]*x[i]) - b)

I personally find the more concrete examples easier to understand and compute, so here is how I came up with the formulas above: I started with an example of a neuron with 3 inputs (i.e. K=3). Then the specific formula will be:

f(x) = (v[1]*x[1] + v[2]*x[2] + v[3]*x[3] - b)^2

To compute the gradient dimension, we'd need to expand this square, but we can note that in taking the derivative each term that doesn't contain x[k] will disappear, so we only need to bother with the terms that contain at least one x[k]. So for example for k=1 we can use the function approximation

f(x) = v[1]^2*x[1]^2 + 2*v[2]*x[2]*v[1]*x[1] + 2*v[3]*x[3]*v[1]*x[1] - 2*b*v[1]*x[1] + (...)

df/dx[1] = 2*v[1]^2*x[1] + 2*v[2]*x[2]*v[1] + 2*v[3]*x[3]*v[1] - 2*b*v[1] =
= 2*v[1] * (v[1]*x[1] + v[2]*x[2] + v[3]*x[3] - b)

And the same symmetrically for the other dimensions of x.
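
Just to double-check this formula, here is a tiny numeric verification of df/dx[1] against a finite difference, with arbitrary example values plugged in (note that the code uses 0-based indexing, so x[0] below is the text's x[1]):

    #include <cstdio>

    int main()
    {
        double v[3] = {0.5, -1.2, 2.0};
        double x[3] = {0.3, 0.7, -0.4};
        double b = 1.1;

        // f as a function of the first weight only, the rest held fixed
        auto f = [&](double x1) {
            double out = v[0]*x1 + v[1]*x[1] + v[2]*x[2];
            return (out - b) * (out - b);
        };

        double out = v[0]*x[0] + v[1]*x[1] + v[2]*x[2];
        double analytic = 2. * v[0] * (out - b);   // the formula from the text

        double h = 1e-6;
        double numeric = (f(x[0] + h) - f(x[0] - h)) / (2. * h); // finite difference

        printf("analytic %f vs numeric %f\n", analytic, numeric);
        return 0;
    }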

Alright, we've got the gradient. We can do one more simplification. Note that the sum inside it is equal to the output of the linear regression:

output = sum{1 <= i <= K}(v[i]*x[i]) 

So we can use it to compute the gradient:

df/dx[k] = 2 * v[k] * (output - b)

After the gradient is computed, applying it is trivial. In the FloatNeuralNet code https://sourceforge.net/p/triceps/code/1707/tree/trunk/cpp/nn/FloatNeuralNet.cpp the gradient values are computed in two steps, first:

    s_cur[i] = 2 * (out - outputs[i]);

then in the next loop s_cur gets moved into s_next and

    Value gradient = input * s_next[i];
    Value newwt = saturate(weights_[weightpos] - rate * gradient);

Another thing to note is that this computation of gradients is valid only for the simple quadratic formula. If we were to insert an activation function after the linear regression, all the math would change, and I guess computing the gradients with the sigmoid included is too much work.

And the third thing to note is that batching as it's normally done is not equivalent to  computing the minimum on the whole training set. Batching combines the training examples, and hence the errors produced on them, linearly. The minimization on the whole set combines the errors quadratically. Although I guess nothing really prevents us from doing a small number of training examples separately and then combining them quadratically. It would be a quadratic batching. But I'm not sure if it's worth the trouble. If nobody does it, it probably isn't.

By now this post is growing too large, so I'll talk about the further backpropagation in the next post.

Wednesday, September 28, 2022

optimization 3 - non-linear programming and gradient descent

Then go the "non-linear programming problems" (one more source of the acronym NLP, and with an even more meaningless use of the word "programming"). Here the non-linear function itself can have multiple minimums and maximums, plus there still is the boundary. The boundary, by the way, can now have non-linear inequalities too, and is the more difficult part of the problem. So the problem adds the part of finding the function extremums and deciding if they are in the right direction and inside the boundaries, while the part of finding the lowest (or highest) point on the boundary still remains similar to the linear programming, only is more difficult now. Now there is no easy assumption that we're always going in the same direction downhill along the boundary, so the solution is to find all the interesting points, compare them, and find the global extremum(s). And people have figured out how to find these interesting points on the boundaries, it's called the "KKT conditions" (KKT are the initials of the three authors). It's essentially looking for where the function is going down (for minimization) when nearing the boundary, and its gradient is exactly perpendicular to the boundary. The "exactly perpendicular" gives the lowest point. Although the corners where multiple boundaries intersect represent the whole range of gradients that lies between the gradients of boundary functions intersecting at this point, and the function matching any value from this range indicates a local minimum. But this all requires analytical computations of gradients and second gradients (AKA Hessians), and then analytically solving systems of inequalities on them, and none of this is easy to do.

So here enter the methods like the gradient descent. The idea there is that if the function has no local minimums, we can find the direction in which the function grows the fastest (i.e. the gradient) and go in the opposite direction to find the global minimum (if there are multiple local minimums, we would find one of them, not necessarily the best one).  

Here an interesting question is, how far should we go on each step? If we go too far, we'll overshoot the minimum. If we don't go far enough, we'll have to make a lot of steps, computing the gradient on every step, which is slow. And here is the most interesting part: even if we take the perfect step that gets us to the minimum in the direction of the gradient (which is possible to compute analytically for the quadratic functions using Newton's method), that generally won't be the global minimum. As it turns out, the gradients in general don't give the direction to the global minimum. The gradient gets us to some "trough" that leads towards the minimum, but usually this trough is curved. And moreover, the local minimum in the direction of the gradient (i.e. if the original function represents a 3-D shape, cutting in the direction of the gradient would give us a 2-D vertical slice) won't even be at the "bottom of the trough". So then if we take the second perfect step after the first perfect step, it will go at a right angle to the first one (there is a reason and a proof to this) and overshoot the "bottom of the trough" again. So the path becomes a zigzag, with smaller and smaller steps as we near the minimum (and never quite reach it).

How about we take the direction from two steps together and continue there? As it turns out, this still won't necessarily get us to the global minimum either, if the trough curves as it usually does. But we'll end up with a longer step that will be followed by a shorter perpendicular step, and so on. We can try to correct by adding up again, which will keep the steps in the continuation direction longer again. This is the idea behind the "momentum gradient descent", where we keep the "momentum of inertia" from the previous steps, sort of imitating a ball rolling down a hilly landscape. It will still be making shorter and shorter meaningful steps as we get closer to the minimum (as the ball slows down when it nears the low spot), but at least it auto-adapts by growing the step size as necessary at the start, building up the longer steps out of the too-short fixed ones, so we can skip the complex computations of the Newton method and just err on the short side when selecting the step size. The momentum methods still work better with a basic step size that is closer to optimal (very short basic steps cause them to "circle the drain" around the optimum a lot), but still the difference is not as tremendous as with the plain gradient descent.
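
A toy comparison of the two, on a deliberately stretched quadratic f(x, y) = x^2 + 10*y^2 standing in for the "curved trough" (the rate and momentum factor are arbitrary; with the small fixed rate the momentum version ends up much closer to the minimum after the same number of steps):

    #include <cstdio>

    int main()
    {
        // f(x, y) = x*x + 10*y*y, gradient = (2x, 20y)
        double rate = 0.01, beta = 0.9;
        double px = 5., py = 1.;                    // plain gradient descent
        double mx = 5., my = 1., vx = 0., vy = 0.;  // momentum ("heavy ball")

        for (int step = 0; step < 50; step++) {
            px -= rate * 2. * px;
            py -= rate * 20. * py;

            vx = beta * vx - rate * 2. * mx;        // keep inertia from previous steps
            vy = beta * vy - rate * 20. * my;
            mx += vx;
            my += vy;
        }
        printf("plain:    f = %g\n", px * px + 10. * py * py);
        printf("momentum: f = %g\n", mx * mx + 10. * my * my);
        return 0;
    }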

optimization 2 - solving the linear programming

The classic way of solving the linear programming problems is the Simplex Algorithm. It basically starts with an arbitrary corner (a tricky part is to find any starting corner) and then goes ("pivots") from there to the next neighboring corner that is lower, and so on, until it finds the lowest corner. Only it's all done as computations on vectors and matrices. It also turns out that as we get more than 2 dimensions to intersect with the plane of the function, there can be multiple corners at the same level that is not optimal, while the whole shape still is convex. So it may take the algorithm some time to sort through these equal corners before finding one that has a connection to a corner going further down, or it may even get stuck going in circles between the same corners (but it will know that it's not at an optimum yet). For some reason they didn't talk about the obvious solution for going in circles: remember which corners have already been tried at the current "plateau" and don't try them a second time, try the different ones (and yes, you can see which corners are part of the plateau, they will have the key metric equal to 0).

And as it turns out, each LP problem has a related "dual" problem that it can be mechanically converted to (and it's symmetric, the second conversion, the "dual of the dual", will give the original "primal" problem). Mathematically, the dual problem is produced essentially by transposing the matrix that represents the set of inequalities, "turning it sideways", so that each variable in the primal problem is converted to an inequality in the dual one, and each inequality in the primal problem to a variable in the dual one. Logically, the relation between the problems is like this: a typical (um, "standard form" if I remember right) LP problem can also be represented as a graph of pipelines (or trade routes) of certain capacity, with the question of how to maximize the flow between two given points by using all the possible intermediate paths to the maximum of capacity - how much do we send through each pipe? The dual problem for it would be: if we want to cut the connection between these two points completely by blocking the pipes with the least possible total capacity, which pipes do we block?

Both sides of the problem are about finding the bottleneck, the primal about using it to the max, the dual about blocking it, so if you find the bottleneck in one way, it gives you the solution for the other way too. And it turns out that the dual version may be easier to solve: the major difficulty for Simplex Algorithm is finding any valid corner as a starting point, and the conversion from primal to dual provides this starting point for the dual problem. Hence the Dual Simplex Algorithm.

Saturday, September 24, 2022

optimization 1 - linear programming

This year my daughter was taking a college class on optimization, so I've had a look too. No, it's not about the software optimization, it's about optimization as a math discipline. And I've learned a good deal of new things that I want to share (and also write down a summary for myself, before I forget too much of it). The reason why you'd want to care about it is that it has an application to the neural networks: the gradient descent algorithm used for their training is one of the optimization methods, and there are interesting consequences to think about. I'll get to that part eventually (it will take multiple posts), but first things first.

They've started with the "linear programming" problems. I've been thinking "oh, I know this stuff, it's all recursion + memoization". But no, it turns out that they use this name for something completely different. In programming we call "linear programming" the discrete problems where we need to find a best-fitting combination of the discrete units. Not much programming but at least some in all that recursive trying. What they do in optimization is they call "linear programming" the analog problems where the expressions are linear.

So you have a number of variables - call them x, y, z, ..., or x_1, x_2, x_3, ..., or a vector x (this subject is rife with vectors and matrices). And you have a function on them that you want to maximize or minimize. A linear function (that's the source of "linear" in "linear programming") would be:

  c_1 * x_1 + c_2 * x_2 + c_3 * x_3 + ...

Well, the solution would be easy, to minimize it, send all the values of x to minus infinity (or if some c < 0 then those x to plus infinity), right? But the problems also come with conditions on x, given as inequalities, which in the "linear programming" are also linear expressions. A simple example would be

  x_1 > 0

or a more complex example

  3 * x_1 - 5 * x_2 < 5

So instead of going to infinity, we'll hit the boundaries of conditions, and the problem is pretty much about which boundaries will be most restrictive.

In visual terms, the conditions form a polyhedral shape (which might be enclosed or open on some side). In a 2-dimensional form (with only x_1 and x_2) we can represent it as a 2-dimensional fence, and then the function will be some inclined plane. If we build a model of this out of plywood and let a ball roll on the plywood sheet that represents the inclined plane of the function, the ball would stop against the fence at the lowest point and solve the minimization problem (BTW, the minimization and maximization problems can be turned into each other by simply changing the signs on everything, including turning around the "less" and "greater" signs on the boundaries). There would always be a corner (or maybe a whole straight line) that is at the lowest level. Or if the fence happens to be open at the lowest point, the ball would roll out to "minus infinity", that's called an "unbounded problem". It could also be that the conditions contradict each other, and when we try to build the fence, we could not do it - that's a "problem with no solution". But since each condition represents an infinitely long straight line, the fenced area cut off by all these lines is guaranteed to be convex (a "convex polyhedron"), with no strangely-shaped nooks and crannies.

In the full multi-dimensional form the conditions represent the planes in that multidimensional space, that cut and split that space, and form this multidimensional "fence". This fence would still be a convex polyhedron, and the solution would be at the point where the plane of the function touches this "fence" at the lowest level along the function's dimension.

For solving this, they have two slightly different "standard" forms of presenting these equations, one is called the "standard form" and another the "canonical form". I guess, two different people approached these problems differently, so now there are forms for both approaches, but the problems can be converted easily between these forms.

Tuesday, July 12, 2022

small servers

 Pretty regularly we (the programmers in general) do make the small servers that sit on some random port and execute commands from the clients. With no security. We rely on the obscurity of the randomly (but not really randomly, since we usually choose it explicitly) chosen port, and usually also on the relative privacy of the development machines for security. And ideally, firewalls that forbid connections to random ports from the outside. But what if the machine is not very private (such as the tool gets run on a production machine), and the firewall doesn't forbid the connection? Someone else might connect. Then of course there is the obscurity of the tool's protocol, but if the tool becomes popular, its protocol becomes widely known. And then if the tool can access files, it easily becomes a target for attacks. So a little obscure tool can outgrow its breeches and become a security hole.

But it seems to be not that hard to fix with a helper library that would check the password on connection. It can automatically generate the password and put it into a file, and a client would read it from the file and supply on connection (like .Xauthority does). Maybe there even is already an existing library that does this?

Saturday, July 9, 2022

constants in Perl

When writing Triceps code, I've been whining that it would be great to have symbolic constants in Perl, which would allow identifying the fields of Triceps records by symbolic names that translate at compile time into integers. Well, it turns out that there are constants in Perl and have been for more than a decade now. They've been there when I started writing Triceps, just I wasn't aware of them. They're used like this:

use constant X => 1;
use constant Y => 2;

$a[X] = $a[Y] + 1;
They're referenced by names without any prefix, and can be scoped in packages and exported from them like everything else. Duh.

Thursday, May 12, 2022

Jerry Baulier

 I don't often look at Linkedin, and even less often look at their feed (it's extremely boring), but I've opened it today and right at the top of the feed there is the news that Jerry Baulier who was the CTO of Aleri has died on April 29th. Apparently he had retired from SAS last Fall.

Sunday, March 6, 2022

first attempts at training FloatNeuralNet

After experimenting with the "intelligent design", I've tried out training the same network from a random state to do the same thing, computing the square function. Yes, it's not a typical usage, but it's a surprising example: I think automatically approximating an analog function is quite impressive.

And it worked. Not every time, depending on the initial random state, but when it worked, it worked quite decently. It usually worked better with more neurons. My guess is that with more neurons there is a higher chance that some of them would form a good state. Some of them end up kind of dead, but if there were more to start with, it's more likely that some useful number of them will still be good. The ReLU function I've used has this nasty habit of having a zero derivative in the lower half, and if a neuron gets there on all inputs, it becomes dead.

Which brought the question: how do we do a better initialization? Yes, I know that bad training from a bad random initialization is normal, but I wonder, could it be done better? I've found the article: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/. Initialization-wise it boils down to initializing the weights between the neurons in the range +-1/sqrt(num_of_inputs), and initializing the bias to 0.1. There is also an idea to treat the ReLU derivative on the negative side as a small non-zero value, but I don't want to touch that yet. The square root probably comes from the variance of the standard distribution. The bias of 0.1 "pulls" the value up, so that even if the weights are mostly or all negative, there is still a chance of the neuron producing a positive output that would pass through ReLU. This did improve things, but only slightly, and still pretty regularly I would get a bad training.
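
Here is my reading of that initialization rule as a sketch (hypothetical code, not the article's):

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // Weights uniform in +-1/sqrt(numInputs), bias weight fixed at 0.1.
    std::vector<double> initNeuron(int numInputs)
    {
        double limit = 1. / std::sqrt((double)numInputs);
        std::vector<double> w(numInputs + 1); // the last element is the bias weight
        for (int i = 0; i < numInputs; i++)
            w[i] = limit * (2. * rand() / (double)RAND_MAX - 1.);
        w[numInputs] = 0.1;
        return w;
    }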

The next thing I've noticed was that things tend to go better if I do the first few rounds (where presenting the whole set of input data is a round) of training at a higher rate. The recommendation from the article on backpropagation was to use the training rate of 1%.  I've found that if I do the first few rounds with the rate of 10%, I get fewer bad trainings. I've eventually settled down on the first 50 rounds out of 1000 using the high rate.

BTW, I've been worrying that when a weight gets initialized with a certain sign, it would always stay with the same sign, because of the ReLU having a zero derivative in the lower part. The experiment has shown that this is not the case, the weights do change signs, and not even that rarely. I still need to look at the formulas to understand what I missed in them.

What next? It would be good to reclaim the dead neurons and put them to good use. Suppose we do a round of training with the whole training set. If a neuron never produces a positive value on the whole training set, it's effectively dead, and can be reinitialized by randomization. So I've done this. It does add some overhead, but most of it happens infrequently, at most once per round. The only overhead to the normal training is incrementing the usage counters. Observing the effects, the weird part is that it's not like "once you get all the neurons into a live state, they stay alive". Yes, the dead neurons tend to be found near the start of training, but not necessarily on the consecutive rounds, and once in a while they pop up out of nowhere on a round like 300 or 500. I've added the reclamation of dead neurons after each round, and this made a big improvement. I was even able to shrink the network to the same size as I used in the "intelligent design" version and get good results most of the time, normally better than from my handmade weights (although not as good as handmade + training). But not all the time.
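
The bookkeeping for this is cheap, something along these lines (a hypothetical sketch, not the actual FloatNeuralNet fields):

    #include <vector>

    // Per-layer counters: bump a neuron's counter whenever its pre-activation
    // comes out positive; after the round, reinitialize any neuron that never did.
    struct DeadNeuronTracker {
        std::vector<long> activations;

        explicit DeadNeuronTracker(int neurons) : activations(neurons, 0) {}

        void record(int neuron, double preActivation) {
            if (preActivation > 0.)
                ++activations[neuron];
        }

        template <typename Reinit>
        void endOfRound(Reinit reinit) { // reinit(n) re-randomizes neuron n's weights
            for (size_t n = 0; n < activations.size(); n++) {
                if (activations[n] == 0)
                    reinit((int)n);
                activations[n] = 0;
            }
        }
    };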

Looking at the weights in the unsuccessful cases, the problem seems to be that it happens when all the neurons in this small model get initialized too similarly to each other. And then they get pushed by the training in the same direction. This seems to be the reason why the larger training rate on the early rounds made an improvement: it helps the neurons diverge. What could I do to make the neurons more different? Well, they got the fixed bias weights of 0.1. How about changing that, since I've got another way of reclaiming the dead neurons now? I've made the bias weights random, and this reduced the frequency of bad cases but didn't eliminate them.

Another thing commonly seen in the bad trainings seems to be that the weights in layer 1 tend to be pushed to a low absolute value. Essentially, the NN tries to approximate the result with almost a constant, and this approximation is not a very good one. This drives the weights in the second layer up to absolute values greater than 1, to compensate for that constness. How about if I don't let the weights get out of the [-1, 1] range, will this push the lower layer to compensate there? It did, the weights there became not so small, but overall the problem of the weights being driven towards 0 is not solved yet in my example. In an article I've read in IEEE Spectrum, they re-train the networks to new input data by reinitializing the neurons with low absolute weights. So maybe the same thing can be done in the normal training.

P.S. Forgot to mention, I've also changed the initialization of weights to be not too close to 0. With the upper limit on the absolute value given by the square root formula, I've also added a lower absolute limit at 0.1 of that.