Sergey Babkin on CEP and stuff: neural

Showing posts with label neural_nets. Show all posts

Sunday, June 1, 2025

momentum stochastic descent

Attending the NeurIPS conference last year gave me the idea that I should combine the momentum descent with stochastic descent (people already do that), and then try my changes on that. There is a bit of a problem with how to compute the gradients without incurring twice the overhead, but I've realized that it can be approximated in an easy way, by averaging the gradients from the steps of stochastic descent. Yes, they would be computed at different points in the vicinity, but perhaps close enough. So I've done a little bit of code refactoring to make it a little more structured, and then implemented the momentum on stochastic descent.

The good news is that the momentum descent works pretty well out of the box, producing about half the error in the same number of steps as the plain stochastic descent.

The bad news is that stopping the momentum in the dimensions where the gradient changes sign doesn't work. Well, it still works better than plain stochastic descent but not as good as plain FISTA + stochastic. How about not stopping but reducing the momentum, multiplying it by some coefficient less than 1? The result is about proportional: 0.1 is close to just stopping, 0.9 is close to plain FISTA, and 0.5 is about half-way. Which I think means that the braking definitely messes things up, and it's a wrong-shaped function for braking this way. There is a huge number of dimensions that experience the gradient sign flipping on my examples, like half of them, and evidently it's completely OK.

But maybe it can still be used to adjust the step size. We'll see. I'll need to analyze and understand more what is going on in there.

Sunday, December 22, 2024

Triceps and realSEUDO at NeurIPS

I've just come back from the NeurIPS 2024 conference. With Triceps listed as my organization. I've done a project with mentoring my family member at JHU, and well, since it's not work-related, using my day job would be wrong, and I'm not at JHU either (and I've paid for attending the conference myself, JHU turned out to be too cheap for that). So Triceps it is, especially that Triceps is used in the project. And that's, by the way, exactly the project I've mentioned before as an application of Triceps, just I couldn't tell the details before it got published. Well, now we've had a poster session at NeurIPS (which people tell me is kind of a big deal, I've just had fun), and the paper is out: "realSEUDO for real-time calcium imaging analysis" at https://neurips.cc/virtual/2024/poster/94683 or https://www.researchgate.net/publication/380895097_realSEUDO_for_real-time_calcium_imaging_analysis or https://arxiv.org/abs/2405.15701. Here is what it does, in simple words.

The calcium imaging is a brain imaging technique, in live (although not exactly unharmed) brain. People have bred a strain of mice whose neurons light up when they activate. So they take a mouse, open up its skull, and stick a camera with a microscope on it. This allows to record a movie of a living brain where you can see what is going on in it literally cell by cell, the brightness of the cells corresponding to their activation level.

Once you get the movie, you need to analyze it: figure out where the cells are, and their activation levels. The "golden standard" of the detection algorithm is CNMF. It solves an optimization problem (in the mathematical sense, i.e. finding the minimum of a function that represents the error), for two mappings at once: what pixels belong to what cell, and what cells activate at what level in each frame (activation profile). More exactly, it deals not with separate pixels but with gaussian kernels. "Kernel" in the math speak is any repeatable element (so for another example, "CUDA kernels" are the CUDA functions that are repeated across the GPU), but in this case it's a sprite that gets stenciled repeatedly, centered on each pixel. "Gaussian" is the shape of the sprite, a 2-dimensional representation of the normal distribution where the brightness corresponds to the weight, so it's an approximation of a circle that is brightest in the center and then smoothly goes to nothing at the edges.

The trouble is that CNMF works only on the whole movie after it has been collected. You can't use it to decode frame-by-frame in real time because it needs to see all the frames at once. So you can't identify cells and then immediately use this knowledge to alter the stimulus and see what effect this has on the brain. There is another algorithm, OnACID that does frame-by-frame decoding using a variation of CNMF logic, except that it requires a starting period of 100 frames or so to get the initial cells, and then the quality of identifying the cells is much worse than with CNMF, and it's still not very fast, substantially slower than the 30 frames per second even on an 80-CPU machine we used.

In the same area but in a little differentt niche, SEUDO is the JHU professor's previous project, a technique to reduce noise in the activation profiles. It can be used together with the other algorithms. Aside from the random noise, there is an inherent cross-talk: the neurons are stacked in multiple layers and can have very long dendrites that can overlap with many other neurons, adding little weak light spots that get registered as weak activations on these other neurons. So the idea there is that to recognize the activations we both use the known cell shapes and also fill the image with gaussian kernels centered on every pixel, and solve the minimization problem on all of them, then throw away the values for gaussian kernels, thus discarding the cross-talk from unknown sources. The little light spots that don't fit well with the known cells get attributed to these gaussian kernels, and so the cell activation profiles become smoother. The problem is that it's slow, and quickly becomes slower as you increase the size of the kernels. If you take a kernel of 30x30 pixels, you get extra 900 coefficients in the quadratic equation you're trying to minimize (and that's for every kernel, which gets centered on every pixel).

The goal was to get this whole thing to work in real time (i.e. at least 30 fps), and we've achieved it. It works in real time (even including SEUDO, which is really an optional extra processing on top), and produces the quality that is generally way better than OnACID, and fairly close to CNMF. Well, arguably, at least sometimes better than CNMF but the problem is that there is no existing accepted way to rate "better or worse" (the unscientific method is by eyeballing the small fragments of the movies), the accepted way of rating is "same or different from CNMF". I'll talk a bit more about it later, in a separate post.

So the work really has 5 parts:

The optimization (in programming sense) and parallelization in C++ of the optimization (in math sense) algorithm
Improvements to the optimization (in the math sense) algorithms
An off-shoot of trying to apply some of these improvements to optimization algorithms, and more, to the neural network training problem (the things I did in Triceps)
Improvements to the SEUDO algorithm
The completely new logic for automatically finding the cells in the movies (that got the name of realSEUDO)

The first part is kind of straightforward (read my book :-) with only few things worth mentioning. First, outdoing Matlab and the TFOCS package in it is not easy, it's already well optimized and parallelized, but doable. There are two keys to it. One is to match what TFOCS does: represent the highly sparse matrix of coefficients by "drawing" the sprites and thus regenerating the matrix on the fly whenever it gets used, otherwise the matrix just won't fit into the memory. The computation of a gradient requires two passes of drawing. TFOCS does it by generating two complementary drawing functions. We use two passes with one drawing function but computing different sums: the first pass goes by columns and computes the intermediate sums (common subexpressions) that get stored, then the second pass goes by rows and computes the final sums from the intermediate ones. The two passes with stored intermediate values reduce the complexity from O(n^3) to O(n^2) compared to doing everything in one pass. The second key was to make the pixel drawing function into a template rather than a function call, because the overhead of a function call in the inner loop has a higher cost than the rest of computation. And another thing that needs mentioning is that to pass data between Matlab and C++ code you have to use Matlab's C API. Their C++ API is extremely slow, adding much more overhead than the computation itself.

But that's still not fast enough. Then it was the turn of improving the optimization (in math sense) algorithms. There are three: FISTA which is a momentum gradient descent on top of ISTA which is the logic for selecting the step size for the simple gradient descent on top of LASSO which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The amount of this bias is adjusted by the parameter lambda (wait, we'll come to its effect yet).

I've started writing up the LASSO-ISTA-FISTA summary but even in a summary form it got quite long. So I'll put it into a separate follow-up post. The summary-of-a-summary is that a weird but serendipitous experiment gave me an idea of treating the gradient dimensions separately, and that led to the idea of the momentum stopping on overshoot by dimension. The momentum gradient descent is kind of like a ball rolling down a well, observed with a stroboscope: each step of a descent is like the passage of time to the next strobe flash. So the next time we see the ball, it might already be past the minimum and the momentum carrying it up the opposite wall. Eventually it will stop, go back, and likely overshoot the minimum again, in the other direction. In the real world the ball eventually stops because of friction that drains its energy. In the momentum descent the friction is simulated by gradually reducing the momentum coefficient (which means growing friction). The trouble is, you don't know in advance, how much friction to put in. With too little friction, the ball will bounce around a lot, with too much friction it will be rolling in molasses. So people do things like resetting the momentum coefficient after some large number of steps. But if we look at each dimension, there is an easy indication of an overshoot: the gradient in that dimension changes sign. So at that time we can just immediately kill the momentum, and we don't even have to kill the whole momentum, we can kill it in just that one dimension. And this worked really well, to the point that the friction became completely unnecessary. The other place where the sudden stopping helps is when the variables try to go out out of range (for example, in our case we have the boundary condition that each dimension is non-negative).

This worked very well for a smooth (well, mostly, except for the creases added by LASSO bias, but then again the creases are at the sign change boundaries, and we always stay on the positive side) quadratic function. Would it work on a much more jagged functions used in the training of neural networks? This is what I've tried in the NN code in Triceps, and as it turns out, the answer is sort of yes, at least in certain limited ways, as I wrote at length before in the other posts, and the fraction of dimensions experiencing the stopping can also be used to auto-adjust the training step size (i.e. the training schedule - the existing algorithms like Adam rely on preset constants, here it adjusts automatically). It's currently been applied to plain gradient descent only, and not beating the stochastic gradient descent in the current form, but on the other hand, it does make the plain gradient descent with momentum not only work at all but work not that much worse than the stochastic descent. Check out the figure with NN training speed comparison in the appendix. Now at NeurIPS I've learned things that gave me more ideas for marrying this logic with the stochastic descent, so there are more things to try in the future.

For SEUDO, the trick has turned out to be the sparsity. It still works decently well when the kernels are placed less frequently, up to about 1/3 of the diameter. So with the sprites 30x30 you get a slow-down by a factor of 900, but place these sprites in a grid with the step of 10 pixels, and you get the factor of 100 back. It would probably be even better to place them on a triangular grid (basically, the same grid but with offset on alternating rows, and different horizontal and vertical step sizes to form the equilateral triangles of neighboring centers), to keep the distances between the centers even in every direction, but we haven't got around to do that (yet?).

And the way to make the detection of the cell images in the video (that's what realSEUDO is, for which incidentally the plain SEUDO is optional) fast was to do it with the old-fashioned image processing. It goes in two stages (well, and a noise reduction pre-stage, and it also expects that the image is already stabilized): the first stage finds the contiguous spots above the background brightness, then adds time as the third dimension and builds the sausage-like shapes of the cell activations. They are pretty jagged at the ends, as the cells start to light up in little spots, then these spots gradually expand and merge, and going dark again looks similar but in the opposite direction. The sausage gets eventually squashed along the time axis to build the reference cell image. The trouble is, there are multiple layers of cells in the brain, with overlapping cells sometimes activating at the same time. So the second stage is used to figure out these overlaps between the cells: when do we see the same cell but at a different brightness level, when do we see a completely different overlapping cell, and when both (or more) overlapping cells are lighting up at the same time. So it merges and splits the cell images accordingly, and generates the events telling what it's doing. At the same time it tries to fit the known cells into the current frame and generates the events for the cell activations (Triceps is used for reporting the events as a more efficient way to collect or forward them than Matlab). The splitting/merging code is very important but purely empirical, with a few important tunable coefficients chosen by trial and error at the moment. It's a tour-de-force of empiric engineering beating science. But perhaps some merging of both approaches is possible, maybe using the detected shapes as a starting point for CNMF-like further optimization. If the starting point is already very close to the optimum, the short path to the optimum would be fast (as we see in reusing the detected activation coefficients from the previous frame as the starting point for the current frame).

And there are other little things, like partitioning a large image into segments and processing them in parallel, and then splicing the cell shapes and activation profiles back together (this is not a new idea, CNMF does it too). Since all the high-level logic has been done in Matlab, the parallelism is a bit weirdly shaped by Matlab's abilities: there is parallelism on top with the partitioning by segments and on the bottom with the C++ code running FISTA but there is a good deal of parallelism opportunities in the middle of the splitting/merging logic that aren't used. The trouble is basically that Matlab doesn't have shared read-only data, it starts each parallelization by an expensive copying of the whole relevant state, and then ends by copying back the result. The other reason to use the segments comes from the overhead on the cell shapes: the way Matlab works, the shapes have to be represented as image-sized matrices, quadratically increasing the overhead of all the operations as the image grows. The sparse matrices (as the C++ part of the code already does) would fix that, there is such a recent new feature in Matlab, or obviously if moving the code from Matlab to another language. BTW, these are much simpler sparse matrices than ones produced in FISTA equations, here we know that all the non-0 values are located in a small bounding box, so they're trivial to implement.

There are also things to be said about the evaluation of detected profiles (which also comes up in splicing the segments), but it's a largish subject for another post.

P.S. The code used to produce the performance graphs in the appendix can be found in https://sourceforge.net/p/triceps/code/HEAD/tree/trunk/cpp/nn/demo/seudo-paper/. The data was pulled from the logs with scripts and then the graphs were built in Matlab, but those scripts will need to be changed from Matlab to something more easily available before publishing them.

P.P.S. Some graphs from the appendix, training of a neural network with the methods implemented in Triceps:

P.P.P.S. No, I don't work at Microsoft Research. This attribution is a screw-up made by Adam Charles, which I'm working on straightening.

Tuesday, October 17, 2023

comparing the FloatNeuralNet options

I've done a plot of the errors with multiple variations of the FloatNeuralNet options, using the Leaky ReLU as the least common denominator. I've added some too: for example, I've never tried before to do the momentum method without my improvements, so I've done a custom build without them. The results are interesting, I should use more graphs in the future (and add a proper graph generation rather than scraping the debugging log).

First of all, none of non-stochastic implementations have done as well as the stochastic one. Well, they do better for the same training rate, but the catch is that the stochastic approach is stable at a much higher training rate. It works fine with a 100 times higher training rate for the same initial seed, and starts getting rough but still even faster at a 1000 times higher training rate.

The second observation is that without my changes the momentum method doesn't work at all. At all. It just breaks down right away.

The third observation is that my auto-adjustment of the training rate starts quite fast, somewhere between stochastic method's 100x and 1000x training rates but then flattens out and falls behind. So even as-is it's probably a valuable technique for quick short training.

As for why it flattens, my hypothesis is this: Imagining the optimization as a ball rolling down the hill along with the gradient, killing the momentum in the dimensions where the gradient keeps changing sign prevents the ball from wobbling in the trough and jumping out of it. It works essentially as a shock absorber. But what if the trough doesn't go predominantly in one dimension? What if it's going at a diagonal in multiple dimensions? Then the damping that prevents wobbling will also be slowing down the ball. This is I think what happens when the optimization slows down. Fortunately, the mathematicians have a way to find the dominant direction of the trough and rotate to it, the Principal Component Analysis does that. Then, I guess, the same momentum method can be applied to the rotated dimensions, and then rotated back. Unfortunately, this whole procedure is probably quite expensive. And as the ball rolls, the rotation will change, so the post-rotation dimensions will change, and it's not clear, how to match them between iterations. Maybe the rotations should be recomputed only infrequently, when the progress slows down: this will both amortize their costs and keep stability between iterations.

Saturday, February 18, 2023

arrested bouncing in FloatNeuralNet

It's been a while since I wrote any updates about my experiments with optimization of neural networks. Partially, it's because I've been busy with other things, and partially because I've tried something a few weeks ago, it didn't work out well, and I wasn't sure what to do with the results. On one hand, it didn't work out, so carrying this complexity into all the future code would be annoying, and even gating this code behind an option would create unpleasant effects on the performance. On the other hand, I think these experiments were really interesting, and maybe I'd want to do another variation of them in the future, so I didn't want to just discard the code (I've already discarded some code in the past only to re-create it later). So I've been stuck thinking what to do with it, and I couldn't do the other experiments in the meantime.

I think I have a good solution for this now: the "mini-branches". I've copied the original and modified code, and the description into cpp/nn/xperim, one subdirectory per "mini-branch", where it can be found, and diff-ed, and reapplied if needed.

The mini-branch for those last experiments is https://sourceforge.net/p/triceps/code/1803/tree/trunk/cpp/nn/xperim/001802-bounce/ (1802 is the root revision for this mini-branch in SVN on SourceForge).

Both of these experiments center around the momentum descent. To recap on the previous state, I've been using the momentum descent loosely based on FISTA, with some extensions. One of these extensions is that whenever the gradient changes sign in some dimension, I kill the momentum in this dimension. Another is that I use these gradient sign changes as an indicator to auto-detect the safe basic gradient descent rate: I keep growing this rate until many (for some definition of "many" that I've been trying to tune with various experiments) dimensions start flipping the gradient signs after each step, then I back it off, then continue growing again.

The premise of the first experiment started with me noticing that I've unintentionally changed some logic from FISTA. FISTA has an interesting computation for the coefficient "nu" that determines how much of the momentum is used in each step. It starts with not having any momentum to apply on the first step, then on the second step it skips the momentum too, then it bumps nu to 1, and then gradually reduces it to "apply the brakes" and avoid circling too much around the optimum. But killing the momentum on the gradient sign change serves as an alternative application of the brakes.So I've removed that gradual reduction of nu, and things worked better.

But I've noticed that I didn't handle the second step right. It wasn't keeping the momentum at 0. My guess has been that this was done to avoid including the large first step in random direction from the random initial point into the momentum. So I've set to fix it. But it made things worse, the error reduction had slowed down. Evidently, the first step is not so random after all. Or maybe my momentum-killing code is very good at eliminating the effects from the truly random first steps, and the remaining ones are useful.

The premise of the second experiment was to detect the situation when some dimension has the sign of its gradient bouncing back and forth on each step without reducing much in absolute value. This is what happens when the model starts tearing itself apart due to the descent step being too large: it keeps overshooting the optimum by a larger and larger value. It could also happen because of the momentum, but the momentum already gets reset to zero in my code when the gradient sign changes, so when we detect a gradient dimension change signs twice in a row, and get to more than 0.75 of the original value, that's because of the large descent rate. So how about we make only half the originally intended step in this dimension? It should land us much closer to the optimum. That's what I've done in this experiment.

And it didn't work out either, the convergence became slower. The potential explanation for this effect is the following: Consider a trough with a ball rolling down it. The trough is not exactly straight, and the ball isn't started exactly straight, so as it rolls down the trough, it keeps wobbling in the trough left and right, overshooting the center one way then the other way. As long as the ball doesn't go into resonance and starts wobbling more and more (eventually falling out of the trough), this wobble is actually not such a big deal. But when I dampen the wobble, it also dampens the speed of rolling down the trough, and since the momentum also gets killed on every wobble, the whole descent slows down.

So it's something to think about for the future. It probably means that the speed could be improved by not killing the momentum on every change of the gradient sign, but only when the absolute change of the gradient becomes too large. But I'm not sure yet, how can this "too large" be defined reliably.

It likely also means that going to the individual descent rate coefficients for each dimension (something that I contemplated from both my own experiments and from reading the papers) might not work so well, because it's likely to cause the same kind of slow-down. We'll see, I'm now unblocked to try more experiments.

not MNIST

I've been reading a little more about what other people do with the MNIST recognition. Apparently, doing the transformations on the training set is fairly common, so maybe a simple linear stretching and shrinking to multiply the number of examples would be good enough.

But the most interesting discovery has been that the dataset I've been playing with is not MNIST. Surprise, surprise. :-)

The MNIST data set is much larger, and also has a higher resolution. It might be the older NIST dataset (without the "M" that stands for "Modified"), or maybe not even that. For some reason I've thought that the MNIST set comes from the ZIP codes recognition, so that's what I was looking for, the ZIP codes dataset, and apparently I've thought wrong. An interesting thing about the original NIST dataset is that it had the training set and the test set collected from different demographies (one from schoolchildren, another from employees of a government agency), so I guess if this is what I've got, it would explain why the sets don't represent each other so well. That was apparently a known major complaint with the older dataset that got straightened in the new modified one.

Sunday, January 15, 2023

trapeze to btimap for MNIST

Continuing with the trapeze-based implementation of the handwritten images that I've described in the previous post, next I've tried to do a reverse transformation: from the set of trapezes to a bitmap that tries to preserve the significant features detected by the trapezes. And if you've read the previous post, you've probably can guess the result: it became a little worse yet!

Looking at the images that got misrecognized, I think I understand why. Think for example of a "4" and "9". "4" is often drawn with an open top. But when we draw a "9", there is also sometimes a small opening in the top right corner. When I do the detection of trapezes, this little opening becomes a significant feature, and when I convert these trapezes back to a bitmap, this little opening becomes a larg-ish opening, so now "9" became more like "4"! I guess, with a large and representative enough training set, they could be differentiated well. But not if all the "9"s with an opening are in the test set (and yes, if I fold the test set into the training set, it gets recognized quite well).

While looking at it all, I've also noticed a few more interesting things. I've turned the classifier mode on again (and improved it a bit to make the gradients more stable), and noticed that it drives the error on the training set down a lot. The classifier mode tries to make sure that the training cases that produce the wrong result get more attention by multiplying their gradients, which is essentially equivalent to adding 99 more copies of the same training case. Turn the classifier mode on, and the training error goes down from 0.06 to 0.025 in almost no time. Which means that a lot of this error is caused by the few outliers. And maybe another way to fix it would be to take a larger power in the loss function, say the 4th power instead of square. But the error on the test set doesn't budge much, so maybe this is a misleading measure that doesn't matter much and can cause overfitting.

I've found a bug in my code that computed, how many cases got recognized with a very low confidence: if even with the right outcome the absolute value for the highest outcome was still below 0. Due to this bug, all the cases that were labeled with "0" were counted as low-confidence. So this measure wasn't 17.5% on the training set, it was more like 1%, and in the classifier mode goes to near 0. But on the MNIST test set, the total of misrecognized and low-confidence cases is still near 20%.

The auto-adjustment of the descent rate still works well, and I've bumped it up to be more aggressive: changed the step up from 1.1 to 1.2, and the step down from 0.1 to 0.2. Now I think it exhibits the behavior that I've seen with a higher manually-set rate, where once in a while the error would bump up a little, explore the surroundings, and then quickly drop down below the previous low. And the attempts to start the destructive resonance are still well-arrested.

I've added a printout of gradients by layer, and they do tend to vary in the interesting way, kind of "sloshing about". At the start of the training the high layers usually show the high gradients, and the low layers show the high gradients. But this gradually reverses, and after a few thousand training passes you get the high gradients in the low layers, low gradients in the high layers. Unless you then go and change the some training criteria, then the high layers get the higher gradients again until things settle. Which probably means that selecting separate training rates by layer might be worth a try. Or maybe even selecting them separately for each weight, like the Adam algorithm that I've mentioned before does (I haven't tried this specific algorithm yet).

Sunday, January 8, 2023

trapeze data representation for MNIST

As I've told before, the trouble with MNIST data set is that the test set contains the cases that are substantially different from any case in the training set. So doing well on it requires some kind of generalization on the training set to be able to correctly recognize the test cases. One idea I've had is to parse the images into the logical parts that look like trapezes.

For example, consider an image of the digit 0. It can be broken somewhat like this:

_   /===\   _
   //   \\
_ //     \\ _
  \\     //
_  \\   //  _
    \===/

Here going from the top we have a horizontal bar that can be seen as a horizontal band containing trapeze of whitespace, followed by a drawn trapeze, followed by another whitespace trapeze. I've separated it by the underbar marks on the sides. Then the sides go expanding: there is still whitespace on the left, then goes the bar expanding to the left, an expanding trapeze of whitespace in the middle, the bar expanding to the right, and another whitespace on the right. Then it goes symmetrically shrinking on the bottom side, and completes with another horizontal bar. There could also be a vertical part on the sides in the middle. Once we recognize the break-down like this, it doesn't matter any more, what exact size is the symbol 0, it's still the same symbol. It gets harder for the many handwritten variations but maybe we get enough samples to build a recognition of all the major ones. So I've decided to try this path wit MNIST.

Recognizing the trapezes in a strict black-and-white image is much easier than with the halftones. So the first thing I've doe was to generate a B&W image by filtering on a fixed level. I've done this, and since I've had it, I've tried to train a model on it. And it did a little worse than on the halftone image.

Well, the next simple approximation is to encode the rows of pixels as run-length. So I've done it, and I've tried to train a model, and guess what, it has done a little worse yet. Which is a bad sign. Part of the problem is that each row becomes of a different size in runs, and to keep the fixed size, the unused runs at the end have to be filled with something. Originally I've filled them with 0s, and then I've tried to fill them with -1. Which made thins a little better but still a little worse than plain B&W image.

At this point the indications were kind of bad but if I didn't try the trapezes, I wouldn't know, right? So I've tried. I've encoded each trapeze with 3 numbers:

(1) The average slope coefficient of its left side, computed as dx/dy (since a vertical line is possible and a horizontal line is not). Which is also happens to be the slope of the right side of the previous trapeze. It's average because in reality the pixels represent the curved lines, and I just approximate them with the straight trapezes.

(2) The width of the top of the trapeze

(3) The width of the bottom of the trapeze.

To get the ordering of whitespaces and drawings right, I always start each row of trapezes and also end it with a whitespace, and whatever gets left unused gets filled with -1 for widths and 0 for slope.

There is also one exception: since the leftmost whitespace trapeze always has its left slope vertical (i.e. 0), I've reused that value to put the row height into it.

And the result? By now you've probably guessed it, it was a little worse yet (and a lot slower to compute each pass because of the increased input size) than the run-length version. So this representation has turned out to be not very good for generalization, although not terribly bad either. Maybe it can be twisted in some way to make it better. Maybe put it back into a bitmap by treating each trapeze as one pixel, but by now I'm rather pessimistic on these prospects :-)

While experimenting with it, I've tuned the autoRate2 coefficients to make them a little more aggressive, and they had proved themselves quite well. To check how my auto-scaled momentum measures against the classic stochastic descent, I've tried that one too, and the stochastic version did noticeably worse. I've also tried going to Leaky RELU activation, and that one did a little worse yet. So I think at least the descent part and my Corner activation function are working decently well.

Thursday, January 5, 2023

unmelting a neural network

I've got wondering, could a neural network that experienced a "meltdown" from a too-high training rate be restored? I've got such an example of MNIST and dumped out its contents. What happened inside is that pretty much all the weights got pushed to the values of 1 and -1, the neurons becoming very much the same. So when trying to train it again, this lack of, pardon my French, diversity, wouldn't let the optimization to progress far: it just optimizes this one neuron state per layer in many copies.

For some time I've had an idea of how this problem could be resolved: by a partial randomization of gradients. So I've done this with a simple saw-toothed "randomization" where each next weight gets its gradient reduced slightly more up to a limit, then it drops back down and the next "sawtooth" starts. The starting position of the teeth gets shifted by one on each training pass.

I've started by "tweaking" in this way 1% of the gradient size (I've named this option of FloatNeuralNet tweakRate_), and combined it with options momentum_ and autoRate2_. It did work some but barely made a scratch. OK, so I've bumped it up to 30%. It did work much better but still not quite great. Over about 300K training passes it got to the mean-square error of about 0.15 (by comparison, the normal training from a random point gets to about 0.05 in 10K steps) and would not budge much any more. The verification on the test cases was much closer: mean-square error of about 0.24 instead of 0.19 and the error rate of about 7% instead of 5.5%. So it might be not that bad after all. The combination of the new auto-rate and momentum descent worked great at preventing another meltdown. Interestingly, at the start all the gradient was concentrated in the last layer of 3, and then it gradually shifted towards the first layer.

Then I've tried the same tweak rate of 0.3 for the training from a random initial state, and it didn't have any detrimental effect at all, even did slightly better than without it. So it should be safe to use in general case as a cheap preventative measure.

This also gave me an opportunity to look more into the tuning of the auto-rate algorithm, which I've made a bit better, and also look into what gradients are where. As it turns out, the highest gradient dimensions by far are at the weights that have reached the [-1, +1] boundary, and they skew the norm2 and mean-square of the gradient a lot. When I've changed the code to mark such gradients post-factum as 0, that gave me an opportunity to count them separately and to exclude them from the means. Their number grows early in the draining and then gradually reduces (but not to 0).

How about if we raise the boundaries, this should reduce the number of the dimensions hitting them and make the training faster, right? And it also would be a good test of the new auto-rate algorithm, since as I've shown before the weights over 1 are much more susceptible to meltdowns. I've tried the boundaries of 10 and 100. The auto-rate worked great but the training got slower. For all I can tell, the higher weights trigger more often the situations where the auto-rate algorithm drops the training rate down, and the rate tends to be 10 to 100 times lower.

But the bad news for the auto-rate logic is that manually picking a just-high-enough training rate still ultimately produces a slightly better result. The auto-rate algorithm starts with a similar rate but then gradually drops it by about 3 orders of magnitude. And as I've been watching the mean-square errors pass by pass, I could see that they showed differently: the fixed-rate algorithm would periodically have the error grow and maybe even stay up a while but then drop lower than where it was before, the auto-rate algorithm tends to just chisel away at the error rate little by little, it still has the error grow a little periodically but squashes it very fast. So perhaps the conservatively low rate gets the function trapped in some local minimum, while the fixed rate breaks out of them (when it doesn't lead to a meltdown). If I let the auto-rate algorithm grow the rate more, and then drop when it gets out of control, it actually does worse. But maybe some better adaptivity could be devised.

And/or maybe bring the stochastic descent back into the mix. I've been computing the full gradient because this way any kinds of postprocessing represent a relatively low overhead done once per training pass, doing the same after each training case would be very slow. But it's much more resistant to the too-high descent rate, and should be able to shake out of the local minimums better. So maybe they can be combined, doing a few passes stochastic then a few passes deterministic, and so on, with the rate computed at the deterministic passes fed to control the stochastic passes.

Monday, January 2, 2023

optimization 13 - using the gradient sign changes

When I've previously experimented with FISTA momentum descent, one thing that worked quite well was detecting when some dimension of the gradient changes sign, and then resetting the momentum in this dimension to 0.

One of the typical problems of the momentum methods is that they tend to "circle the drain" around he minimum. Think of one of those coin collection bins for charity where you get a coin rolling down the trough, and then it circles the "gravity well" of the bin for quite a while before losing momentum and descending into the center hole. This happens because the speed (momentum) of the coin is initially directed at an angle from the minimum (the center hole). And the same happens with the momentum descent in optimization, the momentum usually develops at an angle, and just as the current point gets it close to the minimum, the momentum carries it by and away to the other side of the "well", where it will eventually reverse direction and come back, hopefully this time closer to the minimum. But there is a clear indication that we're getting carried past the minimum: the sign of the gradient in the direction where we're getting carried past changes. So if we kill the momentum in this direction at this point, we won't get carried past. It helps a good deal with the quicker convergence.

An overshot is not the only reason why a gradient's dimension might be changing sign. Our "virtual coin" might be rolling down a muti-dimensional trough, oscillating a little left and right in this trough. But there killing the momentum wouldn't hurt either, it would just dampen these oscillations, which is also a good thing.

Which gave me idea that once we have this, there is no point to the gradual reduction of momentum that is embedded into FISTA through the parameter t. If the momentum reduction on overshot gets taken care off as described above, there is no point in shrinking the momentum otherwise.

And then I've thought of applying the same logic to estimating the training rate. I've been thinking about the Adam method that I've linked to in a recent post, and it doesn't really solve the issue with the training rate. It adjusts the rate between the different dimensions but it still has in it a constant that essentially represents the global training rate. In their example they had just come up with this constant empirically, but it would be different for different problems, and how do we find it? It's the same problem as finding the simple raining rate. The sign change detection to the rescue.

After all, what happens when the too-high rate starts tearing the model apart? The rate that is too high causes an algorithm step to overshoot the minimum. And not just overshoot but overshoot it so much that the gradient grows. So on the next step it overshoots back even more, and the gradient grows again, and so on, and so on, getting farther from the minimum on each step. And the momentum methods tend to exacerbate this problem.

So if we detect the dimensions that change sign, and see if the gradient in them grows, and by how much, and how it compares with the gradient in the dimensions that didn't change sign, we'd be able to detect the starting resonance and dampen it by reducing the training rate. I've tried it, and it works very well, check out https://sourceforge.net/p/triceps/code/1792/tree/, it's the FloatNeuralNet option autoRate2_. So far the tunables for it are hardcoded, and I think that I've set them a bit too conservatively, but all together it works very well, producing a little faster training than I've seen before, and without the meltdowns.

Another thing I've changed in the current version is the logic that pushes the rarely-seen unusual cases to be boosted for a better recognition. It previously didn't work well with the momentum methods, because it was changing the direction of the gradient drastically between the passes. I've changed it to make the boosting more persistent between the passes, and instead of shrinking the gradients of the correct cases, to gradually grow the representation of the incorrect cases, expanding their gradients. It's still a work in progress but looks promising.

Oh, and BTW, one thing that didn't work out was the attempt to boost similarly the cases that give the correct answer by having the highest output point to the right digit but do this at a very low confidence, so that even the best output is below 0 (and sometimes substantially below 0, something like -0.95). All I could do was shrink the percentage of such cases slightly, from about 17.5% to about 16.5%. And I'm not sure what can be done about it. I guess it's just another manifestation of a great variability of handwritten characters. Maybe it could be solved by vastly growing the model size and the training set, but even if it could, it would be nice to find some smarter way. Perhaps a better topological representation of the digits instead of a plain bitmap would do the trick but I don't know how to do it. One of the theories I've had was that it's caused by a natural trend towards negative numbers, because in each training case we have one output with 1 and nine outputs with -1. So it we changed the negatives to say -0.1, that would pull the numbers higher. But that's not a solution either, it just moves the average up, diluting strength of the negatives.

a quick test of theory about MNIST

I've had this theory that the test set in MNIST just contains the digit images that are substantially different from the training set, and this is the reason for why they're not recognized well. I've come up with a quick way to test this theory: just merge the test set into the training set, and see if it make any difference.

It does. When the test set is included into training, it gets recognized very well. There still usually are a couple of images that have issues but that's down from about a hundred. I think it tells us in two ways that the test set contains different images:

(1) If they start getting recognized when trained on them, this means that the training set just doesn't train for the right thing.

(2) When the NN has a hard time even training on some images, this means that the abilities of the NN are getting stretched, that it doesn't have enough brain power to cope with all the different possibilities.

Then I've tried adding some brain-power to the NN. First I went back to the original 16x16 images instead of the shrunk 8x8 but kept the width of the NN layers the same. This grew the cost of the first layer 4-fold but the others stayed the same, so the code didn't slow down so much. This did help some, both with the original and with the mixed training set. But not spectacularly. Then I've also expanded the first layer of neurons 4-fold. This made things slower yet, and provided another improvement, but again nothing spectacular.

I think there are just too many ways to draw the digits - slant one or another stroke a little more or a little less, or use a thicker pen, and the digit suddenly has a stroke where it used to have a hole and the other way around, and the NN gets confused. Also, some digits turn out to be unexpectedly similar, such as 6 and 2. When we draw them by hand, they grow loops around the corners, and more loops where the pen doesn't quite lift between the digits. So both 6 and 2 and up looking as a vertical loop with an opening that's connected to a horizontal semi-loop. The only difference is in the direction: 6 has the vertical loop opening on the right and the horizontal loop opening on the left, while 2 has the vertical loop opening on the left and horizontal loop opening on the right. And the way the strokes shift around, it's easy for the NN to get confused.

I don't know how do they do the handwriting recognition in the reality. My guess is that some preprocessing of the images that extracts the topology of the strokes into a more explicit representation should help a lot. I guess it's also possible to generate more of the sample images for the training set by stretching the existing ones in different ways but that sounds like a dead end. A variation of it might be to use the batching again, but this time include only the images of the same digit into a batch.

Sunday, December 25, 2022

optimization 12 - adaptable methods of neural network training

In the last post I've been wondering, why aren't the momentum descent methods widely used for the neural network training. And guess what, a quick search has showed that they are, as well as the auto-scaling methods.

In particular, here is an interesting paper:

https://arxiv.org/pdf/1412.6980.pdf

Adam: A Method for Stochastic Optimization

What they do looks actually more like a momentum method than selecting the rate of descent. This paper has references to a few other methods:

AdaGrad
RMSProp
vSGD
AdaDelta

And I've found this algorithm through another web site that talks about various ML methods:

https://machinelearningmastery.com/adam-optimization-from-scratch/

Thursday, December 22, 2022

optimization 11 - FISTA momentum descent in FloatNeuralNet

Continuing from the previous post, I've had a look at the characters in the test set that have trouble getting decoded. Well, they truly are handwritten, I've had difficulties recognizing some of them as digits at all. But a good deal of them are rather similar variations on the digit "6" that I guess are just not represented well in the training set. For many of these digits the even highest recognized class has the value deep in the negative zone, which means that they're not like anything in training. I think there should be some way to adapt the training set, such as expand it by including the digits shifted left and right by a pixel or two. Need some more thinking on that.

I've looked at the training set and counted how many training cases have their best guess negative (even if it's the correct one), and it came out at about 16.5%. I've tried to shift the training weights to prioritize cases with the low best guesses just as I've done with the wrong guesses, and it didn't help at all, the best I could scrape out was about an extra 0.2%. Maybe it's just because for every 1 positive output there are 9 negative outputs, and on average the values tend to be pulled down.

In the meantime, I've added the momentum descent, based on the FISTA algorithm. I've learned some tricks playing with my version of FISTA that I plan to add later, but so far just a very basic version, with only one trick included: when applying the momentum, compute the saturation by the limits, and adjust the momentum for it. So that if some dimension hits the limit, the momentum won't continue pushing it past the limit again and again but will reset on that dimension.

I've been trying to get the automatic descent rate deduction working properly before implementing the momentum but I gave up on that for now and decided to just try the momentum. It is a bit sensitive to the training rate being too high, maybe because there is a higher chance of driving into a situation where a high rate would give an overshoot. But bump the rate down, and the momentum descent works very, very, very well, doing a much faster progress. The mean square error for the training set went down from the previous-best 0.025 to 0.01. The bad news is that it didn't help with the test set at all: it again goes better up to a limit, and after that as the training set's stats get better, test set's stats get worse. So I think I really need to accept that the test set really truly contains the characters that are not well represented in the training set. And probably the convolution won't fix it either.

Given that the momentum descent works so awesomely, I don't understand why isn't everyone using it. It adds cost per pass but then it reduces more the number of passes needed to achieve the same precision, giving a large net win. Or maybe they do use it and don't tell everyone. Or maybe it doesn't work well for all the neural networks and I've just got lucky with my example. I don't know.

The way FISTA works is basically by remembering the last step and repeating it before computing the gradient at the new position and doing a new step. Then remembering the total step (repeat + new step) for the next step. It also has a little optimization in scaling the remembered last step by a factor a little under 1, and very gradually reducing this factor. This downscaling is intended to reduce the overshoots and "circling the drain" when the position gets close to the optimum. It also limits the maximum momentum, when the downscaling starts shrinking the momentum more than the current gradient accelerates it. Another little thing it does is that it skips the repeat not only on the first step (where it just has nothing to repeat) but also on the second step. For all I can tell, the reason for this skipping of momentum on the second step is that the initial point is usually random, and if the gradient approximation is good, it would put the point after the first step into a decent vicinity of the optimum. Which means that the first step will be large, and also stepping in a random direction that is nowhere close to the direction of the following steps. Thus it's better to not include this first large random step into the momentum that will keep repeating.

The mechanics of enabling the momentum mode in FloatNeuralNet are: set the option

options.momentum_ = true

And now there also are methods for reading the best-result-still-below-0 stats:

getPassAbove() returns the number of cases in the last pass where the correct and best outcome is above 0

getPassNotAbove() returns the number of cases in the last pass where the best outcome is either incorrect or not above 0

Tuesday, December 20, 2022

FloatNeuralNet classifier training mode

Continuing the story from the previous post, I've been experimenting with the MNIST classifier of handwritten digits from ZIP codes some more, and I've tried to improve on it by adding a special mode for gradient computation:

The point of a classifier is that it produces a result with the highest value at the output that corresponds to the correct classification, and the exact values on the outputs don't mater that much. So when training, it makes sense to check whether a particular training case already produces the correct result, and if so then "dampen" it to give the training cases that are still producing incorrect results a greater chance to shift the weights for their benefit. I've done this "dampening" by multiplying the gradients of the correct cases by a value less than 1. I've started with this value set at 1/1000 but that tended to produce rather wild swings where a small number of incorrect cases would pull the weights towards themselves only to make a large number of formerly correct cases incorrect. I've reduced it to 1/100 and that worked out quite well. I'm not sure if the selection of this multiplier depends on the number of training cases, FWIW, the MNIST set has a little over 7000 training cases.

And to make sure that the training doesn't get too slow, I've added the auto-rescaling of the training rate, so that the potential sum of the gradients would total up the same. This made it a bit more touchy, since there is a greater chance of the gradient getting dominated by only one training case, and a smaller base rate (I've reduced it by a factor of 3) was needed to avoid the meltdown into overshoots. Maybe I'll embed this safety factor into the auto-rescaling at some point, I'm just not sure yet how to pick it for the general case.

The end result is a bit mixed. It worked very, very well for the training set, bringing the error rate (i.e. the number of cases that produce a wrong value) to zero. And the mean square error went the same or slightly better as with the plain implementation, meaning that not only the previously neglected cases are getting attention but also they get incorporated better into the common model and don't hurt the others. The mean square error on the test set also went slightly better, the best I've seen so far at 0.193 (an improvement of almost 0.01 over the plain version). But the error rate on the test set, which I hoped to improve, still stayed at around 5.5%, not really an improvement there. I've seen it go down to 5% and then grow again on the following rounds of training, while the mean square error became better.

So for all I can tell, the test set really does contain some characters that are substantially unlike any of the characters in the training set. And probably the only way to do better is to actually look at them and figure out what is different about them and then manipulate the training set to introduce variability by generating multiple slightly different versions of each character. I've kind of got interested in this puzzle, so maybe I'll find time to try this. Or maybe I'll try to implement the convolution first but by now I doubt that it would help much with this specific problem on its own. Maybe a general way to approach this variability would be to do some unsupervised classification of the images of the same digit, to produce the groups of distinct ways to write it. But that looks difficult.

Another simple thing to try would be to make the classifier preference even stronger, requiring that not only the correct output is the highest but also that it beats the next highest output by a certain margin. And maybe even vary the gradient weight in a reverse proportion to this margin. But I doubt that it would solve the problem with the stubborn 5.5% of the test set.

Yet another interesting development is that the CORNER activation seems to work quite well. I've added the third layer with it, and it produced better results. My previous impression that the deeper networks train slower was based on a limited number of training passes, the larger models are slower there, but as the pass count goes into multiple thousands, more neurons do help to squeeze out a better quality. Of course, each pass is slower with more neurons, but as far as the progress per pass is concerned, the larger networks do better on the later rounds.

Where the new feature is in the code:

https://sourceforge.net/p/triceps/code/1779/tree/

Setting the FloatNeuralnet option isClassifier_ to true enables this new classifier training mode. The option correctMultiplier_ defines the penalty gradient multiplier for the cases that are already correct (the default of 0.001 there is a bit too much of a penalty, 0.01 looks better). And since the correct and incorrect cases are counted for re-scaling the training rate appropriately, these counters are now also available externally too via methods:

size_t getPassCorrect() const;

size_t getPassIncorrect() const;

This allows to compute the error rate easily from the results of a training pass. They get computed only if the classifier training mode is enabled, and they show the error rate as it was before the last training pass, it will change after applying the gradient. The classifier mode can be used with both the piecemeal steps of applying the training rate in train() method and with applyGradient(), but only applyGradient() is able to do the automatic scaling of the training rate to compensate for the shrinkage.

Sunday, December 18, 2022

FloatNeuralNet on the MNIST set, and checkpointing

Since I've been experimenting with neural nets on an "unnatural" example, I've been wondering, how would the things I've tried work on a more natural one. So I've made a demo for the MNIST example of recognizing the handwritten digits from ZIP codes. It's implemented in the branch https://sourceforge.net/p/triceps/code/1776/tree/ , in the file nn/demo/d_FnnZip.cpp.

The code looks for the MNIST files that can be downloaded from https://hastie.su.domains/StatLearnSparsity_files/DATA/zipcode.html placed in any ancestor directory's subdirectory "zipnn" (so I usually put it next to Triceps's top directory).

The original files are scans with 16x16 dots per digit but it has turned out to be rather slow in processing, so I've made the demo downscale them to 8x8 before processing.

Another thing that was rather disappointing was that I didn't save the state of a trained NN before. It's OK for small and quick training but with a slower and longer one, the results really need to be preserved. So I've added the FloatNeuralNet methods checkpoint() to dump the state to a file and uncheckpoint() to get it back. It can also be used creatively to do some training with some settings and then load it into an instance with different settings to see if something would improve.

With 8x8, it's 64 inputs, and 10 outputs for 10 recognized digits, and as a baseline I've used 2 inner layers of 64 neurons each: 64->64->64->10. But I've experimented with more and fewer neurons, and more layers.

So, what are the observed results?

First, the RELU variations work quite decently, probably because the original data is already centered on the [-1, 1] range.

The larger models are slower to train, nor just in execution time per pass, but also each pass makes a smaller improvement.

The web page talks that it's a hard problem and the error rate on the test set at 2.5% is very good. Well, the best I could do so far is about 5%. The problem with this data set is that the test set differs a good deal from the training set. So when the training set gets to a mean square error of 0.07 and the error rate of 0.5% (and keeps improving), the test set gets stuck at a mean square error of 0.2 and error rate of about 5%. Worse yet, as the error rate on the training set keeps getting a little better, on the test set it gets a little worse. Which means that the NN makes a close fit for the training data, it doesn't fit the test data. So I've tried to reduce the number of neurons per layer to 32, and the result got a little worse overall, and it still does this divergence between training and test data.

The CORNER activation showed not spectacularly but decently. Since it has a "mini-layer" on top of each normal layer, I was able to remove one layer to sizes 64->64->10, and it did just a little better than LEAKY_RELY in 64->64->64->10, and each pass got computed faster. In fact, the best seen error rate of 5% came with CORNER at about 10K passes and then deteriorated a little to about 5.3% at 20K passes, while the best I've seen with LEAKY_RELU is 5.5%. But those small differences may also be due to randomness. I've used the same starting point in the random number generation but with the different number of weights it plays out a bit different. Adding an extra layer with CORNER brings the training mean square error to below 0.05 and error rate down to below 0.3%, and even the test mean square error goes below 0.2 (which it didn't do in other combinations), but the test error rate stays at about 5.6%, so it's improving the fit but still overfitting worse than without the extra layer.

With the 32-neuron middle layers, the test error rate gets a bit worse, around 6.6% after 20K passes of training.

I've thought that maybe one of the sources of trouble is that with the inputs having absolute values less than 1, and with all the weights less than 1, it might be hard to produce the outputs of +-1. So I've tried to set the expected outputs to +-0.1, and it really made no difference. Increasing the maximum weight limits to +-100 didn't make a whole lot of a difference either (but made things slightly slower to train). Interestingly, taking a trained model with limits of +-100 and trying to continue training it with limits of +-1 produces a quite bad result that has a hard time digging out of the suboptimal point.

Applying the changes after each training case vs at the end of the pass didn't make a noticeable difference either. I'm thinking that applying at the end of the pass should provide an obvious way for parallelization: compute the partitioned set of training cases in multiple threads, each one summing up its own copy of the gradients, and then sum the partial per-thread results together.

Using the "floor" option that makes the sign-flipping of the weights more "structured" (on the first push it stops at the same sign but a small non-0 value, on the second push it switches the sign of that small value, and on the following pushes it would grow with the switched sign) made the training a little slower.

The automatic training rate estimation didn't work well at all and I don't know yet how to fix it.

What things to try next?

One obvious thing to combat overfitting would be to try some batching, and it should make the processing faster too.

Another idea I've got is that for a classifier that chooses only one category (the one with highest value), it's more important that the right neuron gets the highest output than to hit the target values exactly. So maybe taking a page from Adaboost's book, the training cases that produce the right highest output should have their gradients reduced (since they are "already right"), to give the training cases that are still wrong a higher relative weight for the next move.

And the next idea yet would be to try the convolution, which would be another step in adding complexity.

P.S. I've tried the batching. It made the error rate worse, and the larger the batch size the worse it gets. A kind of good news is that it worsened the training error rate more than the test error rate, so it definitely works against the overfitting, at least kind of and somewhat. Well, the training error rate is still lower than the test one, but with a lower relation between them.

Sunday, November 27, 2022

optimization 10 - Lipschitz costant and the last layer of neural networks

After, as described in the last post, the empirical computation of Lipschitz constant for the lower layers of a neural network didn't pan out, I've tried doing it for the last layer. The last layer minimizes a honest squared function, so it should not have the same issues as with the previous layers. And I've done it in code, see https://sourceforge.net/p/triceps/code/1770/tree/.

It computes 1/L empirically like the TFOCS library: keeps the weights and gradients from the last pass (including, by the way, the inputs to the last layer too, since they are variables in the formula too), finds the differences, computes the norms and eventually 1/L, and compares it with the descent rate used for the pass. If there is room, it gradually grows the rate. If 1/L had shrunk and the last rate was too high, it reduces the rate and redoes the previous step (this is somewhat tricky, as the gradient is computed only in the following step, but I've got the rollback to work). Most of the time one rollback is enough to set things straight. But if not, well, it's just one overshoot that would get straightened in the following steps, so I've limited the rollback to be done no more than once in any case.

The results are rather ambivalent. If the whole model consists of one Corner-activated neuron, it works great, finds the right rate, and converges quickly. The extra overhead is definitely worth the trouble there. Though interestingly, it first finds a solution with a slightly lower squared error, and then as it does more steps, the gradient reduces but the squared error grows a little. My guess is that it's probably caused by the adjustment to the Corner's offset weight that is not tied directly to the error, and what it thinks is the best balance doesn't produce the lowest error. Something to consider in the future.

With more neurons, things go worse. One aspect that definitely works well is preventing the runaway growth of the weights to infinity. I've tried an example that experienced such a growth before, and it didn't diverge any more, even though it produced some weights around 50. But the trouble is that the found descent rate is way too high. It works much better if I limit the rate to about 1/10 of the computed. It descends there rapidly but then gets stuck circling with about the same error. Reducing the rate to 1/100 makes it converge slower but gets to a lower error. 1/1000 produces an even lower error. Which means that it gets the rate wrong, and just deciding on the descent rate from the last layer is not enough, the lower layers also need to be taken into account. The idea with estimating the error in the previous layer from both gradient and descent rate of its successor should work here but it might involve a good deal of overhead.Needs more thinking.

On the ReLU and Leakey ReLU it works very similarly but doesn't resolve the issue with centering the offsets, and the rate overestimation issue is particularly sensitive for the basic ReLU, since there it can easily lead to neurons becoming dead.

So I've thought alright, maybe the overshoots are caused by the weights flipping the sign too easily with the larger rate. So I've added the idea of "weight floor": defining a small number (let's call it r) as the floor and making the values (-r < weight < r) illegal for the weights. Then if the descent tries to push a weight towards 0, crossing the +-r boundary, we'd stop it at the appropriate r or -r for this pass. If on the next pass the descent keeps pushing the weight further in the same direction, we'd "teleport" it from r to -r or the other way around from -r to r. This would prevent the large jumps, and avoid getting stuck at 0, and give a chance for a new gradient to be computed before going further, so that if it was an overshoot, it would get a chance to go back without causing a change of sign. And if different weights end up moved at a different rate, it shouldn't matter much, because they are still moved in the right direction. Overall it works, and the "teleports" do happen occasionally. But it didn't resolve the overshooting issue. I guess the trouble is mostly with the overshoots that don't cause a sign flip.

Saturday, November 26, 2022

optimization 9 - Lipschitz constant and earlier layers of neural networks

I've tried to follow through with the path described in https://babkin-cep.blogspot.com/2022/10/optimization-8-descent-rate-and-v.html, find the Lipschitz constant for the neural network to both try to make the neural network converge faster and to avoid overshooting badly when the weights at the neurons grow larger than 1. In that previous post I've looked at the last layer, so then I've figured that I should look at the previous layers: since they don't have a square, they might be easier. Or maybe not: they feed into all the following layers.

But maybe we can make them easier. We can notice that in the original formula for gradient descent every step is multiplied by the descent rate. So maybe we can just pull that descent rate factor and "put it outside the parentheses". Which means that when we propagate the gradient from the last layer backwards, we can multiply it once by the descent rate, and that would allow us to skip multiplying by it everywhere else (which is also an optimization in the software sense!). The result of this multiplication (using r for the descent rate and v for the activated output of a previous layer's neuron)

m = r * df/dv

also has can be seen as having a "physical meaning" that says "I want to shift the output v of the previous-layer neuron by this much to achieve the next value of v" (here t is the index of the training pass):

v[t+1] = v[t] - m.

In reality, of course, the output v feeds multiple neurons of the previous layer, and these feedbacks get added together. But that can be folded into the same physical interpretation too: if this output feeds into N neurons, we can say that

r = R / N

where R is some "basic descent rate", and so

v[t+1] = v[t] - R * sum{i}(df[i]/dv) / N

That is, the addition turns into averaging with a different rate R.

But wait, there also are multiple training cases (K of them), the gradients of which also get added up. That's not a problem, because the outputs from the different training cases also get added up.

An interesting corollary from this seems to be that when a neural network has a different number of neurons per layer, the descent rate should be normalized to the number of outputs, and the layers that have fewer outputs can use a higher descent rate.But I suppose there is also a counter-argument that if there are many outputs, the feedback from them is more likely to be contradictory, so the sum is more likely to be closer to 0 rather than all pointing in the same direction, so a higher rate may be useful there. That's probably why nobody bothers with adjusting the descent rate per layer.

So all right, let's build a simple example. Suppose we have a ReLU-like activation function with gradient of 1 and a neuron with 2 inputs and an offset weight

v = u[1] y[1] + u[2] y[2] + b

And since the training cases are added up linearly, we can just start with an example with one training case. Then we could write the function to minimize as

v = u[1] y[1] + u[2] y[2] +b - m

(getting ahead, this expression turns out to be wrong, but I want to get gradually to why it's wrong). The partial derivatives will be

dv/du[i] = y[i]

dv/dy[i] = u[i]

dv/db = 1

and computing at two points and finding the difference in the dimensions of gradients and arguments:

L^2 = ( (y[1, 1] - y[1, 0])^2 + (u[1, 1] - u[1, 0])^2 + (y[2, 1] - y[2, 0])^2 + (u[2, 1] - u[2, 0])^2 + (1-1)^2) /
( (u[1, 1] - u[1, 0])^2 + (y[1, 1] - y[1, 0])^2 + (u[2, 1] - u[2, 0])^2 + (y[2, 1] - y[2, 0])^2 + (b[1] - b[0]^2)
<= 1

because the lower side of the fraction has all the same terms for u and y (just in different order), and an extra positive term for b. So (1/L) >= 1, and propagating the same rate of descent or more should be what we need.

To test this conclusion, I've made up a small example with concrete numbers and did what the TFOCS library does, do some steps and compute the approximation of L by dividing the difference of gradient between the steps by the difference of weights. And... it didn't work out. Not only it produced L > 1, but I'd go back and adjust the step to be smaller, and the new L would be even larger, and I'd adjust again to an even smaller step, and L will get even larger than that again.

What went wrong? I've got the formula wrong. Looking at the expression

v = u[1] y[1] + u[2] y[2] + b

it obviously has the minimum where all the variables are 0. So yeah, if we use L=1 and step straight to 0, that would minimize it. But we don't want to get to 0, we want it to get to some value (v[t-1] - m). The right formula for minimization should be

abs( u[1] y[1] + u[2] y[2] +b - (v[t-1] - m) )

The absolute value part was missing. And I don't know how to compute L with the absolute value. I guess one way to replace it would be to substitute a square for absolute value:

( u[1] y[1] + u[2] y[2] +b - (v[t-1] - m) )^2

It can probably be done just for the computation of L, even when the neural network itself stays without squares. But I'm not sure.

Another interesting observation is what happened when my steps in the test have overshot the empirically computed L: they would flip the sign of the weights. Which means that the occasional overshoots are important, without them the signs would stay as they had been initially assigned by the random seeding. Even if we could compute and properly use the right value of L, we might not want to, because we'd lose the sign flips.

Thursday, November 3, 2022

V-shaped activation functions what-ifs

I've noticed that a simulation of the square function with a single neuron plus CORNER (V-shaped) activation function got worse after I've clamped the weights to the range [-1, 1]. It just couldn't produce a sufficient derivative to go to the full slope of the quadratic function, so it defaulted to the same shape as ReLU, going at slope 1 for the whole range of inputs [0, 1]. How about if I increase the allowed range of weights to [-10, 10]? Much better, then it could simulate the bend, producing the total maximum slope of about 1.5. So it looks like being able to go to the higher weights might be useful, just needs to be accompanied by the dynamic computation of L and the training rate from it.

Next, it looked like the right-side coefficient of the V can be fixed at 1, since the slope produced by the neuron itself would also work to produce the same effect. As it turns out, this doesn't work well. The trouble happens when the right side needs to change sign. When the right side of the V adjusts, it changes the sign easily. But without that, changing signs in the weights of the neuron itself doesn't work well, it often gets stuck. So this change doesn't work well. It's kind of interesting that when there is one variable, it changes the sign easily, if there are two, they get stuck, but if there are three, they change signs easily again. Though the real reason is probably that two variables are pushing against each other by using each other's weights as gradients, while the third variable doesn't, it just gets driven from the top. So maybe if one of two variables would be just driven from the top, that would prevent two variables from getting stuck too. Will need to think about that.

Next, what if we do a common Leaky ReLU but adjust the offset using the same formula as with the CORNER? (Leaky ReLU has to be used because with plain ReLU the gradients on the left side would be lost, and the computation for he offset won't work). This worked about as bad as normal Leaky ReLU, and maybe even worse, so it's definitely not a solution on its own.

So it looks like the CORNER function does have value. And maybe the right-side coefficient can be rid of if the main neuron weights can be made to change sign more actively, perhaps by making one of them fed the gradient of the top, or by applying the same gradient to them in both ways.

Sunday, October 30, 2022

optimization 8 - the descent rate and V-shaped activation

In the last post I've talked about how the computation with the V-shaped ("corner") activation function sometimes diverges badly. It turns out, the activation function as such is not the reason.

Another change I've done at the same time was to remove the limit on the weight values (or practically speaking, changed it from +-1 to +-1e10). That was the reason for divergence. Set the limit back to 1, and the divergence disappears. On the surface, that limit of 1 was arbitrary, so if we want to model arbitrary functions, it made sense to remove this limit (and in particular, to do a XOR in a single layer, I needed the values equal to 2). But unfortunately there is a connection between the weights (i.e. multipliers) and the rate of descent that can be used. Here is how it works.

There is a way to evaluate, what rate of descent is safe, it's used in the ISTA algorithm and uses the Lipschitz constant L. The safe rate of descent (that doesn't cause us to overshoot the minimum and diverge) is 1/L. L is defined as follows:

L = max( || grad_f(x) - grad_f(y) || / || x - y || )

What is what in this expression:

x and y are two arbitrary points in the multidimensional space of weights being optimized. I.e. each one of them is a vector (i.e. an array in programming terms) of all the weights.

grad_f() is the gradient (i.e. the multi-dimensional derivative) of the function being optimized. It gets computed at the argument values x and y. Note that even though the function produces a single value, the gradient is also an array, of the same size as the function argument (because it contains a partial derivative for each input dimension).

The double-vertical-line "parentheses" mean the norm operation, more exactly the second norm. It's something that I either haven't seen or didn't remember before from college but the norm operation is widely used in optimization. It's meaning is computing the length of the vector in the physical sense. There is this difference between math (and programming) and physics: in math (and programming) the length of the vector is the number of the dimensions in it, in physics it's the length of the arrow that goes from the origin to the point determined by the values of the vector as coordinates in a multidimensional space. So if you want to talk about this "physical length" in math, you'd get confused between two lengths, and thus they use a different word "norm" for it.

There is also more than one kind of norms. In general they are written with the index, like ||x||_2, but if the index is not specified, 2 is assumed. The first norm, that also could be written as |x|, represents the distance as it's walked "in steps" along the coordinate axes. Yes, it's no accident that it's written in the same way as the absolute value: for a single dimension, the absolute value is the distance we walk form the origin to the value on a single axis. So norm 1 is the generalization of an absolute value for a vector. For another example, if we have a vector [4, -3], then we'd walk 4 steps along one dimension and 3 steps long another dimension, and norm 1 will be 7. It can be written as:

|x| = sum( abs(x[i]) )

The second norm uses squares and represents the distance in Euclidean space:

||x|| = sqrt( sum( x[i]^2 ) )

So for a vector [4, -3], norm 2 will be sqrt(4^2 + (-3)^2) = 5. Apparently there are higher norms too but I didn't look at them.

Returning to the formula for L, with all the parts parsed, the meaning of it is "if we do a step from any point x to any other point y, what will be the maximum change of gradients between these points, in relation to the length of the step taken?" This is important for the gradient descent because there we take the steps proportional to the gradient. So we make a step that is based on the gradient, and after the step we get another gradient. And we don't want to overstep the minimum, which means we don't want to go beyond the point where the gradient is 0. So if L is large, meaning that a small step can cause a large change in gradient, we should be taking only small steps, and if L is small, meaning that nothing much changes until a large step, we can take the large steps. Hence the safe rate of descent is 1/L.

For the quadratic polynomial functions, L can be computed as a constant. The expression for L divides the difference in gradient by the difference in x, which is essentially the second derivative, which is constant for the quadratic polynomial functions. But it's a derivative computed from the worst possible point in the worst possible direction, so this constant may underestimate the allowed steps. Consider a simple function

f = 1000 * x[1]^2 + x[2]^2

Here the dimension x[1] with its high coefficient of 1000 is what will drive the constant L to a high value. This value of 1/L will be so good that it will really get us to the optimal value of x[1] in a single step, essentially applying Newton's method! (Possible for the simple functions that don't have dependencies between two variables). But then we'll have to take 1000 steps of this size in the dimension x[2] to get to its optimum. Not great but at least safe. There is a Matlab package for optimization called TFOCS, and what they do is measure the actual L from the recent steps: we've changed x[] by so much, and gradient has changed by so much, we can divide them and assume that it's a reasonable estimation of L for the part of the function where we're traveling. And hopefully we're traveling towards the optimum, where the gradients become smaller, so we can assume that L should be getting smaller too. So in this example, after TFOCS gets the dimension x[1] right, it will detect that in the following steps L becomes smaller, and it can take the larger steps.

We can do it for the more general functions too, where the second derivative is not constant but depends on X, with similar assumption that were' traveling towards the areas with smaller gradients, but they're kind of difficult in our case.

We can write out the loss function to minimize in the simplest case (for one neuron of the last layer) as

f = (x[1] * x[2] + x[3])^2

Here x[1] is one input, x[2] is one weight, and x[3] represents the whole rest of the neuron bunched together: the sum of other products of inputs and weights, and subtraction of the true result from the training case. Since I'd like to avoid dragging around the indexes, I'll rename the variables: x[1] will become x, x[2] will become y, and x[3] will become b:

f = (x*y + b)^2

Then the dimensions of its gradient will be:

df/dx = 2*y^2*x + 2*b*y = 2y(xy + b)

df/dy = [similarly] =2x(xy + b)

df/db = 2*x*y + 2*b = 2(xy + b)

Yes, b counts here as a variable too, cumulative of its internal parts. Let's do the initial computations for the square of L, then we can use the squares of the norms and not mess with roots for now. There let's use the indexes to identify the point, so the dimensions (x[0], y[0], b[0]) will be one point and (x[1], y[1], b[1]) will be the other point. Then the squared norm 2 for the points will be:

|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2 + (y[1] - y[0])^2 + (b[1] - b[0])^2

And the squared norm 2 for the gradients will be:

|| grad_f[1] - grad_f[0] ||^2 =
= (df/dx[1] - df/dx[0])^2 + (df/dy[1] - df/dy[0])^2 + (df/db[1] - df/df[0])^2 =
= (2 y[1] (x[1]y[1] + b[1]) - 2 y[0] (x[0]y[0] + b[0]) )^2 +
+ (2 x[1] (x[1]y[1] + b[1]) - 2 x[0] (x[0]y[0] + b[0]) )^2 +
+ (2 (x[1]y[1] + b[1]) - 2 (x[0]y[0] + b[0]) )^2

Finding or even estimating the whole maximum is so far too complicated for me but as the first approximation let's consider what happens if the point moves in one direction only. For example, it moves by x and y[1]=y[0] and b[1]=b[0]. Then the expressions become simpler:

|| (x, y, b)[1] - (x, y, b)[0] || ^2 = (x[1] - x[0])^2

|| grad_f[1] - grad_f[0] ||^2 =
= (2 y^2 (x[1] - x[0]))^2 +
+ (2 (x[1]^2 - x[0]^2) y + 2 (x[1] - x[0]) b )^2 +
+ ( 2 (x[1] - x[0]) y )^2 =
= (2 y^2 (x[1] - x[0]))^2 +
+ (2 (x[1]^2 - x[0]^2) y )^2 + 8 y b (x[1]^2 - x[0]^2) (x[1] - x[0]) + ( 2 (x[1] - x[0]) b )^2 +
+ ( 2 (x[1] - x[0]) y )^2

When we divide the squared norm of the gradient difference by the squared norm of points difference, we get:

L^2 =4 y^4 + 4 y^2 +
+ 8 y b (x[1]^2 - x[0]^2) / (x[1] - x[0]) +
+ 4 b^2 + 4 y^2

And there is a similar derivation for the other dimensions. This expression works out not so bad for |x| <= 1 and |y| <=1. In this situation we're guaranteed (x[1]^2 - x[0]^2) / (x[1] - x[0]) < 1, and all of the 2nd and 4th powers of y being < 1, so the result is limited to

L^2 <= 4 + 4 + 8 b + 4 b^2 + 4

and so

L <= 2 * sqrt(b^2 + 2 b + 3)

That's only one dimension for calculation of L, by x, but the other dimension by y is similar, and for dimension b the formula reduces to an even shorter one

L <= abs(2 * b)

So if we combine the dimensions (I'm not sure if this is the right thing to do but I guess close enough) the full L will be

L <= 2 * sqrt(3 b^2 + 4 b + 6)

This explains why in my small examples with very few neurons the training rate of 0.1 per round (divided by the number of training cases in a round) generally works well.

But if the weights are allowed to be over 1, then all of a sudden we get squares of large values, and L grows quite large, and 1/L becomes very small.

So I've tried to limit the weights to +-1, and lo and behold, the examples that used to be badly diverging, are now converging very nicely, and produce a decent approximation of a parabola with an error value quite close to a manually-built approximation.

I guess this means that much of the training speed can be gained with the right dynamic estimation of L. And I get the feeling that when we propagate the gradients backwards through the layers, and multiply them by the coefficients smaller than 1, this means that not only the gradients diminish but also the L value should be shrinking, and the learning rate of 1/L should be growing. But working this out will take more time and experimentation.

Monday, October 24, 2022

V-shaped NN activation function 2

I've got around to try the non-standard activation function I've described in the last post, I've named it "corner" function. And there is some bad news, some good news.

First, since I've added support for arbitrary activation functions to FloatNeuralNet, I've tried the "leaky ReLU" too. It's like ReLU but the left side has a small positive gradient instead of 0. On the weird task I've been playing with (simulation of a smooth quadratic function) it didn't work any better than normal ReLU.

Then I've tried my corner function, first just a single neuron. And the result was mixed. Sometimes it did work decently, sometimes it didn't. It didn't when the breaking point happened to be outside of my range of input values. Just as I suspected, there is no consistent force that adjusts the breaking point's position horizontally, If it gets into range of training data, it does something (not great either), but if it isn't in range to start with, it does pretty much nothing, and the function reverts to doing a linear approximation.

So I've tried to solve this empirically. I've thought of measuring the amounts of error on the left and right side of the breaking point. Then if we have more error on one side, this probably means that this side is more non-linear than the other side. And we could approximate it better if we had to deal with fewer sample points on the more non-linear side, so we have to move the breaking point toward that side, until some points shift to the other side and the balance returns. This sort of reminds me of AdaBoost that tries to use the result of a partial solver to balance the results by some criterion, but obviously it's not quite the same thing.

To give an example in beautiful ASCII art, in a situation like this, where the breaking point is marked with "@", the line produced by the neuron with "-" and the training points with "*":

                   * --
                   --
---- *           --
 *  ---- *     --
      * ---- --   *
           *@  *

shifting the breakin gpoint to the right would allow to do a better match on the right side without messing up the left side:

                   * -
                    -
---- *             -
 *  ---- *        -
      * ----     -*
           *---*-
               @

For the measure of the error we can use the absolute value of sigma (sigma is the error gradient propagating from the higher layer). It's absolute because we don't care if the point is above or below our computed value, that part would be sorted out by optimization of the other weights. And the thing that shifts the breaking point is the bias weight in the neuron. So if the value produced by neuron before activation is positive (i.e. on the right side from the breaking point), we can use the gradient with abs(sigma), if negative (on the left side) then with -abs(sigma).

The question is, how exactly to fold it into the gradient. Just using it as a gradient seems to have too strong an effect. Except when the neuron produces a signal that is always very close to 0, then it's not enough, because sigma becomes very small. Adding a small constant of the same sign solves this particular problem. But the rest of the time it still seems too strong. For that I've tried to take a fraction of the absolute sigma. Then I've tried to mix it with the traditional gradient. Both worked better, but then there is the question of what fraction to take and at what proportion to mix? I don't know yet how to figure it out, and I'm not sure that the empiric values transfer well between different examples. So the other way I've tried was to compute

gradient = output_before_activation * abs(sigma)

It seems to work decently well but it prioritizes the far away points over the near ones, and I'm not sure if this is the right thing to do, it's rather counter-intuitive.

With this, an one-neuron model works quite well. The problems start with more neurons - sometimes they do converge quite well, much better than what I've seen with ReLU, and sometimes not. One weird thing that happens is that the model goes converging quite well for a while, and then all of a sudden the error jumps up, then continues converging. Sometimes the training rate turns out to be too high, and the model explodes. Maybe those jumps in the other cases are also results of the training rate being too high at some points, which causes these jumps. Or maybe as the breaking point shifts and a training point ends up on a different side of it, that causes a strong jump. Or maybe both are connected, I don't know yet. Sometimes the training rate turns out to be too low - this generally happens when a neuron is driven to produce near 0, which tends to happen when the random seeding makes the V point upside down, and then as it tries to rights itself, it sometimes ends up with a combination of weights that shrinks the gradients. This is not a unique problem, this is something that can happen with any neural network. By the way, this time the variant that computes the gradient for the whole training pass before adjusting the model tends to be more stable than one that makes a step after each training case.

So where does it leave me? It looks like the whole thing can benefit from some momentum method to get out of the low-gradient situation, and from the auto-detection of the learning rate that is too high in the high-gradient situations. And maybe to think of some other way to prevent the wedging on 0s. The empiric solution for positioning the breaking point needs to be worked through the math, and maybe that would suggest a better solution too. And maybe the same way of adjusting the bias weight can be applied to the classic ReLU too (only the left side that generates 0 would probably have to use the sigma that didn't get zeroed by that yet).

Thursday, October 20, 2022

V-shaped NN activation function

In my previous post https://babkin-cep.blogspot.com/2022/10/optimization-7-follow-up-to-neural.html I've had the idea that maybe there should be a V-shaped activation function. After some thinking, I think I've got an idea of what this function could be. Let's call it A(x):

A(v) = {
    v < 0 : k1*v + k2;
    v = 0 : k2;
    v > 0: k3*v + k2;
}

It's not necessarily V-shaped as such, it just consists of two linear segments joined at a breaking point. k2 shifts this point vertically. Note that in the neuron formula (using the same variables as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html):

N(v[]) = sum(v[i] * x[i]) + x0

the offset x0 also shifts the result vertically, but does that before activation, effectively shifting the breaking point left or right, while k2 does it after activation and shifts the result up or down.

If the values happen to be k1=0, k2=0, k3=1, then A(x) is equivalent to ReLU. But a major difference here is that k1, k2, k3 also get adjusted by backpropagation. It nicely fits into the backpropagation model: the function is not contiguous but has a derivative at every point, and this derivative is generally not 0, and it's also a linear function, so it can even be applied after the last layer of neurons.

Applying the same sample logic for the last layer as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html, if we have

N(v[1..3]) = v[1]*x[1] + v[2] * x[2] + v[3]*x[3]

then after activation, assuming that f() < 0, the formula will be:

output = A(N(v[1..3]) = k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2

Then the squared error will be:

f(v[1..3]) = (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b)^2

and using the same logic as before, the partial derivatives will be:

df/dx[1] = 2 * k1 * v[1] * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * k1 * v[1] * (output - b)

df/dk1 = 2 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3])
* (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * N(v[]) * (output - b)

df/dk2 = 2 * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * (output - b)

And when propagating to the previous layers, the propagation back through the activation function will multiply the sigma value by its derivative:

dA/dv = {
v < 0 : k1;
v = 0 : 1;
v > 0 : k2;
}

and the gradient (AKA derivative) for k1 or k3, depending on which path was taken, will be (N(v) * sigma), and for k2 it will be sigma.

Another interesting property of the V-shaped function is that it can implement a XOR logical operation on two inputs in one layer instead of the classical two. If we have false defined as -1 and true as 1 then we can define:

N(v[1], v[2]) = v[1] - v[2]

k1 = -1

k2 = -1

k3 = 1

It could also be argued that this activation function essentially adds an extra layer to every layer of neurons, but it's a special layer, much cheaper than a normal one. I think implementing an equivalent of V-shaped activation with plain ReLU would even require three layers to keep k2 common.

So it looks interesting and promising, although I'm not entirely sure that the breaking point will end up moving in the "intuitively correct" directions. Now I need to try and see. And first, reorganize the training code in FloatNeuralNet to support multiple kinds of activation functions, and factor out the repetitive code in it.