Tuesday, October 17, 2023

comparing the FloatNeuralNet options

I've done a plot of the errors with multiple variations of the FloatNeuralNet options, using the Leaky ReLU as the least common denominator. I've added some too: for example, I've never tried before to do the momentum method without my improvements, so I've done a custom build without them. The results are interesting, I should use more graphs in the future (and add a proper graph generation rather than scraping the debugging log).

First of all, none of non-stochastic implementations have done as well as the stochastic one. Well, they do better for the same training rate, but the catch is that the stochastic approach is stable at a much higher training rate. It works fine with a 100 times higher training rate for the same initial seed, and starts getting rough but still even faster at a 1000 times higher training rate.

The second observation is that without my changes the momentum method doesn't work at all. At all. It just breaks down right away.

The third observation is that my auto-adjustment of the training rate starts quite fast, somewhere between stochastic method's 100x and 1000x training rates but then flattens out and falls behind. So even as-is it's probably a valuable technique for quick short training. 

As for why it flattens, my hypothesis is this: Imagining the optimization as a ball rolling down the hill along with the gradient, killing the momentum in the dimensions where the gradient keeps changing sign prevents the ball from wobbling in the trough and jumping out of it. It works essentially as a shock absorber. But what if the trough doesn't go predominantly in one dimension? What if it's going at a diagonal in multiple dimensions? Then the damping that prevents wobbling will also be slowing down the ball. This is I think what happens when the optimization slows down. Fortunately, the mathematicians have a way to find the dominant direction of the trough and rotate to it, the Principal Component Analysis does that. Then, I guess, the same momentum method can be applied to the rotated dimensions, and then rotated back. Unfortunately, this whole procedure is probably quite expensive. And as the ball rolls, the rotation will change, so the post-rotation dimensions will change, and it's not clear, how to match them between iterations. Maybe the rotations should be recomputed only infrequently, when the progress slows down: this will both amortize their costs and keep stability between iterations.

Sunday, July 2, 2023

trends

Some time ago Linkedin started sending weekly e-mail with the trends for "your kind of jobs". Every week since it's been reporting a 10% drop in job postings, then went over 10%. Well, last week it held at 0, then this week it dropped another 13%. 

At the same time the government reports tout the job growth. So, is it a drop in high-paying jobs and growth in low-paying jobs? (Which is not exactly a great thing). As it happens, I know someone who runs a job posting web site that is used predominantly for low-to-mid paying jobs. And guess what, he says that the postings on his site have also sharply dropped around April. So no, not a growth of low-paying jobs either.

 What does it all tells us about the government statistics? As they saying goes, "lies, damned lies, and statistics". Maybe they'll "revise" it a few months later as they've done with statistics on income after inflation, where the loudly touted slight growth above inflation had quietly turned into a ~5% loss.

Saturday, June 3, 2023

the first actual use of Triceps

 I've recently got Triceps used for a real application, for the first time ever, after approximately 13 years. Not really a production usage, rather a scientific demo, but it's an outside project, not part of Triceps itself. Embedded in Matlab, or all things. No, not a full Matlab API, but an application written in C++ embedded as a native function into Matlab. Which is exactly what Triceps was designed for.

But really Matlab and Triceps with a full Matlab API might be a match made in heaven. Matlab really sucks at two things: the data structures where you can append and search by value, and parallelism. And those are things where Triceps excels. Yeah, Matlab has the parfor loops but the trouble there is that the parallel threads (they're more typically processes but logically still threads) are stateless. All the input data is packed up and sent to them (at a great overhead), and then the results are packed up and sent back. You can't just preserve state in a thread between two calls, it has to be sent from scratch every time. And no, it doesn't seem to be the same constant in shared memory read in parallel by multiple threads. It actually gets copied for every thread. So parfor only works well when you send a small amount of data, process it for some long time, and then send a small result back. Not well when you want your thread to make queries to a large database. But keeping state is what a Triceps thread does. The Triceps threads are also easy to organize in pipelines. (Yeah, Matlab does have some pipelines in one of extensions, but they seem as cumbersome as parfor). And the read-only shared memory would work too if queried through Triceps and only small results of the queries get returned to Matlab. It could work really, really awesomely together. The trouble of course is that I personally don't have much of an incentive to do this work.

That's the good part. What went badly? Well, Triceps in C++ feels rather dated. It's a project that was started in 2010, before C++11, and it feels like that. I didn't even aim for C++ to be the main language, but more as an API for embedding into the other languages. But now it can be done much more smoothly straight in C++. So if I ever get to it, a modernization of the C++ API is in order. Some libraries are dated too, in particular NSPR. It also proved to be a pain in building: I haven't thought much about it before, but the definitions used in building of the applications that use Triceps have to be the same as when building Triceps itself. So if triceps is built with NSPR, and the application doesn't include the right definitions for the Triceps headers to use NSPR, it crashes in a surprising way. Fortunately, the standard C++ library now has APIs for the atomic values, so transition to that API is now in order. On the other hand, shared_ptr is a more complicated question, and keeping the Triceps Autoref might still be better for efficiency.

Saturday, April 15, 2023

Bayes 29: computation for numeric values (and Python)

As promised in the previous installment, here is the how the data science handles the Bayesian inference directly on the numeric values. Remember from that last installment, the formula for inference by weight is:

W(H[i]|E) = W(H[i]) * P(E|H[i])

So what they do is instead of using the probability P(E|H[i]), they use the probability density p(E|H[i]) (note the lowercase p) from a previously computed distribution, this time E meaning not just a specific event variable but a specific numeric value of a specific variable. Which really makes sense, since that's what the probability density means: the probability that this specific value would come out in the distribution for this variable. Since it's a conditional distribution, only the training cases for the hypothesis H[i] are used to compute the distribution.

The typical distribution used here is the normal distribution. For what I understand, it's not only because it's the typical one occurring in reality, but also because for any original distribution, once you start repeatedly pulling the examples from it, the values produced by these examples become described by the normal distribution (if I remember right, this comes out of the Central Limit Theorem). The normal distribution has two parameters: mu (the mean value) and sigma (standard deviation, equal to the square root of the variance). The probability density function then is:

p(x) = (1/sqrt(2*pi*sigma^2)) * exp( -(x-mu)^2 / (2*sigma^2) )

Or, since sigma here is always squared, we can also replace the squared sigma with the variance var:

p(x) = (1/sqrt(2*pi*var)) * exp( -(x-mu)^2 / (2*var) )

The values of mu and var would be different for each hypothesis H[i]. So we'd split all the training cases into subsets by mutually exclusive hypotheses, and then compute:

mu[E][i] = sum(cases[i][E]) / N[i]
var[E][i] = sum((cases[i][E] - mu[E][i])^2) / (N[i] - 1)

Here N[i] is the count of the training cases with the outcome H[i], and cases[i] are all the cases with that outcome, effectively N[i] = count(cases[i]). The reason why we use (N[i] - 1) instead of plain N[i] is that it computes not a variance but a sample variance. This comes out of a rather interesting distinction between the probability theory and statistics: the probability theory describes the random distributions where the parameters of these distributions are known in advance, while statistics describes how to deduce these parameters from looking at the samples of the data. Here we obviously don't somehow know the distribution in advance but have to deduce it from looking at the training cases, i.e. the data samples, so we have to use the statistics and its sample variance. Technically speaking, mu here is also not the true mean but the sample mean, but the formula for it is the same anyway. However the need to divide by (N[i] - 1) comes from the sample mean producing a different value that is offset from the true mean, sample variance counter-adjusting for this offset. And the reason for this offset is that we do the sampling without replacement (i.e. each training case is used at most once in the computation). If we did the sampling with replacement, for example select 1000 values at random from the training cases (each time from the whole set of the training cases, not excluding the cases that have already been selected but allowing them to be selected again), then when computing the sample variance we'd use the plain N[i], i.e. the same 1000 rather than 999. Or if we magically knew the true mean, we'd use N[i] for the sample variance too.

And here is a good spot to talk about Python. As much as I detest the language, I see why it's popular for the data science: because it has some very good libraries in it for this purpose: numpy, sklearn, matplotlib. They've even added an operator of matrix multiplication @ into the language for the convenience of these libraries. And the point of the matrix operations is not only that they allow to write down things in a form that is short and comfortable for the data scientists (although confusing as hell for the rest of us), but also that they allow to move all the tight loops into the C++ code that is way faster than the Python code, and also allows the library to internally parallelize the computations as needed. So the computations in a slow interpreted language with questionable parallelism become quite fast and quite parallel. Aside from that, the libraries provide other conveniences. Sklearn even has a bunch of popular data sets built right into it. So in Python the code to compute mu and var looks like this:

# X is a numpy matrix, where each row corresponds to a training case
# and each column to an input variable (i.e. a numeric event);
# Y is a numpy vector containing the outcomes for all the training cases

# Find the subset of the training case that have the outcome i,
# it produces a vector of the same size with True for
# each included case and False for each excluded case.
selector = (Y == i)

# Select the subset of training cases for i, keeping all the columns
subset = X[selector, :]

# Count the included cases, which is the number of rows in the subset
n = subset.shape[0]

# Compute the sample mean, by averaging across all rows (axis=0). 
# The result is a row with a column per input.
mu[i] = subset.mean(axis = 0)

# Compute the sample variance, again producing a row with a
# column per input. The tricky part is that mu[i] is a single row,
# so the subtraction operator automatically considers it as a matrix
# that has as many rows as subset, with all the rows being the same.
# Sum adds up the columns across all the rows (axis=0) into one row.
# And division by a scalar divides each column by the same scalar.
# The result is a row with a column per input.
var[i] = np.square(subset - mu[i]).sum(axis=0) / (n - 1)

But wait, there is more. Note how the probability density function has an exponent function in it. Instead of computing the weights by taking the exponents and multiplying, we could compute the logarithms of the weights, by adding up the logarithms (and the logarithm of the exponent is the argument of the exponent, saving this function computation). So the formula becomes:

logW(H[i]|E) = logW(H[i]) + log(p(E|H[i])) =
  = logW(H[i]) + log(1/sqrt(2*pi*var[i, E])) 
             - (x[E] - mu[i, E])^2 / (2*var[i, E])

The part log(1/sqrt(2*pi*var[i, E])) is a constant that can be pre-computed in advance, so the computation with a few multiplications and additions is quite efficient.

Monday, April 10, 2023

Bayes 28: computation by weight revisited

I've found out recently how the people in the data science compute the Bayesian inference for values from an arbitrary numeric range (as opposed to yes/no events). As it turns out, they do it using a computation by weights. I've been wondering for a while after discovering the computation by weight on my own, why nobody else uses it, it's so much simpler than by probabilities. So the answer is similar to what I've discovered before for the connection between the Bayes and neural networks: the people in the field do know about it and do use it, only the simplified popular explanations don't.

I wanted to write down the explanation of how they do it (at least, for myself in the future), and so I went back to read what I wrote before about the Bayesian computation by weight, and found that what I wrote before is quite complicated. I wrote it down as I discovered it when experimenting with all those extra parameters that make a Bayesian system work in reality, and so that explanation is also full of discussion of those parameters. Also, I've learned some things since then.

Before going into th enew ground, let me try a new take on the same thing: discuss the Bayesian inference by weight, only now in its most basic form.

Let's start again with an event E and hypothesis H. In the modern terms, they call this machine a Bayesian classifier, and call the mutually-exclusive hypotheses classes. The events are the inputs, and based on the values of the inputs the machine is trying to decide, which hypothesis is most likely to be true. The classic Bayesian formula for a binary event E is:

P(H|E) = P(H) * P(E|H) / P(E)

Here P(H) is the prior (i.e. previous) probability that the hypothesis is true, P(E) is the prior probability that the event E will be found true (after we learn the actual value of the event this probability collapses), P(E|H) is the conditional probability that the event would be true for the cases where hypothesis H is true, and finally P(H|E) is the new (posterior) probability of H after we learn that E is true. If we learn that E is false, we would compute P(H|~E) instead, for the complementary event ~E.

Where did that formula come from? Well, when talking about probabilities, it really helps to draw a diagram that starts with the range of all the possible cases and then splits it into parts with the appropriate weights. Let's draw this field as a square, in beautiful ASCII art. And then split it with a horizontal line so that the area above the line matches the number of the cases resulting in the hypothesis H1 being true, and below the line with H1 being false (i.e. the complementary hypothesis ~H1, which we'll also name H2, being true). This is for a very simple classification with two complementary hypotheses: H1 = ~H2 and H2 = ~H1.

+---------------+
|               |
|               | H1
|               |
+---------------+
|               |
|               | H2 = ~H1
|               |
+---------------+

Now let's do the same, but split the same square vertically. The left side will match the event E1, and the right side will be its complementary ~E1:

+---------+-----+
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
+---------+-----+
    E1      ~E1

Now let's take the second picture and split each vertical part horizontally, into two parts, that again match H1 on top and H2 on the bottom. I've marked these parts with the letters a, b, c, d.

+---------+-----+
| a       | b   |
|         |     |
|---------|     | H1
| c       |     |
|         |     |
|         |-----|
|         | d   | H2
+---------+-----+
    E1      ~E1

Here the parts a and b correspond to H1, and their total area is equal to the original area (within the ASCII-art limitations) of the upper part of the split in the first picture. The parts c and d correspond to H2, and again the sum of their areas is equal to the original area of the lower part. But obviously they're split differently on the left and right sides, meaning that if we know that E is true, H1 has a lower probability than H2, but if E is false, H1 has a higher probability. And if we don't differntiate based on E1, the left and right parts average out.

The areas of these parts a...d are the weights of four sub-divisions:

a: E & H1
b: ~E & H1
c: E & H2 (or equivalently E & ~H1)
d: ~E & H2 (or equivalently E & ~H1)

The reason why I prefer to use H2 instead of ~H1 is that this notation allows to generalize more obviously to more than two hypotheses: H3, H4, and so on. Each additional hypothesis would add two areas to the picture, one on the left and one on the right.

Now we can express the probabilities through relations of these areas (and areas can also be called weights):

P(H1) = (a + b) / (a + b + c + d)
P(H2) = (c + d) / (a + b + c + d)
P(E) = (a + c) / (a + b + c + d)
P(~E) = 1 - P(E) = (b + d) / (a + b + c + d)
P(E|H1) = a / (a + b)
P(E|H2) = c / (c + d)
P(H1|E) = a / (a + c)
P(H2|E) = c / (a + c)
P(H1|~E) = b / (b + d)
P(H2|~E) = d / (b + d)

The general principle for P(x|y) is that the area that satisfies both conditions x and y becomes the numerator, and the area that satisfies y becomes the denominator.

Alright, let's substitute these formulas into the Bayesian formula:

P(H1) * P(E|H1) / P(E) = P(H1) * (1/P(E)) * P(E|H1)
  = ((a + b) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (a / (a + b))
  = a / (a + c)
  = P(H1|E)
  
P(H2) * P(E|H2) / P(E) = P(H2) * (1/P(E)) * P(E|H2)
  = ((c + d) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (c / (c + d))
  = c / (a + c)
  = P(H2|E)

So that's where that formula comes from and how it gets proven. Note that the computation of both probabilities involves the final division by the same value (a + c). If we just want to compare them, this division by the same value doesn't matter and we can skip it. Instead we'll just get a and c, which are also the weights W(H1|E) and W(H2|E), and if we want to get the probabilities, we can normalize by dividing by their sum. Or I guess rather than W(H1|E) we could say W(H1 & E) but that's the beauty of workiing with the weights instead of probabilities: the probabilities require normalization at each step, dividing by the sum of all possible weights, while with weights the sum is kept implicit, and W(H1|E) = W(H1 & E) = W(E|H1). When expressed in weights, the formulas become simpler:

W(H1) = a + b
W(H1|E) = W(H) * P(E|H1) = (a + b) * (a / (a + b)) = a

That's pretty much it. But there is one more question to consider: what if we have more than one event? We usually do have multiple events. After we compute P(Hi|E1) (or W(Hi|E1)), now we have a smaller rectangle left, with the level of the horizontal divider shifted from where it was in the previous square (or rectangle). What do we do with the next event? There are multiple ways to look at it.

One way is to hope that the events are completely independent from each other. This basically means that as we look at each part produced on splitting by E1 (E1 and ~E1), and further split each of these parts by E2, the horizontal lines in each vertical quarter shift according to the current level of H in the part being split, with the result that P(E2|E1,Hi) = P(E2|Hi). It would look something like this:

+----+----+--+--+
| e  | f  |g |h |
|----|    |  |  |
|    |    |  |  | H1
|    |----|  |  |
|    |    |--|  |
|    |    |  |  |
|    |    |  |--| H2
+----+----+--+--+
 E1   E1   ~E1 ~E1
 E2   ~E2  E2  ~E2

The equality P(E2|E1,H1) = P(E2|H1) = P(E2|~E1,H1) would imply (even though it doesn't quite look like it in ASCII art):

e / (e + f) = a / (a + b) = g / (g+h)

That's a simple-minded implementation (in the scienific circles they call it naive, sometimes even spelled with two "i"s). The problem there is that when the assumption is not true and the events are strongly dependent, this can drive probabilities in weird ways.

Another way would be to build the tree of exact splits: split the slices produced by E1 into slices for E2, then for E3, and so on, and for each vertical slice after each split find the exact proportion of cases. This is obviously much more complex, and complexity grows exponentially with the number of events.

The third way (I think I've just realized it from reading the materials about data science) would be to track the splits by events pair-wise, do the vertical splits by each pair: (E1, E2), (E1, E3), ..., (E1, En), (E2, E3), (E2, E4), ..., (E2, En), and so on. I haven't quite thought through yet the adjustments that would need to be computed for each split. I guess, fundamentally it would be a representation of the covariance matrix (another word I've learned). But I guess there are two potential ways to do it: one would be to adjust the positions of the horizontal lines after the second split, another would be to adjust the positions of the vertical lines for the second split. I'm not sure which one is more correct. Maybe either way would adjust the ratio of the areas on the left and right to the same value. But either way, after you apply E1, adjust the probabilities of E2, E3, and the rest of the events according to this matrix. Then after applying E2, adjust E3, E4, and so on. It won't be a perfect match that can be produced with the exact splits but it would easily catch the events that are duplicates and near-duplicates of each other.


Monday, March 6, 2023

VNC How-to

Theoretically speaking, an X terminal can work remotely through a tunnel with ssh -X. But in practice it does that very, very slowly. I don't understand why but they do a huge number of synchronous requests, which become very slow when the RTT is high. The thing that allows to use it over a long distances is called VNC. Its documentation is surprisingly poor, and so are the recorded talks, but I've finally figured it out.

All the VNC varieties grow from one source, that used to be Open Source but produced by a commercial company (NX). At some point they've stopped opensourcing it, but the previously published Open Source version started living its own life (FreeNX). The next major branch was TightVNC, and then TigerVNC branched off (but for what I undertand, TigerVNC is still backwards-compatible with TightVNC). Nowadays either TightVNC or TigerVNC or both are included in the Linux distributions as packages.

Running is fairly easy. On the remote machine start:

xvncserver -geometry 1800x1100 -alwaysshared :1

Change the geometry and display number to taste (or just skip the display number altogether, it will pick the first free one). "-alwaysshared" means that it will allow multiple parallel connections. For some reason it doesn't allow to set dpi (the display resolution) but you can make a copy of the script and add the option -dpi to the X server command (but it also looks like almost  nothing nowadays pays any attention to the DPI set in the X server).

You can start this from an SSH session, and it will keep running after you close the session. It doesn't use SSH tunneling but opens its own socket, and does its own encryption and password on it (it asks to create the password on the first start). Caveat: apparently some 10 years ago a bug was found where VNC allowed bypassing the password on connection. So better not open it to the wide Internet, just in case, but then connect the display through an SSH tunnel that forwards to the VNC server (it's easy, just one option on the client).

To kill the server later, do

xvncserver -kill :1 

To connect to the server, run on the display side:

xvncviewer -via user@gateway host:1

Here host is the name of the host with the VNC server, and "-via" means to go through an SSH tunnel before connecting to the host. The screen and all applications stay alive between the connections, which is awesome (just like RDP).

That's it, just two commands. It's somewhat inconvenient that Alt-Tab for window switching gets consumed by the local machine, but you can redefine an alternative combination on the remote machine instead (at least in the civilized session managers like MATE or Cinnamon). 

A little on how it works: just as you could have expected, it starts two proxy X servers derived from Xnest, one on the remote server, one on the display side. So most of the synchronous requests get handled locally on one or another side. And the remote server stays alive all the time, so it preserves the session state between connections. But there is more to it, since the protocol between these two X servers is not the regular X protocol under encryption but a modified one, since the NX times. There are even multiple versions of these protocols, but fortunately the client and server are smart enough to negotiate them, and it just works.



Saturday, February 18, 2023

arrested bouncing in FloatNeuralNet

It's been a while since I wrote any updates about my experiments with optimization of neural networks. Partially, it's because I've been busy with other things, and partially because I've tried something a few weeks ago, it didn't work out well, and I wasn't sure what to do with the results. On one hand, it didn't work out, so carrying this complexity into all the future code would be annoying, and even gating this code behind an option would create unpleasant effects on the performance. On the other hand, I think these experiments were really interesting, and maybe I'd want to do another variation of them in the future, so I didn't want to just discard the code (I've already discarded some code  in the past only to re-create it later). So I've been stuck thinking what to do with it, and I couldn't do the other experiments in the meantime.

I think I have a good solution for this now: the "mini-branches". I've copied the original and modified code, and the description into cpp/nn/xperim, one subdirectory per "mini-branch", where it can be found, and diff-ed, and reapplied if needed. 

The mini-branch for those last experiments is https://sourceforge.net/p/triceps/code/1803/tree/trunk/cpp/nn/xperim/001802-bounce/ (1802 is the root revision for this mini-branch in SVN on SourceForge).

Both of these experiments center around the momentum descent. To recap on the previous state, I've been using the momentum descent loosely based on FISTA, with some extensions. One of these extensions is that whenever the gradient changes sign in some dimension, I kill the momentum in this dimension. Another is that I use these gradient sign changes as an indicator to auto-detect the safe basic gradient descent rate: I keep growing this rate until many (for some definition of "many" that I've been trying to tune with various experiments) dimensions start flipping the gradient signs after each step, then I back it off, then continue growing again.

The premise of the first experiment started with me noticing that I've unintentionally changed some logic from FISTA. FISTA has an interesting computation for the coefficient "nu" that determines how much of the momentum is used in each step. It starts with not having any momentum to apply on the first step, then on the second step it skips the momentum too, then it bumps nu to 1, and then gradually reduces it to "apply the brakes" and avoid circling too much around the optimum. But killing the momentum on the gradient sign change serves as an alternative application of the brakes.So I've removed that gradual reduction of nu, and things worked better. 

But I've noticed that I didn't handle the second step right. It wasn't keeping the momentum at 0. My guess has been that this was done to avoid including the large first step in random direction from the random initial point into the momentum. So I've set to fix it. But it made things worse, the error reduction had slowed down. Evidently, the first step is not so random after all. Or maybe my momentum-killing code is very good at eliminating the effects from the truly random first steps, and the remaining ones are useful.

The premise of the second experiment was to detect the situation when some dimension has the sign of its gradient bouncing back and forth on each step without reducing much in absolute value. This is what happens when the model starts tearing itself apart due to the descent step being too large: it keeps overshooting the optimum by a larger and larger value. It could also happen because of the momentum, but the momentum already gets reset to zero in my code when the gradient sign changes, so when we detect a gradient dimension change signs twice in a row, and get to more than 0.75 of the original value, that's because of the large descent rate. So how about we make only half the originally intended step in this dimension? It should land us much closer to the optimum. That's what I've done in this experiment.

And it didn't work out either, the convergence became slower. The potential explanation for this effect is the following: Consider a trough with a ball rolling down it. The trough is not exactly straight, and the ball isn't started exactly straight, so as it rolls down the trough, it keeps wobbling in the trough left and right, overshooting the center one way then the other way. As long as the ball doesn't go into resonance and starts wobbling more and more (eventually falling out of the trough), this wobble is actually not such a big deal. But when I dampen the wobble, it also dampens the speed of rolling down the trough, and since the momentum also gets killed on every wobble, the whole descent slows down. 

So it's something to think about for the future. It probably means that the speed could be improved by not killing the momentum on every change of the gradient sign, but only when the absolute change of the gradient becomes too large. But I'm not sure yet, how can this "too large" be defined reliably.

It likely also means that going to the individual descent rate coefficients for each dimension (something that I contemplated from both my own experiments and from reading the papers) might not work so well, because it's likely to cause the same kind of slow-down. We'll see, I'm now unblocked to try more experiments.

not MNIST

 I've been reading a little more about what other people do with the MNIST recognition. Apparently, doing the transformations on the training set is fairly common, so maybe a simple linear stretching and shrinking to multiply the number of examples would be good enough.

But the most interesting discovery has been that the dataset I've been playing with is not MNIST. Surprise, surprise. :-)

 The MNIST data set is much larger, and also has a higher resolution. It might be the older NIST dataset (without the "M" that stands for "Modified"), or maybe not even that. For some reason I've thought that the MNIST set comes from the ZIP codes recognition, so that's what I was looking for, the ZIP codes dataset, and apparently I've thought wrong. An interesting thing about the original NIST dataset is that it had the training set and the test set collected from different demographies (one from schoolchildren, another from employees of a government agency), so I guess if this is what I've got, it would explain why the sets don't represent each other so well. That was apparently a known major complaint with the older dataset that got straightened in the new modified one.

Sunday, January 15, 2023

trapeze to btimap for MNIST

Continuing with the trapeze-based implementation of the handwritten images that I've described in the previous post, next I've tried to do a reverse transformation: from the set of trapezes to a bitmap that tries to preserve the significant features detected by the trapezes. And if you've read the previous post, you've probably can guess the result: it became a little worse yet!

Looking at the images that got misrecognized, I think  I understand why. Think for example of a "4" and "9". "4" is often drawn with an open top. But when we draw a "9", there is also sometimes a small opening in the top right corner. When I do the detection of trapezes, this little opening becomes a significant feature, and when I convert these trapezes back to a bitmap, this little opening becomes a larg-ish opening, so now "9" became more like "4"! I guess, with a large and representative enough training set, they could be differentiated well. But not if all the "9"s with an opening are in the test set (and yes, if I fold the test set into the training set, it gets recognized quite well).

While looking at it all, I've also noticed a few more interesting things. I've turned the classifier mode on again (and improved it a bit to make the gradients more stable), and noticed that it drives the error on the training set down a lot. The classifier mode tries to make sure that the training cases that produce the wrong result get more attention by multiplying their gradients, which is essentially equivalent to adding 99 more copies of the same training case. Turn the classifier mode on, and the training error goes down from 0.06 to 0.025 in almost no time. Which means that a lot of this error is caused by the few outliers. And maybe another way to fix it would be to take a larger power in the loss function, say the 4th power instead of square. But the error on the test set doesn't budge much, so maybe this is a misleading measure that doesn't matter much and can cause overfitting.

I've found a bug in my code that computed, how many cases got recognized with a very low confidence: if even with the right outcome the absolute value for the highest outcome was still below 0. Due to this bug, all the cases that were labeled with "0" were counted as low-confidence. So this measure wasn't 17.5% on the training set, it was more like 1%, and in the classifier mode goes to near 0. But on the MNIST test set, the total of misrecognized and low-confidence cases is still near 20%.

The auto-adjustment of the descent rate still works well, and I've bumped it up to be more aggressive: changed the step up from 1.1 to 1.2, and the step down from 0.1 to 0.2. Now I think it exhibits the behavior that I've seen with a higher manually-set rate, where once in a while the error would bump up a little, explore the surroundings, and then quickly drop down below the previous low. And the attempts to start the destructive resonance are still well-arrested.

I've added a printout of gradients by layer, and they do tend to vary in the interesting way, kind of "sloshing about". At the start of the training the high layers usually show the high gradients, and the low layers show the high gradients. But this gradually reverses, and after a few thousand training passes you get the high gradients in the low layers, low gradients in the high layers. Unless you then go and change the some training criteria, then the high layers get the higher gradients again until things settle. Which probably means that selecting separate training rates by layer might be worth a try. Or maybe even selecting them separately for each weight, like the Adam algorithm that I've mentioned before does (I haven't tried this specific algorithm yet).


Sunday, January 8, 2023

trapeze data representation for MNIST

 As I've told before, the trouble with MNIST data set is that the test set contains the cases that are substantially different from any case in the training set. So doing well on it requires some kind of generalization on the training set to be able to correctly recognize the test cases. One idea I've had is to parse the images into the logical parts that look like trapezes.

For example, consider an image of the digit 0. It can be broken somewhat like this:

_   /===\   _
   //   \\
_ //     \\ _
  \\     //
_  \\   //  _
    \===/  

Here going from the top we have a horizontal bar that can be seen as a horizontal band containing trapeze of whitespace, followed by a drawn trapeze, followed by another whitespace trapeze. I've separated it by the underbar marks on the sides. Then the sides go expanding: there is still whitespace on the left, then goes the bar expanding to the left, an expanding trapeze of whitespace in the middle, the bar expanding to the right, and another whitespace on the right. Then it goes symmetrically shrinking on the bottom side, and completes with another horizontal bar. There could also be a vertical part on the sides in the middle. Once we recognize the break-down like this, it doesn't matter any more, what exact size is the symbol 0, it's still the same symbol. It gets harder for the many handwritten variations but maybe we get enough samples to build a recognition of all the major ones. So I've decided to try this path wit MNIST.

Recognizing the trapezes in a strict black-and-white image is much easier than with the halftones. So the first thing I've doe was to generate a B&W image by filtering on a fixed level. I've done this, and since I've had it, I've tried to train a model on it. And it did a little worse than on the halftone image.

Well, the next simple approximation is to encode the rows of pixels as run-length. So I've done it, and I've tried to train a model, and guess what, it has done a little worse yet. Which is a bad sign. Part of the problem is that each row becomes of a different size in runs, and to keep the fixed size, the unused runs at the end have to be filled with something. Originally I've filled them with 0s, and then I've tried to fill them with -1. Which made thins a little better but still a little worse than plain B&W image.

At this point the indications were kind of bad but if I didn't try the trapezes, I wouldn't know, right? So I've tried. I've encoded each trapeze with 3 numbers: 

(1) The average slope coefficient of its left side, computed as dx/dy (since a vertical line is possible and a horizontal line is not). Which is also happens to be the slope of the right side of the previous trapeze. It's average because in reality the pixels represent the curved lines, and I just approximate them with the straight trapezes.

(2) The width of the top of the trapeze

(3) The width of the bottom of the trapeze.

To get the ordering of whitespaces and drawings right, I always start each row of trapezes and also end it with a whitespace, and whatever gets left unused gets filled with -1 for widths and 0 for slope.

There is also one exception: since the leftmost whitespace trapeze always has its left slope vertical (i.e. 0), I've reused that value to put the row height into it.

And the result? By now you've probably guessed it, it was a little worse yet (and a lot slower to compute each pass because of the increased input size) than the run-length version. So this representation has turned out to be not very good for generalization, although not terribly bad either. Maybe it can be twisted in some way to make it better. Maybe put it back into a bitmap by treating each trapeze as one pixel, but by now I'm rather pessimistic on these prospects :-)

While experimenting with it, I've tuned the autoRate2 coefficients to make them a little more aggressive, and they had proved themselves quite well. To check how my auto-scaled momentum measures against the classic stochastic descent, I've tried that one too, and the stochastic version did noticeably worse. I've also tried going to Leaky RELU activation, and that one did a little worse yet. So I think at least the descent part and my Corner activation function are working decently well.

Thursday, January 5, 2023

unmelting a neural network

I've got wondering, could a neural network that experienced a "meltdown" from a too-high training rate be restored? I've got such an example of MNIST and dumped out its contents. What happened inside is that pretty much all the weights got pushed to the values of 1 and -1, the neurons becoming very much the same. So when trying to train it again, this lack of, pardon my French, diversity, wouldn't let the optimization to progress far: it just optimizes this one neuron state per layer in many copies.

For some time I've had an idea of how this problem could be resolved: by a partial randomization of gradients. So I've done this with a simple saw-toothed "randomization" where each next weight gets its gradient reduced slightly more up to a limit, then it drops back down and the next "sawtooth" starts. The starting position of the teeth gets shifted by one on each training pass. 

I've started by "tweaking"  in this way 1% of the gradient size (I've named this option of FloatNeuralNet tweakRate_), and combined it with options momentum_ and autoRate2_. It did work some but barely made a scratch.  OK, so I've bumped it up to 30%. It did work much better but still not quite great. Over about 300K training passes it got to the mean-square error of about 0.15 (by comparison, the normal training from a random point gets to about 0.05 in 10K steps) and would not budge much any more. The verification on the test cases was much closer: mean-square error of about 0.24 instead of 0.19 and the error rate of about 7% instead of 5.5%. So it might be not that bad after all. The combination of the new auto-rate and momentum descent worked great at preventing another meltdown. Interestingly, at the start all the gradient was concentrated in the last layer of 3, and then it gradually shifted towards the first layer.

Then I've tried the same tweak rate of 0.3 for the training from a random initial state, and it didn't have any detrimental effect at all, even did slightly better than without it. So it should be safe to use in general case as a cheap preventative measure.

This also gave me an opportunity to look more into the tuning of the auto-rate algorithm, which I've made a bit better, and also look into what gradients are where. As it turns out, the highest gradient dimensions by far are at the weights that have reached the [-1, +1] boundary, and they skew the norm2 and mean-square of the gradient a lot. When I've changed the code to mark such gradients post-factum as 0, that gave me an opportunity to count them separately and to exclude them from the means. Their number grows early in the draining and then gradually reduces (but not to 0). 

How about if we raise the boundaries, this should reduce the number of the dimensions hitting them and make the training faster, right? And it also would be a good test of the new auto-rate algorithm, since as I've shown before the weights over 1 are much more susceptible to meltdowns. I've tried the boundaries of 10 and 100. The auto-rate worked great but the training got slower. For all I can tell, the higher weights trigger more often the situations where the auto-rate algorithm drops the training rate down, and the rate tends to be 10 to 100 times lower.

But the bad news for the auto-rate logic is that manually picking a just-high-enough training rate still ultimately produces a slightly better result. The auto-rate algorithm starts with a similar rate but then gradually drops it by about 3 orders of magnitude. And as I've been watching the mean-square errors pass by pass, I could see that they showed differently: the fixed-rate algorithm would periodically have the error grow and maybe even stay up a while but then drop lower than where it was before, the auto-rate algorithm tends to just chisel away at the error rate little by little, it still has the error grow a little periodically but squashes it very fast. So perhaps the conservatively low rate gets the function trapped in some local minimum, while the fixed rate breaks out of them (when it doesn't lead to a meltdown). If I let the auto-rate algorithm grow the rate more, and then drop when it gets out of control, it actually does worse. But maybe some better adaptivity could be devised. 

And/or maybe bring the stochastic descent back into the mix. I've been computing the full gradient because this way any kinds of postprocessing represent a relatively low overhead done once per training pass, doing the same after each training case would be very slow. But it's much more resistant to the too-high descent rate, and should be able to shake out of the local minimums better. So maybe they can be combined, doing a few passes stochastic then a few passes deterministic, and so on, with the rate computed at the deterministic passes fed to control the stochastic passes.

Monday, January 2, 2023

optimization 13 - using the gradient sign changes

 When I've previously experimented with FISTA momentum descent, one thing that worked quite well was detecting when some dimension of the gradient changes sign, and then resetting the momentum in this dimension to 0.

One of the typical problems of the momentum methods is that they tend to "circle the drain" around he minimum. Think of one of those coin collection bins for charity where you get a coin rolling down the trough, and then it circles the "gravity well" of the bin for quite a while before losing momentum and descending into the center hole. This happens because the speed (momentum) of the coin is initially directed at an angle from the minimum (the center hole). And the same happens with the momentum descent in optimization, the momentum usually develops at an angle, and just as the current point gets it close to the minimum, the momentum carries it by and away to the other side of the "well", where it will eventually reverse direction and come back, hopefully this time closer to the minimum. But there is a clear indication that we're getting carried past the minimum: the sign of the gradient in the direction where we're getting carried past changes. So if we kill the momentum in this direction at this point, we won't get carried past. It helps a good deal with the quicker convergence.

An overshot is not the only reason why a gradient's dimension might be changing sign. Our "virtual coin" might be rolling down a muti-dimensional trough,  oscillating a little left and right in this trough. But there killing the momentum wouldn't hurt either, it would just dampen these oscillations, which is also a good thing.

Which gave me idea that once we have this, there is no point to the gradual reduction of momentum that is embedded into FISTA through the parameter t. If the momentum reduction on overshot gets taken care off as described above, there is no point in shrinking the momentum otherwise.

And then I've thought of applying the same logic to estimating the training rate. I've been thinking about the Adam method that I've linked to in a recent post, and it doesn't really solve the issue with the training rate. It adjusts the rate between the different dimensions but it still has in it a constant that essentially represents the global training rate. In their example they had just come up with this constant empirically, but it would be different for different problems, and how do we find it? It's the same problem as finding the simple raining rate. The sign change detection to the rescue.

After all, what happens when the too-high rate starts tearing the model apart? The rate that is too high causes an algorithm step to overshoot the minimum. And not just overshoot but overshoot it so much that the gradient grows. So on the next step it overshoots back even more, and the gradient grows again, and so on, and so on, getting farther from the minimum on each step. And the momentum methods tend to exacerbate this problem.

So if we detect the dimensions that change sign, and see if the gradient in them grows, and by how much, and how it compares with the gradient in the dimensions that didn't change sign, we'd be able to detect the starting resonance and dampen it by reducing the training rate. I've tried it, and it works very well, check out https://sourceforge.net/p/triceps/code/1792/tree/, it's the FloatNeuralNet option autoRate2_. So far the tunables for it are hardcoded, and I think that I've set them a bit too conservatively, but all together it works very well, producing a little faster training than I've seen before, and without the meltdowns. 

Another thing I've changed in the current version is the logic that pushes the rarely-seen unusual cases to be boosted for a better recognition. It previously didn't work well with the momentum methods, because it was changing the direction of the gradient drastically between the passes. I've changed it to make the boosting more persistent between the passes, and instead of shrinking the gradients of the correct cases, to gradually grow the representation of the incorrect cases, expanding their gradients. It's still a work in progress but looks promising.

Oh, and BTW, one thing that didn't work out was the attempt to boost similarly the cases that give the correct answer by having the highest output point to the right digit but do this at a very low confidence, so that even the best output is below 0 (and sometimes substantially below 0, something like -0.95). All I could do was shrink the percentage of such cases slightly, from about 17.5% to about 16.5%. And I'm  not sure what can be done about it. I guess it's just another manifestation of a great variability of handwritten characters. Maybe it could be solved by vastly growing the model size and the training set, but even if it could, it would be nice to find some smarter way. Perhaps a better topological representation of the digits instead of a plain bitmap would do the trick but I don't know how to do it. One of the theories I've had was that it's caused by a natural trend towards negative numbers, because in each training case we have one output with 1 and nine outputs with -1. So it we changed the negatives to say -0.1, that would pull the numbers higher. But that's not a solution either, it just moves the average up, diluting strength of the negatives.

a quick test of theory about MNIST

 I've had this theory that the test set in MNIST just contains the digit images that are substantially different from the training set, and this is the reason for why they're not recognized well. I've come up with a quick way to test this theory: just merge the test set into the training set, and see if it make any difference.

It does. When the test set is included into training, it gets recognized very well. There still usually are a couple of images that have issues but that's down from about a hundred. I think it tells us in two ways that the test set contains different images:

(1) If they start getting recognized when trained on them, this means that the training set just doesn't train for the right thing.

(2) When the NN has a hard time even training on some images, this means that the abilities of the NN are getting stretched, that it doesn't have enough brain power to cope with all the different possibilities.

Then I've tried adding some brain-power to the NN. First I went back to the original 16x16 images instead of the shrunk 8x8 but kept the width of the NN layers the same. This grew the cost of the first layer 4-fold but the others stayed the same, so the code didn't slow down so much. This did help some, both with the original and with the mixed training set. But not spectacularly. Then I've also expanded the first layer of neurons 4-fold. This made things slower yet, and provided another improvement, but again nothing spectacular.

I think there are just too many ways to draw the digits - slant one or another stroke a little more or a little less, or use a thicker pen, and the digit suddenly has a stroke where it used to have a hole and the other way around, and the NN gets confused. Also, some digits turn out to be unexpectedly similar, such as 6 and 2. When we draw them by hand, they grow loops around the corners, and more loops where the pen doesn't quite lift between the digits. So both 6 and 2 and up looking as a vertical loop with an opening that's connected to a horizontal semi-loop. The only difference is in the direction: 6 has the vertical loop opening on the right and the horizontal loop opening on the left, while 2 has the vertical loop opening on the left and horizontal loop opening on the right. And the way the strokes shift around, it's easy for the NN to get confused.

I don't know how do they do the handwriting recognition in the reality. My guess is that some preprocessing of the images that extracts the topology of the strokes into a more explicit representation should help a lot. I guess it's also possible to generate more of the sample images for the training set by stretching the existing ones in different ways but that sounds like a dead end. A variation of it might be to use the batching again, but this time include only the images of the same digit into a batch.