Sunday, December 29, 2024

realSEUDO and evaluation of activation profiles

(This is part 3 of the discussion of realSEUDO.)

When the activation profiles are generated, there are different ways to scale them that basically depend on how we normalize the reference images of the cells.

For example, apparently one traditional way to normalize them is to make the sum of all the pixel values equal 1. A consequence of this is that the larger cells have bleaker reference images, and so the coefficients found by the least-squared-error fit to match the same absolute brightness in the frame will be higher, since the formula is basically (coefficient = frame_brightness / reference_image_brightness).

The SEUDO (noise suppression) algorithm does a different normalization: it considers the cell image as a multi-dimensional vector (with each pixel being a dimension) and normalizes it to a euclidean length (i.e. norm2) of 1. This way all the cell images and all the gaussian kernels (that get normalized in the same way) have the same euclidean length and duke it out on an equal footing in this parameter. But it still means that the larger cells will have bleaker images with the same consequences, although less so than in the first kind of normalization.
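
To make this concrete, here is a small sketch (in Python, with made-up pixel values) of the two normalizations and their effect on the least-squares coefficient; the names and numbers are illustrative, not from the actual code.

```python
import numpy as np

# A toy "cell image" flattened into a pixel vector (values are made up).
cell = np.array([0.0, 0.2, 0.8, 1.0, 0.8, 0.2, 0.0])

ref_sum1 = cell / cell.sum()             # pixel values sum to 1
ref_norm2 = cell / np.linalg.norm(cell)  # euclidean length (norm2) of 1

# A frame where this cell lights up at some absolute brightness.
frame = 3.0 * cell

# Least-squared-error coefficient for a single reference image r and frame y:
# coefficient = (r . y) / (r . r), i.e. frame brightness over reference brightness.
def ls_coef(ref, frame):
    return (ref @ frame) / (ref @ ref)

print(ls_coef(ref_sum1, frame))   # bleaker reference image => larger coefficient
print(ls_coef(ref_norm2, frame))  # less extreme, but still depends on the cell size
```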

The differing coefficients mean that the activation profiles of different cells are not really comparable: they will all be scaled differently. It also means that as the cell images are being learned and change frame-to-frame, the coefficients they produce in the recognition part will differ wildly. RealSEUDO solves the last problem by generating updates to the coefficients of the previous frames as the learning happens. Fortunately, the updates are easy: since the reference images are scaled by a multiplier, the recognition results can be easily adjusted by changing this multiplier (which for them is actually a divisor, since the relation is reciprocal).
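
For illustration, a minimal sketch of that reciprocal adjustment (hypothetical function name, not the actual realSEUDO code):

```python
import numpy as np

# If the learned reference image gets rescaled by a factor s (new_ref = s * old_ref),
# the already-produced coefficients can be fixed up without re-running the fit:
# new_coef = old_coef / s, since coefficient and reference scale are reciprocal.
def rescale_coefficient_history(old_coefs, s):
    return np.asarray(old_coefs) / s

print(rescale_coefficient_history([0.1, 0.5, 2.0], 2.0))  # [0.05 0.25 1.  ]
```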

But I think that in general the property of comparability between different cells and between different recordings is very important, and nobody seems to pay much attention to it. From the mathematical standpoint it's "we can always normalize to the mean and variance". But that normalization works only if we have seen all the cells reach the same maximal absolute brightness. If some cell never reached more than half that, the normalization will make it twice as bright as it was. But if in some session it never reaches more than half brightness and in another session it lights all the way up, its profiles won't be comparable not only to the other cells but even to the same cell in another session. That's why I think that we should look for some absolutes.

The differing coefficients also create a different kind of inequality in recognition: the larger cells with higher coefficients get an advantage in LASSO and experience a weaker push towards 0, while the gaussian kernels, being usually smaller than even the small cells, get a stronger push towards 0. Which might be good or not. I haven't experimented with this, so I don't know which way is better in reality, but it definitely looks like something worth experimenting with.

To make the profiles comparable, I think the right way is to put them into the range between 0 and 1, at least more-or-less. The easy way to look at it is that we take the range of pixel brightness and remap it to [0, 1], then we take the brightest activation of a cell and rescale its brightness so that the brightest saturated pixels also become 1. Then the maximal coefficient also becomes 1. There are a couple of problems left though. 

First, it might affect LASSO, and a different scaling might work better for LASSO. This is easy to solve: since the coefficient changes reciprocally to the scaling of the reference image in normalization, LASSO can be run with one kind of normalization and then a single multiplication operation can convert it to another kind.

Second, there is the problem of noise, at both the low and high ends. At the low end the black isn't black, it's a kind of frothing gray with some level of background lighting. At the high end there are very bright pixels that go above the normal brightness. At the low end realSEUDO makes an assumption that the cell activations aren't very dense, and so the median pixel brightness in the image approximates the average background lighting, and then the difference between that and the 10-percentile brightness represents the level of background noise (although these levels may need to be adjusted if the assumptions are incorrect, and there is also an adjustment for unevenness of background lighting). So we take this difference and go up by that much from the median to cut the noise off, and that becomes our 0 level. At the high end it collects the brightest levels of every pixel that ever was assigned to the cell, and then takes not the very brightest ones but one standard deviation up from the mean brightness as the reference brightness, to cut off the very bright anomalous pixels.
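
Here is a rough sketch of those two estimates, assuming the simple form described above (the actual realSEUDO adjustments for density and uneven background lighting are not shown):

```python
import numpy as np

def zero_level(frame):
    # Assuming sparse activations, the median pixel is close to the average
    # background, and the gap down to the 10th percentile approximates the
    # background noise level; step up from the median by that much.
    med = np.median(frame)
    noise = med - np.percentile(frame, 10)
    return med + noise

def reference_brightness(peak_per_pixel):
    # peak_per_pixel: the brightest value ever seen for each pixel assigned
    # to the cell; use mean + one standard deviation instead of the absolute
    # maximum, to cut off anomalously bright pixels.
    peaks = np.asarray(peak_per_pixel, dtype=float)
    return peaks.mean() + peaks.std()
```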

And here we come to the actual comparison that tries to tell which results are better and which are worse. The trouble is that in a real recording we don't have the ground truth. Instead the CNMF recognition is used as the assumed ground truth, so the comparison is not really how good the recognition is but how close it is to CNMF. If you make something better than CNMF, you'd get a worse rating than if you match CNMF exactly. This could be resolved by generating artificial images of known pure signal with noise added on top. And there is a package, NAOMi, that does that. However, unbelievably, NAOMi doesn't save the original profiles it used for generation, producing the CNMF-based recognition as a reference instead, re-creating the same issue.

So a consequence of this is that even though SEUDO is supposed to clean up the noise, in reality it produces the best correlation to CNMF when tuned for a very, very mild noise suppression, much milder than what the original SEUDO paper found to be the optimal trade-off between suppressing the noise and introducing distortion.

Then the comparison by correlation has a couple of fine points. One is that the selection of zero level matters. Raising the zero level cuts off all the lower fluctuations, and that effect doesn't get undone by the correlation's normalization. The best correlation will be with the same selection of zero level.

Another is that the lateral shifts along the time axis matter a lot. The very first step of noise suppression in realSEUDO is done by averaging a few frames, and the way it tags the averaged frames is off-center, by the last included frame. So shifting the trace to almost center the averaging makes a big improvement on the correlation. Not quite centering though: for the best result it still has to lag by a frame. Which makes me think that this lag by one frame (or, another way to put it, an off-by-one difference in frame numbering) is also built into the CNMF output that is used as the reference.
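
A small sketch of how the effect of a time shift on the correlation can be checked (illustrative code, not from the paper):

```python
import numpy as np

def correlation_at_lag(a, b, lag):
    # Positive lag shifts trace b later in time relative to trace a.
    if lag > 0:
        a, b = a[lag:], b[:len(b) - lag]
    elif lag < 0:
        a, b = a[:len(a) + lag], b[-lag:]
    return np.corrcoef(a, b)[0, 1]

def best_lag(a, b, max_lag=5):
    # Pick the shift that maximizes the correlation between the two traces.
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: correlation_at_lag(a, b, k))
```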

But comparing the time profiles is insufficient for evaluating the match: the shape of the cells matters too. Figure 4 in the appendix to the realSEUDO paper contains a small example of matching profiles together with cell shapes as detected by different algorithms. This is a small subset of the full matching that was auto-generated. It's technically a post-processing step, not a part of realSEUDO, but if we want to compare the different algorithms, we need to build some automated ways to find the differences in their outputs. Our version of differencing reused a part of the realSEUDO logic. Unlike correlation that produces a close-or-not score, realSEUDO's evaluation differentiates separate cases, with a score for what it thinks is the best fitting case (a toy illustration follows the list):

  • Two shapes are unrelated
  • One image subsumes another closely enough to be two versions of the same image
  • One image subsumes another and is a combination of multiple images
  • Two shapes overlap but are distinct
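
As a toy illustration of telling these cases apart (this is not the actual realSEUDO scoring, just a sketch based on overlap fractions of two binary masks; the combination-of-multiple-images case additionally needs the other candidate images, which is omitted here):

```python
import numpy as np

def classify_overlap(mask_a, mask_b, same_thresh=0.8, overlap_thresh=0.1):
    # Thresholds are made up for illustration.
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    frac_a = inter / a.sum()  # how much of A lies in the overlap
    frac_b = inter / b.sum()  # how much of B lies in the overlap
    if max(frac_a, frac_b) < overlap_thresh:
        return "unrelated"
    if min(frac_a, frac_b) >= same_thresh:
        return "two versions of the same image"
    if max(frac_a, frac_b) >= same_thresh:
        return "one subsumes the other"
    return "overlapping but distinct"
```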

Just as realSEUDO uses these scores for merge-or-split decisions, the same can be applied to matching, in both space and time. And so, for example, it recognized that CNMF's cell 35 got found in a similar shape by realSEUDO as cell 4, but by OnACID as two separate cells 7 and 22. Of course, without knowing the ground truth we still can't tell which way is more correct, other than by eyeballing the video at the times of discrepancies.

Sunday, December 22, 2024

LASSO-ISTA-FISTA

This is a follow-up to the previous post, talking about the details of the optimization algorithms (in the math sense, i.e. trying to find a minimum of a function by gradient descent) LASSO-ISTA-FISTA that we've modified for realSEUDO. As a reminder, FISTA is a momentum gradient descent on top of ISTA, which is the logic for selecting the step size for the simple gradient descent, on top of LASSO, which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The amount of this bias is adjusted by the parameter lambda.

LASSO minimizes a function like 

||f(x) - y||^2 + lambda*|x|

Here x is a vector (you can think of it as coordinates in an n-dimensional space), in our case of analyzing the brain cell videos it's the vector of activation levels for all the cells and all of SEUDO's gaussian kernels. y is also a vector (of a different size) that represents the brightness of all the pixels in an image (an image can be also thought of as a matrix but here it doesn't matter, we just string out all the rows into a vector). f(x) is a vector-to-vector function that maps the activation levels to the image in pixels, in our case multiplying the pixel values in the dictionary image of each cell by its activation level and then adding up the brightness of each pixel from all the sources. So f(x)-y is the error between the actual image and our match, and also a vector. Then the operation ||_|| converts the error vector to a single number by taking its euclidean length (AKA norm2) in n-dimensional space (with the n here being the size of vector y). Taking its square lets us skip the square root of the euclidean computation, since both with and without the square root the minimum is at the same value of x. The bias is the lambda*|x| part. |x| is the norm1 of the vector, computed as the sum of absolute values of all the elements, and lambda adjusts the strength of its effects.
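
For the linear case, where f(x) = A*x and each column of A is a flattened dictionary image (a cell or a kernel), the objective and the gradient of its smooth part look like this (a sketch, not the production code):

```python
import numpy as np

def lasso_objective(A, x, y, lam):
    # ||A x - y||^2 + lambda * |x|
    resid = A @ x - y
    return resid @ resid + lam * np.abs(x).sum()

def smooth_gradient(A, x, y):
    # Gradient of ||A x - y||^2 alone; the bias term is handled separately.
    return 2.0 * A.T @ (A @ x - y)
```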

In case you wonder what these functions look like, here is a simple example. It's a smooth function with creases added along the axes by the bias function. The effect of the bias is that the gradient in every dimension gets lambda added to it on the positive side and subtracted on the negative side. And since the minimum is at zero gradient, this makes the dimensions with shallow gradients around 0 (which means that they don't matter much) get pushed to 0.

The classic way to handle the LASSO bias is to interleave the descent steps: one according to the gradient of the main function, another one according to the gradient of the bias function, which still requires computing the gradient of the main function at the new point. Which not only doubles the number of heaviest steps (gradient computations) but also tends to dither around the minimum, taking steps in different directions according to the tug-of-war of two gradients. Well, this is stupid, so why not just add the two gradients together to start with? This worked very well, but turned out to be not new: someone beat us to it not that many years ago.

By the way, in our case we're only looking for non-negative values, so we just cut off the negative side, looking at only one n-dimensional quadrant, and then the gradient of the bias is simply lambda in each dimension, which makes the computation a tiny bit cheaper, and also a bit easier to reason about, since the creases get cut off and the function becomes smooth over the whole acceptable range.
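
On the nonnegative quadrant this combined step becomes very simple: add a constant lambda to the smooth gradient and project back to x >= 0 (again a sketch, assuming the linear f(x) = A*x from above):

```python
import numpy as np

def combined_step(A, x, y, lam, step):
    # Smooth gradient plus the constant bias gradient (lambda in each dimension).
    grad = 2.0 * A.T @ (A @ x - y) + lam
    # Single descent step, clipped to the nonnegative quadrant.
    return np.maximum(x - step * grad, 0.0)
```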

The goal of ISTA is to select the gradient step that is as big as possible but doesn't overshoot the minimum. The safe step size depends on the maximum steepness, i.e. the maximum second derivative of the function that can be encountered in stepping towards the minimum (which basically means between the starting point and the minimum, unless we overshoot), known as the Lipschitz constant L; the step size is its reciprocal 1/L. This limit comes basically from Newton's method: we want to get to the point where the first derivative is zero, so how far can we step without overstepping that point? 1/L. In fact, if the second derivative is constant, this step would bring us exactly to the minimum.
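
For our quadratic ||A x - y||^2 the Lipschitz constant can be computed directly, since the second derivative (the Hessian) is the constant matrix 2*A^T*A (a sketch):

```python
import numpy as np

def safe_step(A):
    # L is the largest eigenvalue of the Hessian 2 * A.T @ A,
    # i.e. twice the squared spectral norm of A; the safe step is 1/L.
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    return 1.0 / L
```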

The momentum descent in FISTA makes the good estimation of the basic step less important but still plenty important, as the big steps grow the momentum much faster. TFOCS (the Matlab library for gradient descent) does a periodic sampling of the current steepness by looking at the change of gradient during the step. But TFOCS is a general library, could we do better for our specific kind of function? We've looked into the static estimation of the magnitude of the second derivative, and it worked worse than TFOCS. But out of the ways of estimation, one that worked better was to take a separate derivative in each dimension (i.e. for dimension m the sum of dx_i/dx_m for all i) and use it separately for that dimension. This is something that only an engineer would do and a mathematician wouldn't: try something and see if it works. And this something is "compute and use a separate L in every dimension", so that in stepping the gradient gets distorted according to these partial values of L, slowing down the stepping in the dimensions with a steep second derivative and speeding up in the dimensions that are nearly flat. Now, as I've come back to it and looked at it again analytically, my first reaction was "WTF, this can't make things better!" but it did. I even went back to the code and tried again, and it still did.
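
A hedged sketch of the per-dimension L idea for the quadratic case (the exact per-dimension estimate used in realSEUDO may differ from the column sums shown here):

```python
import numpy as np

def per_dimension_L(A):
    # One possible per-dimension steepness estimate: the column sums of the
    # absolute Hessian 2 * A.T @ A.
    H = 2.0 * A.T @ A
    return np.abs(H).sum(axis=0)

def scaled_step(A, x, y, lam, L_dims, eps=1e-12):
    grad = 2.0 * A.T @ (A @ x - y) + lam
    # Distort the gradient by the per-dimension L: slow down where the second
    # derivative is steep, speed up where it is nearly flat.
    return np.maximum(x - grad / np.maximum(L_dims, eps), 0.0)
```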

The thing is, as you get more than two dimensions, the negative terms with x_i*x_j kick in and the function stops being consistently convex in all dimensions, the gradient doesn't point exactly towards the minimum any more, and the second derivative along the direction of the gradient stops being constant. So the static estimation worked worse than the sampling done by TFOCS, and we switched to the sampling too. But then how about applying the idea of a separate L by dimension to the sampling? As it turns out, it still worked well, although not always. In all the early examples we tried, it consistently worked better than TFOCS, but then we've encountered more examples where it worked a little worse too. It all depends, and maybe there is a way to pick and choose which way to use for a specific case.

And then the idea that we can treat the dimensions separately came in handy for solving the overshooting problem with the momentum descent, by stopping the momentum in the dimensions that either have the gradient change sign or go out of the acceptable range. This worked very, very well. And it didn't leave much of FISTA as it was: what remains is a momentum descent with brakes, where the gradually growing brakes got replaced with the sudden stopping by separate dimensions.
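
A minimal sketch of that per-dimension stopping (with hypothetical parameter names; the real code differs in details):

```python
import numpy as np

def momentum_step(x, velocity, grad, prev_grad, step, beta=0.9):
    # Kill the momentum in any dimension where the gradient changed sign
    # (overshoot past the minimum along that axis).
    velocity = np.where(np.sign(grad) != np.sign(prev_grad), 0.0, velocity)
    velocity = beta * velocity - step * grad
    x_new = x + velocity
    # Kill it also where the variable tries to leave the acceptable range
    # (here, the nonnegativity constraint), and clip back into range.
    velocity = np.where(x_new < 0.0, 0.0, velocity)
    return np.maximum(x_new, 0.0), velocity
```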

All these ideas can be applied to any momentum descent, including TFOCS and other libraries. Whether they would work well depends on the particular functions being minimized, but that's OK, that's why TFOCS already has multiple options to be used in different situations. Although the fact that the sudden stopping seems to work decently well even with such jagged functions as in the NNs suggests that it's likely to work well in many if not all cases.

Triceps and realSEUDO at NeurIPS

I've just come back from the NeurIPS 2024 conference. With Triceps listed as my organization. I've done a project mentoring my family member at JHU, and well, since it's not work-related, using my day job would be wrong, and I'm not at JHU either (and I've paid for attending the conference myself, JHU turned out to be too cheap for that). So Triceps it is, especially since Triceps is used in the project. And that's, by the way, exactly the project I've mentioned before as an application of Triceps, just I couldn't tell the details before it got published. Well, now we've had a poster session at NeurIPS (which people tell me is kind of a big deal, I've just had fun), and the paper is out: "realSEUDO for real-time calcium imaging analysis" at https://neurips.cc/virtual/2024/poster/94683 or https://www.researchgate.net/publication/380895097_realSEUDO_for_real-time_calcium_imaging_analysis or https://arxiv.org/abs/2405.15701. Here is what it does, in simple words.

Calcium imaging is a brain imaging technique for a live (although not exactly unharmed) brain. People have bred a strain of mice whose neurons light up when they activate. So they take a mouse, open up its skull, and stick a camera with a microscope on it. This allows recording a movie of a living brain where you can see what is going on in it literally cell by cell, the brightness of the cells corresponding to their activation level.

Once you get the movie, you need to analyze it: figure out where the cells are, and their activation levels. The "golden standard" of the detection algorithm is CNMF. It solves an optimization problem (in the mathematical sense, i.e. finding the minimum of a function that represents the error), for two mappings at once: what pixels belong to what cell, and what cells activate at what level in each frame (activation profile). More exactly, it deals not with separate pixels but with gaussian kernels. "Kernel" in the math speak is any repeatable element (so for another example, "CUDA kernels" are the CUDA functions that are repeated across the GPU), but in this case it's a sprite that gets stenciled repeatedly, centered on each pixel. "Gaussian" is the shape of the sprite, a 2-dimensional representation of the normal distribution where the brightness corresponds to the weight, so it's an approximation of a circle that is brightest in the center and then smoothly goes to nothing at the edges.
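
For reference, a gaussian kernel sprite is just this (the size and width here are arbitrary, not the values used in the paper):

```python
import numpy as np

def gaussian_sprite(size=15, sigma=3.0):
    # A 2-D gaussian: brightest in the center, smoothly fading to nothing
    # towards the edges of the sprite.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
```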

The trouble is that CNMF works only on the whole movie after it has been collected. You can't use it to decode frame-by-frame in real time because it needs to see all the frames at once. So you can't identify cells and then immediately use this knowledge to alter the stimulus and see what effect this has on the brain. There is another algorithm, OnACID, that does frame-by-frame decoding using a variation of the CNMF logic, except that it requires a starting period of 100 frames or so to get the initial cells, and then the quality of identifying the cells is much worse than with CNMF, and it's still not very fast, substantially slower than 30 frames per second even on the 80-CPU machine we used.

In the same area but in a little different niche, SEUDO is the JHU professor's previous project, a technique to reduce noise in the activation profiles. It can be used together with the other algorithms. Aside from the random noise, there is an inherent cross-talk: the neurons are stacked in multiple layers and can have very long dendrites that can overlap with many other neurons, adding little weak light spots that get registered as weak activations on these other neurons. So the idea there is that to recognize the activations we use both the known cell shapes and gaussian kernels centered on every pixel to fill the image, and solve the minimization problem on all of them, then throw away the values for the gaussian kernels, thus discarding the cross-talk from unknown sources. The little light spots that don't fit well with the known cells get attributed to these gaussian kernels, and so the cell activation profiles become smoother. The problem is that it's slow, and quickly becomes slower as you increase the size of the kernels. If you take a kernel of 30x30 pixels, you get an extra 900 coefficients in the quadratic equation you're trying to minimize (and that's for every kernel, which gets centered on every pixel).
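
Schematically the SEUDO idea looks like this (a toy sketch: a plain least-squares fit stands in for the actual LASSO solver, and the kernel placement is simplified; cell_images and kernels are assumed to be lists of flattened images):

```python
import numpy as np

def seudo_fit(cell_images, kernels, frame):
    # Dictionary columns: known cell images first, then the gaussian kernels.
    A = np.column_stack(cell_images + kernels)
    coefs, *_ = np.linalg.lstsq(A, frame, rcond=None)
    n_cells = len(cell_images)
    # The kernel coefficients absorb the cross-talk and are thrown away;
    # only the cell activations are kept.
    return coefs[:n_cells]
```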

The goal was to get this whole thing to work in real time (i.e. at least 30 fps), and we've achieved it. It works in real time (even including SEUDO, which is really an optional extra processing on top), and produces quality that is generally way better than OnACID, and fairly close to CNMF. Well, arguably it's at least sometimes better than CNMF, but the problem is that there is no existing accepted way to rate "better or worse" (the unscientific method is by eyeballing small fragments of the movies); the accepted way of rating is "same or different from CNMF". I'll talk a bit more about it later, in a separate post.

So the work really has 5 parts:

  1. The optimization (in the programming sense) and parallelization in C++ of the optimization (in the math sense) algorithm
  2. Improvements to the optimization (in the math sense) algorithms
  3. An off-shoot of trying to apply some of these improvements to optimization algorithms, and more, to the neural network training problem (the things I did in Triceps)
  4. Improvements to the SEUDO algorithm
  5. The completely new logic for automatically finding the cells in the movies (that got the name of realSEUDO)

The first part is kind of straightforward (read my book :-) with only a few things worth mentioning. First, outdoing Matlab and the TFOCS package in it is not easy, it's already well optimized and parallelized, but doable. There are two keys to it. One is to match what TFOCS does: represent the highly sparse matrix of coefficients by "drawing" the sprites and thus regenerating the matrix on the fly whenever it gets used, otherwise the matrix just won't fit into memory. The computation of a gradient requires two passes of drawing. TFOCS does it by generating two complementary drawing functions. We use two passes with one drawing function but computing different sums: the first pass goes by columns and computes the intermediate sums (common subexpressions) that get stored, then the second pass goes by rows and computes the final sums from the intermediate ones. The two passes with stored intermediate values reduce the complexity from O(n^3) to O(n^2) compared to doing everything in one pass. The second key was to make the pixel drawing function into a template rather than a function call, because the overhead of a function call in the inner loop has a higher cost than the rest of the computation. And another thing that needs mentioning is that to pass data between Matlab and C++ code you have to use Matlab's C API. Their C++ API is extremely slow, adding much more overhead than the computation itself.
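
A hedged sketch of the two-pass gradient with the on-the-fly "drawing" (here draw_sprite(j) is an assumed function returning the pixel indices and values of sprite j; this is an illustration, not the actual C++ code):

```python
import numpy as np

def gradient_two_pass(draw_sprite, n_sprites, x, y):
    # Pass 1, by columns: accumulate the residual A @ x - y pixel by pixel,
    # storing it as the common intermediate sums.
    resid = -y.astype(float)
    for j in range(n_sprites):
        idx, vals = draw_sprite(j)
        resid[idx] += x[j] * vals
    # Pass 2, by rows of A.T: redraw each sprite and dot it with the stored
    # residual to get the gradient of ||A x - y||^2.
    grad = np.empty(n_sprites)
    for j in range(n_sprites):
        idx, vals = draw_sprite(j)
        grad[j] = 2.0 * (vals @ resid[idx])
    return grad
```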

But that's still not fast enough. Then it was the turn of improving the optimization (in the math sense) algorithms. There are three: FISTA, which is a momentum gradient descent on top of ISTA, which is the logic for selecting the step size for the simple gradient descent, on top of LASSO, which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The amount of this bias is adjusted by the parameter lambda (wait, we'll get to its effect yet).

I've started writing up the LASSO-ISTA-FISTA summary but even in a summary form it got quite long. So I'll put it into a separate follow-up post. The summary-of-a-summary is that a weird but serendipitous experiment gave me an idea of treating the gradient dimensions separately, and that led to the idea of stopping the momentum on overshoot by dimension. The momentum gradient descent is kind of like a ball rolling down a well, observed with a stroboscope: each step of the descent is like the passage of time to the next strobe flash. So the next time we see the ball, it might already be past the minimum, with the momentum carrying it up the opposite wall. Eventually it will stop, go back, and likely overshoot the minimum again, in the other direction. In the real world the ball eventually stops because of friction that drains its energy. In the momentum descent the friction is simulated by gradually reducing the momentum coefficient (which means growing friction). The trouble is, you don't know in advance how much friction to put in. With too little friction, the ball will bounce around a lot, with too much friction it will be rolling in molasses. So people do things like resetting the momentum coefficient after some large number of steps. But if we look at each dimension, there is an easy indication of an overshoot: the gradient in that dimension changes sign. So at that time we can just immediately kill the momentum, and we don't even have to kill the whole momentum, we can kill it in just that one dimension. And this worked really well, to the point that the friction became completely unnecessary. The other place where the sudden stopping helps is when the variables try to go out of range (for example, in our case we have the boundary condition that each dimension is non-negative).

This worked very well for a smooth (well, mostly, except for the creases added by the LASSO bias, but then again the creases are at the sign change boundaries, and we always stay on the positive side) quadratic function. Would it work on the much more jagged functions used in the training of neural networks? This is what I've tried in the NN code in Triceps, and as it turns out, the answer is sort of yes, at least in certain limited ways, as I wrote at length before in the other posts, and the fraction of dimensions experiencing the stopping can also be used to auto-adjust the training step size (i.e. the training schedule: the existing algorithms like Adam rely on preset constants, here it adjusts automatically). It has currently been applied to plain gradient descent only, and it's not beating the stochastic gradient descent in the current form, but on the other hand, it does make the plain gradient descent with momentum not only work at all but work not that much worse than the stochastic descent. Check out the figure with the NN training speed comparison in the appendix. Now at NeurIPS I've learned things that gave me more ideas for marrying this logic with the stochastic descent, so there are more things to try in the future.

For SEUDO, the trick has turned out to be sparsity. It still works decently well when the kernels are placed less frequently, with a spacing of up to about 1/3 of the kernel diameter. So with 30x30 sprites you get a slow-down by a factor of 900, but place these sprites on a grid with a step of 10 pixels, and you get a factor of 100 back. It would probably be even better to place them on a triangular grid (basically, the same grid but with an offset on alternating rows, and different horizontal and vertical step sizes to form equilateral triangles of neighboring centers), to keep the distances between the centers even in every direction, but we haven't gotten around to doing that (yet?).
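
The grid placement itself is trivial (a sketch; the triangular variant is left out):

```python
import numpy as np

def kernel_centers(height, width, kernel_size=30):
    # Place kernel centers on a square grid with a step of about a third of
    # the kernel diameter instead of on every pixel.
    stride = max(1, kernel_size // 3)
    return [(r, c)
            for r in np.arange(0, height, stride)
            for c in np.arange(0, width, stride)]
```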

And the way to make the detection of the cell images in the video (that's what realSEUDO is, for which incidentally the plain SEUDO is optional) fast was to do it with old-fashioned image processing. It goes in two stages (well, and a noise reduction pre-stage, and it also expects that the image is already stabilized): the first stage finds the contiguous spots above the background brightness, then adds time as the third dimension and builds the sausage-like shapes of the cell activations. They are pretty jagged at the ends, as the cells start to light up in little spots, then these spots gradually expand and merge, and going dark again looks similar but in the opposite direction. The sausage eventually gets squashed along the time axis to build the reference cell image. The trouble is, there are multiple layers of cells in the brain, with overlapping cells sometimes activating at the same time. So the second stage is used to figure out these overlaps between the cells: when we see the same cell but at a different brightness level, when we see a completely different overlapping cell, and when both (or more) overlapping cells are lighting up at the same time. So it merges and splits the cell images accordingly, and generates the events telling what it's doing. At the same time it tries to fit the known cells into the current frame and generates the events for the cell activations (Triceps is used for reporting the events as a more efficient way to collect or forward them than Matlab). The splitting/merging code is very important but purely empirical, with a few important tunable coefficients chosen by trial and error at the moment. It's a tour de force of empirical engineering beating science. But perhaps some merging of both approaches is possible, maybe using the detected shapes as a starting point for CNMF-like further optimization. If the starting point is already very close to the optimum, the short path to the optimum would be fast (as we see in reusing the detected activation coefficients from the previous frame as the starting point for the current frame).
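
The first stage can be sketched with off-the-shelf image processing (a toy illustration using scipy; the threshold choice and the later temporal linking into "sausages" are not shown):

```python
import numpy as np
from scipy import ndimage

def find_spots(frame, threshold):
    # Contiguous spots above the background brightness in a single frame.
    labels, n_spots = ndimage.label(frame > threshold)
    return [np.argwhere(labels == k) for k in range(1, n_spots + 1)]
```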

And there are other little things, like partitioning a large image into segments and processing them in parallel, and then splicing the cell shapes and activation profiles back together (this is not a new idea, CNMF does it too). Since all the high-level logic has been done in Matlab, the parallelism is a bit weirdly shaped by Matlab's abilities: there is parallelism on top with the partitioning by segments and on the bottom with the C++ code running FISTA, but there is a good deal of parallelism opportunity in the middle of the splitting/merging logic that isn't used. The trouble is basically that Matlab doesn't have shared read-only data, it starts each parallelization with an expensive copying of the whole relevant state, and then ends by copying back the result. The other reason to use the segments comes from the overhead on the cell shapes: the way Matlab works, the shapes have to be represented as image-sized matrices, quadratically increasing the overhead of all the operations as the image grows. Sparse matrices (which the C++ part of the code already uses) would fix that; there is such a recent new feature in Matlab, or obviously it gets fixed by moving the code from Matlab to another language. BTW, these are much simpler sparse matrices than the ones produced in the FISTA equations: here we know that all the non-0 values are located in a small bounding box, so they're trivial to implement.

There are also things to be said about the evaluation of detected profiles (which also comes up in splicing the segments), but it's a largish subject for another post.