Attending the NeurIPS conference last year gave me the idea to combine momentum descent with stochastic descent (people already do that) and then try my changes on top of that combination. One problem is how to compute the gradients without incurring twice the overhead, but I've realized it can be approximated cheaply by averaging the gradients from the steps of stochastic descent. Yes, they would be computed at different points in the vicinity, but perhaps close enough. So I did a bit of refactoring to make the code a little more structured, and then implemented momentum on top of stochastic descent.
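Roughly, the idea looks something like this sketch (the names and the exact update rule here are just for illustration, not my actual code; I'm assuming the momentum is updated once per pass from the averaged per-sample gradients):

```python
import numpy as np

def momentum_sgd(grad_fn, x, data, lr=0.01, mu=0.9, epochs=10):
    """Sketch: stochastic descent with a momentum (velocity) term.

    Instead of re-evaluating the full gradient for the momentum update
    (which would double the cost), the momentum is fed the average of
    the per-sample gradients seen during the pass, even though each was
    computed at a slightly different point in the vicinity.
    """
    v = np.zeros_like(x)
    for _ in range(epochs):
        grads = []
        for sample in data:
            g = grad_fn(x, sample)
            grads.append(g)
            x = x - lr * (g + v)  # plain stochastic step plus momentum
        # momentum update from the averaged gradients of this pass
        v = mu * (v + np.mean(grads, axis=0))
    return x
```

On a toy quadratic (minimizing the summed squared distance to a few sample points), this converges to the mean of the samples, as plain stochastic descent would, just faster.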
The good news is that the momentum descent works pretty well out of the box, producing about half the error of plain stochastic descent in the same number of steps.
The bad news is that stopping the momentum in the dimensions where the gradient changes sign doesn't work. Well, it still works better than plain stochastic descent, but not as well as plain FISTA + stochastic. How about not stopping but reducing the momentum, multiplying it by some coefficient less than 1? The result is roughly proportional: 0.1 is close to just stopping, 0.9 is close to plain FISTA, and 0.5 is about halfway in between. Which I think means that the braking definitely messes things up, and that this is the wrong-shaped function for braking. A huge number of dimensions experience the gradient sign flipping on my examples, about half of them, and evidently that's completely fine.
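For concreteness, the braking rule I experimented with looks something like this (a hypothetical sketch: here I'm assuming the flip is detected by comparing the current gradient to the previous one elementwise, and the function name is made up for illustration):

```python
import numpy as np

def brake_momentum(v, g, g_prev, brake=0.5):
    """Sketch of the braking rule: in each dimension where the gradient
    has flipped sign since the previous step, multiply the momentum by
    a coefficient brake < 1. brake=0 stops the momentum in that
    dimension outright; brake=1 leaves it alone (plain FISTA-style
    momentum)."""
    flipped = g * g_prev < 0  # dimensions where the gradient changed sign
    return np.where(flipped, v * brake, v)
```

With brake=0 this is the "just stop it" variant, and the coefficients in between interpolate, which matches the roughly proportional results I saw.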
But maybe the sign flips can still be used to adjust the step size. We'll see. I'll need to analyze further and understand more of what is going on in there.