Tuesday, July 12, 2022

small servers

 Pretty regularly we (the programmers in general) do make the small servers that sit on some random port and execute commands from the clients. With no security. We rely on the obscurity of the randomly (but not really randomly, since we usually choose it explicitly) chosen port, and usually also on the relative privacy of the development machines for security. And ideally, firewalls that forbid connections to random ports from the outside. But what if the machine is not very private (such as the tool gets run on a production machine), and the firewall doesn't forbid the connection? Someone else might connect. Then of course there is the obscurity of the tool's protocol, but if the tool becomes popular, its protocol becomes widely known. And then if the tool can access files, it easily becomes a target for attacks. So a little obscure tool can outgrow its breeches and become a security hole.

But it seems to be not that hard to fix with a helper library that would check the password on connection. It can automatically generate the password and put it into a file, and a client would read it from the file and supply on connection (like .Xauthority does). Maybe there even is already an existing library that does this?

Saturday, July 9, 2022

constants in Perl

When writing Triceps code, I've been whining that it would be great to have symbolic constants in Perl, this would allow to identify the fields of Triceps records by symbolic names that would translate at compile time into integers. Well, it turns out that there are constants in Perl and have been for more than a decade now. They've been there when I've started writing Triceps, just I wasn't aware of them. They're used like this:

use constant X => 1;
use constant Y => 2;

$a[X] = $a[Y] + 1;
They're referenced by names without any prefix, and can be scoped in packages and exported from them like everything else. Duh.

Thursday, May 12, 2022

Jerry Baulier

 I don't often look at Linkedin, and even less often look at their feed (it's extremely boring), but I've opened it today and right at the top of the feed there is the news that Jerry Baulier who was the CTO of Aleri has died on April 29th. Apparently he had retired from SAS last Fall.

Sunday, March 6, 2022

first attempts at training FloatNeuralNet

 After experimenting with the "intelligent design", I've tried out training the same network from a random state to do the same thing, computing the square function. Yes, it's not a typical usage, but it's a surprising example, I think automatically approximating an analog function  is quite impressive.

And it worked. Not every time, depending on the initial random state, but it when it worked, it worked quite decently. Usually worked better with more neurons. My guess is that with more neurons there is a higher chance that some of them would form a good state. Some of them end up kind of dead but if there were more to start with, it's more likely that some useful number of them will still be good.  The ReLU function I've used has this nasty habit of having a zero derivative in the lower half, and if a neuron gets there on all inputs, it becomes dead.

Which brought the question: how do we do a better initialization? Yes, I know that the bad training from a bad random initialization is normal, but I wonder, could it be done better? I've found the article: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/.  Initialization-wise it boils down to initializing the weights between the neurons in range +-1/sqrt(num_of_inputs), and initializing the bias to 0.1. There is also an idea to consider the ReLU derivative in the negative side as a small negative value but I don't want to touch that yet. The square root probably comes from the variance of the standard distribution. The bias of 0.1 "pulls" the value up, so that even if the weights are mostly or all negative, there is still a chance of the neuron producing a positive output that would pass through ReLU. This did improve things, but only slightly, and still pretty regularly I would get a bad training.

The next thing I've noticed was that things tend to go better if I do the first few rounds (where presenting the whole set of input data is a round) of training at a higher rate. The recommendation from the article on backpropagation was to use the training rate of 1%.  I've found that if I do the first few rounds with the rate of 10%, I get fewer bad trainings. I've eventually settled down on the first 50 rounds out of 1000 using the high rate.

BTW, I've been worrying that when the weight gets initialized with a certain sign, it would always stay with the same sign, because of the ReLU having a zero derivative in the lower part. The experiment has shown that this is not the case, the weights do change signs, and even not that rarely. I still need to look at the formulas to understand, what did I miss in them.

What next? It would be good to reclaim the dead neurons and put them to a good use. Suppose we do a round of training with the whole training set. If a neuron never produces a positive value on the whole training set, it's effectively dead, and can be reinitialized by randomization. So I've done this. It does add some overhead but most of it happens infrequently, at most once per round. The only overhead to the normal training is increasing the usage counters. Observing the effects, he weird part is that it's not like "once you get all the neurons into a live state, they stay alive". Yes, the dead neurons tend to be found near the start of training but not necessarily on the consecutive rounds, and once in a while they pop up out of nowhere on a round like 300 or 500. I've added the reclamation of dead neurons after each round, and this made a big improvement. I was even able to shrink the network to the same size as I used in the "intelligent design" version and get the good results most of the time, normally better than from my handmade weights (although not as good as handmade + training). But not all the time.

Looking at the weights in the unsuccessful cases, the problem seems to be that it happens when all the neurons in this small model get initialized too similarly to each other. And then they get pushed by the training in the same direction. This seems to be the reason why the larger training rate on the early rounds made an improvement: it helps the neurons diverge. What could I do to make the neurons more different? Well, they got the fixed bias weights of 0.1. How about changing that, since I've got another way of reclaiming the dead neurons now? I've made the bias weights random, and this reduced the frequency of bad cases but didn't eliminate them.

Another thing commonly seen in the bad trainings seems to be that the weights in level 1 tend to be pushed to a low absolute value. Essentially, the NN tries to approximate the result with almost a constant, and this approximation is not a very good one. This drives the weights in the second layer up  to absolute values greater than 1, to compensate for that constness. How about if I won't let the weights get out of [-1, 1] range, will this push the lower layer to compensate there? It did, the weights there became not so small, but overall the problem of the weights driven towards 0 is not solved yet in my example. In an article I've read in IEEE Spectrum, they re-train the networks to new input data by reinitializing the neurons with low absolute weights. So maybe the same thing can be done in the normal training.

P.S. Forgot to mention, I've also changed the initialization of weights to be not too close to 0. With the upper limit on the absolute value limited by the formula with square root, I've also added the lower absolute limit at 0.1 of that.

Tuesday, March 1, 2022

developing FloatNeuralNet in Triceps

For a while I've wanted to experiment with the neural networks but haven't got around to it. I did some playing with them in a training course a few years ago, but there it was a black box into which you plug the training data, not the same thing as making your own.

Doing it in Perl, as I've previously done with Bayesian examples, would probably be slow. Finally I've realized that Triceps is not that much different from TensorFlow: both allow implementing the data flow machines, just with different processing elements. So nothing really stops me from adding a neural network processing element to Triceps. And it would be a good platform for experimentation, and at the same time potentially useful as production code.

My first idea was to implement the neural network as short (16-bit or even 8-bit) integers, since apparently the people say that it's good enough precision.  In reality it's not so simple. It might work for the computations in a pre-trained network but it doesn't work at all for training. In training the values can easily get out of the [-1, 1] range, and also there is a need for a decent precision: if you update the weights by 1% of the gradient, that quickly gets out of the precision that can be represented in 16 bits. On top of this there is the trouble with representation: what value do you take as 1? If INT16_MAX, it's not a power of 2, so after you multiple two numbers, you can't just shift the values to take its top bits as the result (and no, it's not the top 16 bits of an int32, it would be bits 15 to 30 of int32). Before you do a shift, you have to do another multiplication, by (INT16_MAX + 1)/INT16_MAX. If you don't do it, the values will keep shrinking with consecutive multiplications. Of if you take 0x4000 to denote 1, the shifting becomes easier but you lose one bit of precision. All in all, it's lots of pain and no gain, so after a small amount of experimentation I've abandoned this idea at least for now (the vestiges of the code are still there but likely will be removed in the future), and went with the normal floating-point arithmetic. The integer implementation is one of those ideas that make sense until you actually try to do it.

The code sits in cpp/nn. It's not connected to the Triceps row plumbing yet, just a bare neural net code that works with raw numbers.

The first part I wrote was the computation with existing weights, so I've needed some weights for testing. I've remembered a story that I've read a few years ago. It's author, a scientist, was reviewing a paper by his colleagues that contained a neural network trained to recognize the dependency between two variables. He quickly realized that had his colleagues tried the traditional methods of curve fitting, they'd see that they've trained their neural network to produce a logarithmic curve. So how about I intelligently design a curve? Something simpler than a logarithm, a parabola: y = x2. It's pretty easy to do with the ReLU function to produce a segmented line: just pick a few reference points and make a neuron for each point. I've had this idea for a few years, but It's been interesting to put it into practice.

I've picked 4 points in range [0, 1] to produce 3 segments, with x values: 0, 1/3, 2/3, 1. Then computed the angle coefficients of the segments between these points (they came to 1/3, 1, and 5/3), and went to deduce the weights. The network would be 2-level: the first level would have a neuron per segment, and the second level would add them up. The second level's weights are easy: they're all 1, and bias 0. For the first neuron of level 1, the weight of the single input is equal to the angle coefficient 1/3 and the bias weight is 0. For the second neuron, we have to take the difference between the second and first angle coefficients, because they get added up on level 2, so it's 1 - 1/3 = 2/3. And the bias has to be such that this segment "kicks in" only when the value of x grows over the point's coordinate 1/3. Except that we have to work in terms of y that gets computed by weights, not x. So the bias = x * 2/3 = 1/3 * 2/3 = 2/9. Except that we also have to subtract it to make the lower values negative and thus cut off by ReLU, so bias = -2/9. And for the third point we similarly get the weight 2/3  and bias -4/9. The final formula is:

y = ReLU(ReLU(1/3 * x) + ReLU(2/3 * x - 2/9) + ReLU(2/3 * x - 4/9))

This is not the best approximation, since it all sits above the true function, but close enough:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.033333 (true   0.010000)
  0.2 ->   0.066667 (true   0.040000)
  0.3 ->   0.100000 (true   0.090000)
  0.4 ->   0.177778 (true   0.160000)
  0.5 ->   0.277778 (true   0.250000)
  0.6 ->   0.377778 (true   0.360000)
  0.7 ->   0.500000 (true   0.490000)
  0.8 ->   0.666667 (true   0.640000)
  0.9 ->   0.833333 (true   0.810000)
  1.0 ->   1.000000 (true   1.000000)

I've computed the error on these values from 0 to 1 with step 0.1, and the mean error is 0.0166667, mean squared error 0.019341. Not that bad. Next I implemented the training by backpropagation, as described in https://www.researchgate.net/publication/266396438_A_Gentle_Introduction_to_Backpropagation. So what happens when we run the training over an intelligently designed weights? They improve. After running 100 rounds with the same values 0 to 1 with step 0.1, the results become:

  0.0 ->   0.000000 (true   0.000000)
  0.1 ->   0.013871 (true   0.010000)
  0.2 ->   0.047362 (true   0.040000)
  0.3 ->   0.080853 (true   0.090000)
  0.4 ->   0.156077 (true   0.160000)
  0.5 ->   0.256511 (true   0.250000)
  0.6 ->   0.356945 (true   0.360000)
  0.7 ->   0.484382 (true   0.490000)
  0.8 ->   0.651865 (true   0.640000)
  0.9 ->   0.819347 (true   0.810000)
  1.0 ->   0.986830 (true   1.000000)

The mean error drops to 0.000367414, mean squared error to 0.00770555, a nice improvement. Running the training for 1000 rounds makes the mean error very tiny but the mean squared error doesn't drop much. What happens is that the training had shifted the line segments down where they became mostly symmetrical relative to the true curve, thus the mean positive and negative errors cancelling each other out. But there is only so much precision that you can get with 3 segments, so the mean squared error can't drop much more. In case if you're interested, the formula gets shifted by training to:

ReLU(1.001182 * ReLU(0.334513 * x - 0.015636) + 1.002669 * ReLU(0.667649 * x - 0.225438)  + 1.002676 * ReLU(0.668695 * x - 0.441155) )

The changes in weights are very subtle but they reduce the errors. Which also means that at least in this particular case the 8-bit numbers would not be adequate, and even 16 bits would not be so great, they would work only in cases that have larger errors to start with.

And yes, there are the weights greater than 1! The can moved down to 1 by reducing all the weights in their input neurons but it apparently doesn't happen very well automatically in the backpropagation. It looks like the pressure for values (i.e. gradient) in the backpropagation gets distributed between the layers, and all the layers respond to it by moving in the same direction, some getting out of the [-1, 1] range. Reading on the internets, apparently it's a known problem, and the runaway growth of values can become a real issue in the large networks . The adjustment would be easy enough by rebalancing the weights between the layers but that would probably double the CPU cost of training, if not more. So for the small models, it's probably good enough to just ignore it. Well, except for one consequence: it means that the implementation of ReLU function must be unbounded at the top, passing through any values greater than 0.

Another interesting property of ReLU function is that its derivative being 0 for negative values, means that the values with the wrong sign won't be affected by training. For example, I've also created the negative side of the parabola with the same weights in negative (and for something completely different, connected them to a separate level 2 neuron), and training the positive side doesn't affect the negative side at all.

Another potential issue would apply to any activation function. The computation of the gradient for a particular input includes:

gradient = input[i] * sigma_next[j]

Which means that if the value is 0, the gradient will also be 0, and so the weights won't change. Which in particular means that a neuron that starts with all weights at 0 will produce 0 for any input, and won't ever get out of this state through backpropagation. That's one of the reasons to start with the random weights. I've found the document https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/  that talks about various tricks of initialization, and my guess is that their trick of initializing the bias weights to a small positive value is caused by this issue with zeros. But there might be other workarounds, such as messing with the activation function for gradient computation to make it never produce 0. This is the beauty of having your own code rather than a third-party library: you can experiment with all the details of it. But I haven't messed with this part yet, there are more basic things to implement first.

Tuesday, February 15, 2022

my another book

 I've published my book on car setup for racing:


Not really a computer related subject, but anyway. Unfortunately, it's on the evil Amazon, but unfortunately there isn't a whole lot of other options at  the moment :-(

Monday, December 27, 2021

a book on Perl

 I've been browsing at a used books store, and bought the book "Higher order Perl" by Mark-Jason Dominus.  It's been published in 2003 but is still pretty amazing. It's about doing things that people normally associate with languages like Haskell, but in Perl. But it doesn't stop there. The book goes on to show an implementation of a parser infrastructure somewhat like ANTLR (but as you can imagine, with a lot less code), which is pretty mind-boggling, and then goes on to use it for a declarative drawing system that solves linear equations to determine the positions of the elements as described in a domain-oriented language.

The book can now be downloaded for free: https://hop.perl.plover.com/#free from the web site.

Saturday, December 18, 2021


 It's pretty amazing how the development of AI research is now intertwining with psychology. Not really surprising, since if the current ML models emulate the human thinking, this is what should be happening, but the amazing part is that the development of the AI research has finally reached this point.

The IEEE Computing Edge magazine tends to be rather boring but usually contains one gem per issue. In the November 2021 issue that gem was the history of Ikonas graphics processing (I didn't realize that the raster video started to appear only in mid-1970s, before then the memory costs were prohibitively expensive, and the displays did vector graphics in hardware). In the December 2021 issue that gem is an article about the human emotions "The hourglass model revisited" (https://www.sentic.net/hourglass-model-revisited.pdf, 10.1109/MIS.2020.2992799).

They define the human emotions as a combination of 6 dimensions, 4 of them that can be positive or negative (hence forming a "hourglass", thin in the middle, growing towards extremes): sensitivity, attitude, temper, introspection, and 2 neutral that can be present or not present: expectation and surprise. Additionally they split the attitude into "towards self" and towards others".

The gradations of 4 "main" emotions from positive to negative are defined there as:

Introspection: ecstasy-joy-contentment-melancholy-sadness-grief

Temper: bliss-calmness-serenity-annoyance-anger-rage

Attitude: delight-pleasantness-acceptance-dislike-disgust-loathing

Sensitivity: enthusiasm-eagerness-responsiveess-anxiety-fear-terror

And then they define the compounds, for example "hate =  anger + fear", or "pessimism = expectation + sadness". Though strangely they define despair in the same way as pessimism and not as "expectation + grief", I guess this model has quite a bit of subjectivity.

They also define the absolute strength of an emotion (in range of -1 to +1) as a sum of strengths of its components (or we could say "of its present components") divided by the count of its present (i.e. non-0) components. 

This is an improvement on always dividing by 4, but I wonder if it could be improved by just taking a sum of highest negative and highest positive components. Their formula already shifts the weights in the same direction compared to the previous one, and arguably presence of a little bit of a side emotion would not reduce the effect of the strongest one, and maybe even add to it if both emotions are in the same direction. So maybe it should be for each direction not even




Then I've remembered where I've seen the same problem: in the Bayesian deduction! Just rescale every sentiment from [-1, 1] to [0, 1] and consider it to be a Bayesian probability of an evidence that the sentiment is positive. And then just compose them in the Bayesian deduction, in a classic formula or with formula by odds/chances as described in https://babkin-cep.blogspot.com/2017/02/a-better-explanation-of-bayes-adaboost.html. Then the sentiments that are at 0 would naturally translate to 0.5 and won't affect the computation.

Thursday, December 2, 2021

retraining the neural networks

IEEE Spectrum had printed an issue on AI, and there they've described a way to teach the old neural networks the new tricks without making them forget the old tricks: by marking the more important connections as immutable when training the network on a data set for the new kind of classification.

Which looks like a generalization of the technique of "emergents" (see https://babkin-cep.blogspot.com/2017/06/emergents.html): there everything below the top 2 layers is declared immutable and the top 2 layers are re-trained for the new classification from scratch. The idea being that the first training had taught the network, what details are important, and then re-training assembles the new classifier from these details.

But the generalization goes farther: it can look at the weights of the connections, and if the weight is close to 0 (on the range of [-1, 1]), this can't be an important connection, and is a candidate to re-train, making the model learn the new details too. They say in the article that the old classifier degrades somewhat after this re-training but that's not surprising: the set of the "emergents" has changed during the re-training while the old classifier section is still expecting the old set. It would make sense to fixate the network below the top 2 layers (the "new emergents") and do another training of the old part with that.  The newly learned emergents might even be useful to improve the classification of the previously learned categories. They actually talk in the article about doing a periodic re-training with the previously trained subsets of data but not quite in the same context.

Tuesday, September 7, 2021

"The Practice of Parallel Programming" available as a PDF

It kind of didn't occur to me initially to announce it here, but here we go: I've made the fully edited and typeset version of my book "The Practice of Parallel Programming" available as a PDF. It's exactly the same as the printed version, with a small intro to the Online Edition 1.0 added, and formatted to a page size that is more convenient for printing at home.

Available for download here:


Sunday, August 29, 2021

lexical analysis with regular expressions

Today I've got this simple idea of doing the lexical analysis (i.e. tokenization) in Perl with regular expressions: write all the possible tokes as alternatives in the regular expression, and use Perl's ability to embed code into the expressions to produce the token id. For example:

$s = "++abc"; # string to tokenize
$t = 0; # token ids returned here
while ($s =~ s/^\s*([+][+](?{ $t=1; })|[+](?{ $t=2; })|[a-z]+(?{ $t=3; }))//) {
  print $s, "=", $1, "\n";

This looks for the tokens of "+", "++", and a simple lowercase identifier. One caveat I've found is that if one alternative is a prefix of another (such as "+" and "++"), the longer alternative must go first. But otherwise it looks pretty simple, and should be efficient. I'm probably not the first one to discover it, but I've been looking for the various lexer solutions in Perl and Python and haven't seen this one yet. Of course, it would look better in the extended syntax, and with symbolic token names.

Doing the same in Python is a bit more difficult, since it doesn't allow to execute code as part of the expression. So the parsing would be just to a string, and then matching by a string. Or then looking up the token id by a dictionary (which gets a little tricky if there could be more than one non-constant lexeme, but I guess those could be sorted in the second round if needed). Something like this:

import re
tokens = { "++": 1, "+": 2, } # regexp can be auto-generated from reverse-ordered dict keys
lex = re.compile(r'^\s*(' + r'[+][+]' + r'|[+]' + r'|[a-z]+' + r')') # the string to parse
s = "++abc"
m = lex.match(s) # 3 is the token id for the identifiers that are variable
t = tokens[m.group(1)] if m.group(1) in tokens else 3 # consume the token from the string
s = s[m.span()[1]:]

It doesn't do the loop but goes through all the important steps.

P.S. In Perl, the replacement could be avoiding by using the global scanning instead, scan with option /g, and use \G as the anchor for the last end of string, new start of string:

while ($s =~ /\G\s*([+][+](?{ $t=1; })|[+](?{ $t=2; })|[a-z]+(?{ $t=3; }))/g)

Thursday, June 17, 2021

on Rust

 I've attended a talk on the Rust programming language, and I've had a couple of realizations.

1. The new and interesting thing in Rust is  its enforcement of "borrowing" vs "consumption" in the references, and the separation of the code that strictly follows the straight-jacketed "safe" code from the "unsafe" code.  The rest of it seems to be substandard compared to C++.

I think that most real code will end up having both the "safe" and "unsafe" parts, because the purely "safe" version is conductive only to writing the COBOL-like programs. But the separation forces you to encapsulate the unsafe parts, keep them separate and small, and then the rest of the code would just rely on them instead of spreading unsafety all over the place.

But can the same concept be imported into C++? I think it can. There actually already are the syntactic means to express the borrowing and consumption in C++. Borrowing is "const &", and consumption is "&&".  So all we need is a way to ban the plain "&" and "*" in the parts of the code that we decree "safe".

2. A thing that looks annoying in Rust is its lack of classes, it has traits instead. In this respect it's similar to Go (just as annoying) and the C++ templates (in an implicit way, and with its own even worse challenges). It obviously works but  in an annoying way.

So what is annoying? That the definitions of Rust functions end up containing a list of traits for each argument: Trait1 + Trait2 + Trait3. The same applies to Go. C++ is actually less annoying because it doesn't even require you to define the traits as such, it just uses them in templates. It pays for that by very non-obvious error messages when a trait is missing, but at least you don't have to drag this list around.

But what is a class (or an interface)? An interface is a list of methods. And each method is also a trait. So when we define an interface, what we really do is write

InterfaceA = TraitA1 + TraitA2 + TraitA3

If we could do all this "trait arithmetic", all the annoyance would go away. I wonder if Rust already supports something like that, just it wasn't mentioned in the talk?

Monday, March 22, 2021

on automatic alerting

 A few years ago I've read the book "Thinking fast and slow" by Daniel Kahneman. One of the things that stuck the chord fo rme thre was the story of how the smaller counties have the cases of both higher and lower occurrence of the diseases than the larger counties. This is very much the same as what we see with the automatic alerting: when we set up an alert for, say, request processing latency, and at night there is only one request during a time period that has an anomalously high latency and triggers an alert (and we solve this with the cludges like "don't alert if there are less than 20 points").

It looked like the number of points needs to be taken in the account, with the fewer points to be treated in a more relaxed way. But I couldn't figure out how to put it together.

Recently I've been reading some books on the probablity, and it looks like the solution happens to be well-researched: we have to treat the number of points as a sample from the full larger number of points. So if we get 1000 hits per a unit of time during the day and only 50 hits during the nighttime, we can really see these 50 hits as a sample of the full traffic, with the people sleeping serving as a filter. And then instead of looking at "is the median over N" we should look at a differnt question: given the sample, what is the probability (i.e. confidence range) that the median of the full set of data would have been over N? That should solve the issue with the false alerts pretty nicely in an automatic way. I wonder, why nobody I've heard of had laready done this?

Friday, December 25, 2020

time for more analog computers?

In case if you haven't heard about them, the analog computers were unlike the digital computers. They were the electrical models assmebled to simulate the other real-world events. For example, if you wanted to model a car suspension, you'd have an oscillator circuit to simulate a spring, and a resistor to simulate a shock absorber. They were purpose-assembled for every specific application.

Well, now we have another thing that purports to simulate the real-world systems: the units in the neural networks that are supposed to be like the real neurons. The race to cram more "neurons" into the neural networks has led to the values in the "neurons" to be represented as very short 8-bit fixed-point numbers. And the models still work. Which means that the "neurons" are quite well at tolerating the small imprecisions in the data, which is the typical issue with the analog signals. But in return the analog signals are much more resistant to large disruptions: i.e. if you change one single highest bit in a digital representation of a number, this small disruption changes the value of a number by half its whole range, something that doesn't happen with the analog signals. There already are ideas of improving the reliability of computations by making them more resistant to large errors by accepting the small errors (I think I've read about them in IEEE's Spectrum).

So, the next logical step: why not make each "neuron" in a neural network an analog machine? Instead of 8 bits, have one analog signal, and do all the multiplication and addition on the analog signals. It looks like the perfect match, a combination of these two ideas. Has anyone already done that? If not, why not? Is there a way to do that? 

P.S. It turns out, people are already doing this. I've recently read in Spectrum that people are using photonic computations for the machine learning, and that the photonic computations are limited to this niche because of their limited precision.  I can't really think of a reason for the limited precision other than these computations being analog.

P.P.S. It's 2022 now, slightly more than a year after the original post, and IEEE Spectrum is reporting that IBM is working on a way to use the memory (like Flash, but other similar types too) as an analog summator, doing the "neuron" processing in a memory cell.

Sunday, September 27, 2020

a book on probability

I've stumbled upon the book "Introduction to probability" by Anderson, Seppalainen, and Valko. It's a college textbook. 

Right in the second chapter it does a very good introduction into the Bayesian formula, deriving it from the descriptions of weights of events, in the same way as I struggled to reinvent on my own. And right there it also provides the "General version of Bayes' formula" for the enumerated events, that took me so much time to reinvent on my own. Well, another case of "everything is invented before us" :-) If only we knew, what books to read at the right time :-) On the other hand, I'm not sure if I would have recognized the importance of that formula if I didn't have a use for it in mind first.

In case if you wonder, here is the formula: 

If B1...Bn are events that partition the sample space, then for any event A we have:

P(A) = sum from i=1 to n of (P(A & Bi)) = sum from i=1 to n of (P(A|Bi)*P(Bi))


P(Bk|A) = P(A & Bk) / P(A) = P(A|Bk)*P(Bk) / (sum from i=1 to n of (P(A|Bi)*P(Bi)))

Wednesday, September 23, 2020

cheap malloc-ed stacks

I've accidentally realized that the disjointed stacks (ones composed of the separately allocated chunks) can be pretty inexpensive if they get allocated in fixed pages, and each stack frame is guaranteed to be less than a page (which is fairly common in, say, the OS kernel). And the stack pages can be made larger than the physical pages.

All that the function needs to do at the start of the call is check that there is enough room to the end of the page. Which is easy enough by checking that (SP & PAGE_SIZE) < (PAGE_SIZE-FRAME_SIZE). Well, taking a page from the RISC book, we'd also want to reserve some room at the end of the page for the leaf functions that could then skip the whole check, but it doesn't change the general approach.

If a new page needs to be allocated, it can be taken from a pool of free stack pages (a linked list), and would be linked to the previous page, then the frame created in it (copying the old stack pointer onto the new stack, as in the "classic" Intel x86 sequence of "push BP" that saves the pointer to the previous stack frame). The next page can also be found already linked to the current page from the last time, and then the allocation can be skipped. The function return then doesn't need to do anything special, just "pop SP" and return. And then if the thread blocks, the scheduler can return the unused pages beyond the last active frame back to the pool.

There obviously is some overhead but not a real huge amount of it.

Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, accumulating a whole lot of records in the queue of the looping link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing at real time and be able to reproduce it exactly in an accelerated playback.

2. Process records in accelerated time to start with, but as in real time don't care too much about the exact outcome on the first run as long as it's within the reasonable constraints. (But be able to replay it again in exactly the same way).

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case(1) can be reasonably resolved by treating the looping links like the inputs form the external systems where we don't particularly care about the exact synchronization: re-timestamping the incoming records, logging them, and then processing with the new timestamp. On replay, read the logged records with their timestamps, and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records form the link and compare them with the records in the log (including the original timestamp). If they match, all is well, the new timestamp from the log can be used. If they diverge then something somewhere has changed and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records, that get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit, how far can the records be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large, let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next, and since the records from the looping links are processed first, this in effect means "go just before that record").

           | A |
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially compose the looping link from a blocking queue with the usual limit (and special logic) followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records from the blocking queue as having the timestamps behind their original ones by L, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to go deadlocked. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full. Which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

           | A |
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

           | A |
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

This scheduler's notion of time would be driven by D but it would generate the events for B. With a special limitation that the events must be scheduled by at least L into the future. Any lower delays would be rounded up to L.

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be to the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And to think of it, the schedulers should probably be given the lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.

Tuesday, May 5, 2020

consistent time

I've done some more thinking on the issues of consistent time in Triceps models, and came up with a design. It's not a final design, more of design notes, so that I won't forget them until I get to an implementation. But I think it's quite interesting.

Let's start with the models that are the unidirectional graphs, without any loops, they are much easier to reason about.

The basic premise is that the records need to be timestamped when they enter the model, and then processed in the order of these timestamps. The time issues tend to be highlighted in the diamond-shaped graphs, so let's start with one:

           | A |
           /   \
          /     \
          V      V
        +---+  +---+
        | B |  | C |
        +---+  +---+
           \    /
            \  /
             V V
           |  D  |

An input record arrives, gets stamped, goes to the node (i.e. Triead, a triceps thread, but let's talk in more generic terms for now) A, where it can cause the records to be sent to nodes B and/or C. The records produced by A would have the same timestamp as the incoming record. B and C in turn may produce their own records (which again would get the same timestamp) and send them to D.

When two records with the same timestamp arrive from the same node, they have the natural queuing order, and they can be predictably processed in that order. But what if they arrive from different nodes, say B and C? The result of processing might depend on the order, which basically means that we have to define priorities between all the inputs. Here D has the inputs from B and C: one of them (say, B) would get a higher priority, and its records will be processed before the records with the same timestamp from the other input (C).

Well, that's easy to say, but what should D do when it receives a record with timestamp T from C, and nothing yet from B. For how long should it wait for something from B before it decides that it can process the record from C?

One way is to always send the timestamp metadata along each path even if there are no actual records. But this looks like a lot of unnecessary overhead. And it also has another potential issue: suppose we have two external inputs that get timestamped. The input 1 gets a lot of records for a period of time, the input 2 gets none. How often should the input 2 send its timestamp metadata even if it sends no data records? Well, maybe there is another solution.

Another way to resolve it would be for D to ask B, do you ever plan to send me anything with timestamp T or earlier? If not then D can safely go and process the record from C. If yes then D would have to wait for a notification from B. So the request from D to B can be asynchronous. If B is ready, it will send a timestamp metadata as a callback to D right away, if not then it will send either timestamped records or a bare timestamp later.

But B might not be able to answer this directly. It might have to consult A with the same question first. And A would have to consult the timestamper on its input. If A has two inputs, it would consult both timestampers.

Well, these callbacks would be very similar to just sending the timestamp metadata records all the time, and here would be additional delays involved with the requests upstream. But there are positive things too:

* It tells us, how often the inactive timestamper would get queried and send its metadata: more or less, for every record that comes in through any timestamper.

* In reality, not every node in the model would get input for every incoming record. If none of the nodes at some level produce records, the propagation would stop there and won't go farther downstream. And when the requests are sent upstream, they also would reach only to the timestampers that could send input to this node.

* The requests can be of two types: that the time T has been reached and that the time T has been passed. Because the node D prefers B over C, when gets a record from C with timestamp T, it would have to make sure that B is past T before processing that record. But if it gets a record from B with timestamp T, knowing that C is at T (and not yet past it) would be enough to start processing that record.

* The requests can be amortized. I.e. if the node D has records with timestamps T1 and T2 queued from node C, it can ask B once to send the timestamp updates until it reaches past T2. In the same way, if D asks B whether it's past T2 and B knows that it's already past a later timestamp T3, it can send the notification about T3 right away, and D will be able to use this knowledge later.

* Finally, this communication doesn't have to go through the regular queues. The more efficient data structures can be used to propagate the timestamps between the nodes. The timestamp information is cumulative: once B has notified that it had processed the input up to the timestamp T3, there is no point in keeping the information about it processing the timestamps T1 and T2, they're implied in T3.

* But if the communication goes across the network, the round trip time can become prohibitively expensive for the upstream requests. So the downstream timestamps should be sent on their own, although the cumulative approach would still apply. Of course, if there are multiple streams sent over the network between two systems, all these streams should just be sent across using the same sequencer, so there would be no issue with synchronization between them. But there would still be the synchronization issue between multiple external systems, and between them and the local scheduled events. Yet another possible solution for that would be to just restamp them with the local time if the relative ordering doesn't matter.

Well, let's move on to the next problem. When can a node say that it's past a particular timestamp? It depends on the clock granularity. As long as the clock is at T, there might be more records coming in, so it can't say that it's past T yet. So if the clock granularity is 1 millisecond, we'd be stuck waiting for that millisecond to end. What we need is a clock with infinite granularity. It can be simulated by keeping a single global record counter, and building the timestamp as a pair (time, count). The counter would be like the lower bits of the timestamp. This would also allow to amortize the calls to read the clock: if we have 100 records queued up at the timestamper, we can just increase the global counter by 100, and assign each record the same time and the next count in the allocated sequence.

Why not forget the time at all and move on to just the counter? We'd need time to be able to replay the records later at an accelerated pace, and still handle the time-based events.

Basically, the timed scheduler would run as a timestamper. And yes, it would also have to use the same clock, and do the same (time, count) pairs allocated from that clock. If D has some time-based events, the diagram would become:

        timestamper <----------------------------- clock
             |                                       |
             V                                       |
           +---+                                     |
           | A |                                     |
           +---+                                     |
           /   \                                     |
          /     \                                    |
          V      V                                   |
        +---+  +---+                                 |
        | B |  | C |                                 |
        +---+  +---+    scheduler/timestamper <------+
           \    /        +
            \  //+-------+
             V VV
           |  D  |

This would mean that when D gets a record from B, it would have to check the time not only in C but also in the timestamper. The timestamper can also do the amortization by telling the time of its next scheduled event. But that's a bit tricky: first, there would have to be some count part of the timestamp associated with that future time. There are two ways to go about it: either first tell the time without the count part (essentially, with it at 0), or allocate the count part in advance, when the event gets scheduled, and then just use that count when the timer fires. Second, it might happen that the processing of an incoming record would schedule a new, earlier event. But that's not a problem because the events can't be scheduled in the past. Since the scheduler would run as a part of D and synchronous with it, it could update its estimation and move the next event forward.

It would also have an interesting interaction with the replay: if we want to replay a timestamped log of incoming events and get the model to behave in exactly the same way as before, the scheduler events would have to happen at exactly the same times too. Which means them getting the exact same count part of the timestamp, and that normally won't be stable due to the races for the clock. But it can be resolved by recording the exact time of all the scheduled events into the same log and reusing it during the replay, the schedulers receiving their timestamps not from the clock but from the log, same as the incoming records.

What if we want to change the model between the reruns? Well, in this case we can discard the log for the schedulers, and just get the count values for them form fresh, and write the values into the new log. The running sequence would be slightly different than the first time, but since the model has changed, it wouldn't matter. Or we could even reuse the part of the log that is still applicable, merge it with the events from the new schedulers, and write into the new log. Either way, once the new log gets written, it can be reused again to produce the exact same result on the next replay.

Another interesting thing about the replay, is how do we transition from a replay to the live performance? If we have a recorded log from yesterday, want to replay it and then continue with today's data, what do we do with all the scheduled events that would have fired overnight? This basically suggests that we need to have some kind of "time reset events" that would be used when the time gets moved abruptly. It would allow the application logic to reset properly all the events that have been scheduled for the meantime- either just cancel them or execute them once, depending on the application semantics.

I think this should work quite well for the graphs without loops. The loops add a good deal of trouble. More on them later.

Sunday, May 3, 2020

TLB coherence

The discussion described in the last post got me thinking, why don't we have a hardware consistency for the page address translation cache (TLB), done through the same bus transactions as the usual cache snooping? In a multi-threaded environment, invalidating the TLB across all the threads is a pain that requires the cross-CPU interrupts.

In the simplest case the CPU that drives a TLB invalidation could just do a bus transaction that specifies a virtual address, and every CPU would invalidate its TLB entry that matches this address. If the user processes use the address space randomization, there would be few conflicts between processes, and the kernel address space is common for everyone. In a more complicated way, the virtual address can be accompanied by the process or memory protection id, so that only the pages of the CPUs running that process would be invalidated. And finally, each page translation entry has a unique identifier: the physical memory address where the CPU had read it from. If the TLB keeps this address as a tag (keeping enough bits to identify the page of the page directory is enough, since the rest of bits are determined by the virtual address), it can be used as a tag in the TLB invalidation transaction, and only the exact TLB entrues matching this tag would be invalidated.

Which made me wonder if anyone else had thought of this before. And after a bit of searching, it has turned out that they did: https://arxiv.org/pdf/1701.07517.pdf. These guys have approached the problem from the hypervisor standpoint, but the implementation is exactly the same, using the physical memory address of the entry as a tag. There is no date on the article, but judging by the citation list, it could not have been written earlier than 2016.

Sunday, April 19, 2020

statistical and password-based memory protection

I've recently had an interesting online conversation with Terry Lambert (whom I know from the FreeBSD times) about the memory protection that resulted in I think a new and interesting idea, that I want to write down here.

Terry was talking about the statistical memory protection. I'm not sure if he invented it, or at least the term, since I can't find it anywhere else. Terry gives the related links but none of them are exactly about that:

The statistical memory protection works by having a large common address space where the protection is achieved by allocating memory at random addresses, so the address itself serves as a password. Plus the same physical page can be mapped at different addresses with different permissions, so knowing one address would give the write access while another one the read-only access.

As far as I see it, the major benefit of this system would be in the ability to pass the complex data structures linked by pointers between processes. With the address map being common, the pointers would work in any process. This gives cheap and convenient "mailboxes" for exchanging messages between processes.

This would have multiple potential problems too.

One is the “password leakage” that is also mentioned in one of the papers. Once a password gets out, it can be saved and passed around. With the separate addresses for RO and RW access to the same page this could get pretty bad:  if one of these pointers in an RO page accidentally refers to the RW address, the permission leaks. And actually the whole passing of pointer-connected structures breaks down because the writer and reader would have to use different pointers (with RO and RW addresses). 

Another issue, the memory segments would have to be allocated at the globally unique random addresses, which would have to be a synchronized operation. Considering that the segments have to be spaced far enough form each other, that would consume a sizable number of bits and degrade security of the passwords. 

This would also be susceptible to the brute-force attacks: maybe the attack won’t enumerate the whole address space but it can start with a random point and just see what can be found there. It won’t always be lucky, but sometimes it would. Another way of the attack would be to stick incorrect pointers into the shared pages, hoping that the other process reading the pointer from that page would cause some damage.

The brute-force attacks can be frustrated by keeping a fault counter, and killing the process that has too many faults, but it's harder than it seems at first. The fault counters would have to have a scope greater than a process. Because if you kill a process, nothing stops the user from spawning another one. Also nothing would stop the user from spawning a lot of processes, each doing fewer probes than the fault limit and exiting. But on the other hand, with a greater scope, a malicious application would be able to screw the user completely, so drawing this scope boundary won’t be easy. A possible solution for that problem is to introduce a delay in process's scheduling after a fault, and these delays can even be made to increase if user's global rate is high. This would dramatically slow down the attacks without affecting the correctly running code.

There is one more reason for why the fault counting would have issues: if we allow to transfer the pointers through the global memory, there is no way to know whose fault is the faulty pointer, did the current process get it from someone else?

I've come up with an idea to improve on the statistical memory protection, let's call it password-based memory protection: the same can be achieved in a better and cheaper way by associating a long (say, 128 bit) “password” with a page, effectively making the addresses in the “password-protected” pages 192-bit long (64 bit for the address and 128-bit password). There can be separate passwords for reading-only and reading-writing a page. Possibly execute-only too. But the passwords can be kept separate in some context, so the “short” 64-bit part of the address would be the same for the reader and writer. The context can be inherited from the page: when we load a pointer from a page, we set in the CPU register the password part of it equal to the password of the page. This would also allow to catch the malicious addresses in the shared pages by mismatch of the password. And the whole address space can be split into the private and shared parts, with only the shared part using the passwords. This would allow the private segments to allocate pages without getting a global lock on the address space. And the whole 128-bit space can be used for the password.

It would still be randomly susceptible to the brute-force attacks, as well as to the password leakage.

Some progress can be made on the fault-blaming through. If only the pointers without passwords cold be transferred through the password-protected memory then when we read such a pointer, we can implicitly add the password of the protected page to it and immediately verify that the pointer is valid, that this password also matches the page referred by the pointer. If it doesn't, it means that someone tried to slip in an invalid pointer. If our process has read-only access to the page, we can even say for sure that if was another process's fault and not penalize this one (unless of course it mapped the same page twice, and the other mapping is read-write).

Let's consider in more detail how the pointers would work. For the pointers-with-passwords let's use the "far" keyword from the days of MS-DOS. The common (AKA "near") pointers without password would be just pointers without a special keyword.

int *pn; // a common pointer
int far *pf; // a far pointer that includes the password

They generally can't be cast to each other, a far pointer has to be explicitly converted from a near pointer and a password:

pf = make_far(pn, password);

Typically we would use the common pointers in the structures that get passed through the password-protected memory, to store only the address but not password:

struct data {
  int *p1;
  int *p2;

This would also allow to use the same structure in the normal and password-protected memory.

But when  a common pointer gets accessed through a far pointer, it gets treated in a special way: essentially, it becomes implicitly converted to a far pointer.

data far *dpf;
data *dpn; 

pn = dpf->p1; // incorrect, won't compile
dpn->p1 = dpf->p1; // incorrect, won't compile
pf = dpf->p1; // correct, implicitly copies the password 
  // from dpf to pf and verifies that pf is valid, throws an
  // exception if invalid
dpf->p1 = pn; // incorrect, won't compile
dpf->p1 = dpn->p1; // incorrect, won't compile
dpf->p1 = pf; // correct, verifies that dpf and pf have
  // the same password (throws an exception if the
  // passwords differ), and drops the password when
  // writing into p1
dpf->p2 = dpf->p1; // correct

It would actually be more correct to say that the reference to a pointer accessed through a far pointer turns implicitly into a reference to a far pointer, and handles the conversion between the kinds of pointers.

This of course wouldn't stop someone from copying pointers as data to or from the password-protected memory via memcpy (though it would of course have to be a special version of memcpy that accepts the far pointers as arguments), but that would be their own problem of forfeiting the hardware help in the protection.

On the assembly level, this can be done with the addition of the password registers in the CPU along with the instructions for accessing the password-protected pages. Something like this:

  ; load the address part of the far pointer
  load (r1), r2
  ; load the password part of the far pointer into a password register
  loadpwd (r1+16), pr1
  ; read a value at a password-protected  far pointer
  loadpp r2, pr1, r3
  ; assuming that the read value was a pointer, verify that it was correct,
  ; it must be pointing to a page that uses the same password from pr1
  verifyp r3, pr1, failure_label
  ; or maybe read and verify in one instruction
  loadpv r2, pr1, r3, failure_label
  ; store a far pointer
  store r3, (r5)
  storepwd pr1, (r5+16)

The next development to prevent the brute-force attacks and password leaks is to have the OS to track on the side, access to which segments is granted to the process. 

Then the problem of the leaked passwords can be solved by changing them periodically. Instead of keeping the pair (password, address) in the far pointer, place the all the passwords for all the granted segments into a canonical location in the process. Then represent the far pointer as  the pair (address of the password, address of the data). Whenever a password needs to be changed, change it in the canonical location (of which the OS would keep track). And when an access exception happens due to a stale password in the CPU register, the kernel can check the access rights of the process against the canonical list. If the rights are present, it can update the password in the register and return to retry the access. 

But a consequence of keeping this special section is that the passwords themselves can be stored in a protected section of memory, otherwise inaccessible in the user mode. The password registers in the CPU would have two parts: the address of the password in the special section that can be loaded and stored by the process and the shadow value of the password itself, invisible to the process and used to check the access. Then the users won’t be able to brute-force the passwords at all because they won't be ever able to set the arbitrary passwords, all the password management would happen in the OS. 

Then if the passwords are not user-visible, they don’t have to be long any more, 64 bits, or maybe even 32 bits would suffice as long as they are all unique. Another consequence would be that changing the password would not be needed most of the time, its only use would be to revoke someone’s access.

But wait, there is more. If we don't have the special hardware available, instead of checking on every access we can check when mapping the page into the address space of the process. It’s just that the shared pages would be mapped at the same virtual address in every process. So we’ve made a full circle back to the existing solution here. It feels like not disturbing the page table would be more efficient, but actually not: the segment would have to be allocated first anyway before making it available to any process, and if the page map is shared between all the processes, that means potentially disturbing all the processes, not only the ones that will use that segment. 

The only other part of the implementation would be the verification that the pointers received in the shared pages are safe, that they point to the pages of he same source. This can be done in user space, allocating a “page id map”, where each page address is direct-mapped to the “source id” of this page (say, 4 bytes each). Look up the ids of the page that contained the pointer and of the pointed page, and verify that the ids are the same. The map itself can be sparse, the unused parts mapped to an all-0 page, so it would be cheap. Some kernel help might be useful for handling the sparseness and making sure that the page id map is consistent with the page table. The compiler support with the "far pointers" would also still be useful for the automated verification, even with the user space solution.