Wednesday, October 7, 2015

Bayes 4: the fuzzy logic

Before moving on to the other aspects, I want to wrap up with the basic formula, this time in its fuzzy-logic version. As I've already mentioned, sometimes it's hard to tell if an event is true or not. It would be nice to have some way to express the idea of the event being very likely, slightly likely, slightly unlikely, or very unlikely. It would be even nicer to represent our confidence in the event happening as a continuous range, with 1 meaning "the event is definitely true", 0 meaning "the event is definitely false", 0.5 meaning "we can't tell at all if the event is true or false", and the points in between representing the values in between. Let's use the letter C to denote this confidence, or C(E) to show which event's confidence we're talking about.

There is a way to generalize the Bayes formula for this confidence range. Way back when, I read it in that British book without explanations and marveled at how people come up with these formulas, but more recently I was able to re-create it from scratch from logical considerations. So I'll show you where this marvelous formula comes from (and a little later I'll show you where the Bayes formula itself comes from too).

First, let's look at the boundary conditions. Kind of obviously, with C(E)=1 the formula must give the same result as the basic Bayes formula with E being true, with C(E)=0 it must give the same result as the basic formula with E being false, and with C(E)=0.5 it must leave P(H) unchanged.

Let's look at the basic formulas again:

P(H|E) = P(H) * ( P(E|H) / P(E) )
P(H|~E) = P(H) * ( P(~E|H) / P(~E) )

If we substitute

P(~E|H) = 1 - P(E|H)
P(~E) = 1 - P(E)

then the formulas become:

P(H|E) = P(H) * ( P(E|H) / P(E) )
P(H|~E) = P(H) * ( (1-P(E|H)) / (1-P(E)) )

In both cases we multiply the same P(H) by some coefficient. The coefficient in both cases is computed by dividing something done with P(E|H) by the same thing done with P(E). For symmetry, we can write P(E|H) in the first formula as 0+P(E|H) and P(E) as 0+P(E); then the formulas become even more alike:

P(H|E)  = P(H) * ( (0+P(E|H)) / (0+P(E)) )
P(H|~E) = P(H) * ( (1-P(E|H)) / (1-P(E)) )

If we represent the points 0+P(E|H) and 1-P(E|H) graphically on a line segment, we can see that in both cases they are located at the distance P(E|H) from an end, only one is to the right of 0 and the other is to the left of 1:

     0   P(E|H)         1-P(E|H)  1

And the same works for P(E). This way it looks even more symmetrical.

The natural follow-up from here is that we can use the confidence value C(E) to split the segments [ 0+P(E|H), 1-P(E|H) ] and [ 0+P(E), 1-P(E) ] proportionally. For example, if C(E) = 0.75, it will look like this:

           P(0.75 E|H)
     0   P(E|H) |       1-P(E|H)  1
             0.25   0.75

The (0+P(E|H)) gets taken with the weight C(E), the (1-P(E|H)) gets taken with the weight 1-C(E), and they get added together. If C(E) is 0.75, the split point will be located closer to (0+P(E|H)), dividing the original range at its quarter-point.

In the formula form this can be expressed as:

P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )
 / ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )

Let's check that this formula conforms to the boundary conditions formulated above:

P(H|C(E)=1) = P(H) * ( 1*(0+P(E|H)) + (1-1)*(1-P(E|H)) )
 / ( 1*(0+P(E)) + (1-1)*(1-P(E)) )
 = P(H) * P(E|H) / P(E)

P(H|C(E)=0) = P(H) * ( 0*(0+P(E|H)) + (1-0)*(1-P(E|H)) )
 / ( 0*(0+P(E)) + (1-0)*(1-P(E)) )
 = P(H) * (1-P(E|H)) / (1-P(E))

P(H|C(E)=0.5) = P(H) * ( 0.5*(0+P(E|H)) + (1-0.5)*(1-P(E|H)) )
 / ( 0.5*(0+P(E)) + (1-0.5)*(1-P(E)) )
 = P(H) * ( 0.5*P(E|H) + 0.5*(1-P(E|H)) )
 / ( 0.5*P(E) + 0.5*(1-P(E)) )
 = P(H) * ( 0.5*P(E|H) + 0.5 - 0.5*P(E|H) )
 / ( 0.5*P(E) + 0.5 - 0.5*P(E) )
 = P(H) * 0.5 / 0.5
 = P(H)

They all match. So this is the right formula, even if a bit complicated. We can make it a bit simpler by splitting it into two formulas: one for the function doing the proportional split, and another computing the probability with the use of that function:

Pc(P, C) = P*C + (1-P)*(1-C)
P(H|C(E)) = P(H) * Pc(P(E|H), C(E))/Pc(P(E), C(E))
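In code form, these two formulas are a couple of one-liners. Here is a minimal C++ sketch (the function names pc and fuzzyBayes are my own, not from any library):

```cpp
// Pc(P, C): split the segment [0+P, 1-P] proportionally to the
// confidence C: P taken with the weight C, (1-P) with the weight (1-C).
double pc(double p, double c)
{
    return p * c + (1. - p) * (1. - c);
}

// P(H|C(E)) = P(H) * Pc(P(E|H), C(E)) / Pc(P(E), C(E))
double fuzzyBayes(double ph, double peh, double pe, double ce)
{
    return ph * pc(peh, ce) / pc(pe, ce);
}
```

With ce=1 this reduces to the plain Bayes formula, with ce=0 to the negative-event formula, and with ce=0.5 it returns ph unchanged, matching the boundary conditions above.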

Now you might complain: "But I don't like the shape of this dependency, I feel that the confidence should be represented in a more non-linear way." Well, if it's too linear for you, you don't have to use it directly. You can define your own confidence function Cmy(E), map its values into C(E) in any non-linear form you like, and then use the resulting C(E) values for the computation.

In fact, I like a variation of that trick myself: limiting the values of C(E) to something like 0.00001 and 0.99999 instead of 0 and 1. It helps the model recover from situations that it would otherwise consider impossible, and provides a way to handle the problem of overfitting. But more on that later.

Tuesday, October 6, 2015

Bayes 3: the magic formula

The next thing is the conditional probability. It's written as P(H|E) or P(E|H).

P(H|E) means the probability of the hypothesis H being true if we know that the event E is true.

P(E|H) means the probability of the event E being true if the hypothesis H turns out to be true.

And now finally the magic formula discovered by Bayes:

P(H|E) = P(H) * P(E|H) / P(E)

It tells us how the probability of a hypothesis gets adjusted when we learn some related data. I haven't watched "House, MD" closely enough to give an example from the medical area, so I'll give one about the car repair.

Suppose our hypothesis is "the battery is discharged". One way to test it is to turn on the headlights (with the engine not running) and see if they come on at full brightness; them coming on brightly would be our event.

Suppose our original ("prior") probability P(H) was 0.1. And suppose the probability P(E) of the headlights coming on brightly was 0.7. You might think that no way would the headlights shine brightly with a discharged battery, but the headlights consume a lot less current than the starter, and the battery might be capable of providing the small amount of current without too much of a voltage drop, so the headlights might still look decently bright to the eye. So let's say that P(E|H), the probability of the headlights shining brightly with a discharged battery, is 0.05.

Where do all these numbers come from? Right now I've pulled them out of my imagination. In a real expert system they would come from two sources: the training data and the computation on the previous events. Some people might say that the numbers might come from the expert opinions, but spit those people in the eye: getting good probabilities without a good set of training data is pretty much impossible. We'll get to the details of the training data a little later; for now, let's continue with the example.

So we've tested the headlights, found them shining bright, and now can plug the numbers into the formula:

P(H|E) = 0.1 * 0.05 / 0.7 = 0.00714

As you can see, the probability of this hypothesis took a steep plunge. Why did it happen? Basically, the effect depends on the difference between how probable this event is in general (P(E)) and how probable it is if this hypothesis is true (P(E|H)). The formula can be rearranged to emphasise this:

P(H|E) = P(H) * ( P(E|H) / P(E) )

If P(E|H) is greater than P(E), the probability of the hypothesis gets pulled up. If it's less, the probability gets pulled down.

What if we try the experiment and the result is false? Obviously, we can't just use this formula, since E is not true. But then the event ~E will be true, and we should instead use the similar formula for ~E and also adjust the probability of the hypothesis:

P(H|~E) = P(H) * P(~E|H) / P(~E)

Where do P(~E|H) and P(~E) come from? Again, from the training data, and they are connected with the direct values in the obvious way:

P(~E|H) = 1 - P(E|H)
P(~E) = 1 - P(E)

A lot of examples of Bayesian logic I've found on the Internet don't adjust the probabilities when an event is found to be false, and this is very, very wrong. Basically, they treat the situation "the event is known to be false" the same as "the event is unknown". If you have a spam filter and know that a message doesn't contain the word "viagra", you shouldn't treat it the same way as if you don't know whether this word is present. The result will be that you'll be driving the probability of the message being spam way high on seeing any of the suspicious words. In case you wonder whether ignoring the negative events is sheer stupidity, the answer is no, it isn't: there are reasons why people do this, they are trying to compensate for the other deficiencies of the Bayesian method. Only a rare spam message indeed would contain the whole set of suspicious words, so if you properly test each word and apply it as positive or negative, you end up with pretty low probability values. I'll talk more about it later, although I don't have the perfect solutions either.

Continuing the example, suppose the headlights didn't come on brightly, and we have the probabilities

P(~E|H) = 1 - 0.05 = 0.95
P(~E) = 1 - 0.7 = 0.3

Then we can substitute them into the formula:

P(H|~E) = 0.1 * 0.95 / 0.3 = 0.31667

The hypothesis of a dead battery took quite a boost!

Now let me show you something else. Suppose the prior probability P(H) was not 0.1 but 0.5. We put the data with this single adjustment into the same formula and get:

P(H|~E) = 0.5 * 0.95 / 0.3 = 1.58333

Shouldn't the probability be in the range between 0 and 1? It sure should. If you get a probability value outside of this range, it means that something is wrong with your input data. Either you've used some expert estimations of probabilities and these experts turned out to be not any good (that's why using a proper set of training data is important), or you forgot to adjust the probabilities for the previous computations you've done. Or maybe the rounding errors from the previous computations have accumulated to become greater than the actual values.
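The arithmetic of this example is easy to check in code. Here is a minimal C++ sketch of the two update formulas (the names bayesPos and bayesNeg are my own):

```cpp
// The positive-event update: P(H|E) = P(H) * P(E|H) / P(E).
double bayesPos(double ph, double peh, double pe)
{
    return ph * peh / pe;
}

// The negative-event update: P(H|~E) = P(H) * (1-P(E|H)) / (1-P(E)).
double bayesNeg(double ph, double peh, double pe)
{
    return ph * (1. - peh) / (1. - pe);
}
```

bayesPos(0.1, 0.05, 0.7) gives the steep plunge to about 0.007, bayesNeg(0.1, 0.05, 0.7) gives the boost to about 0.317, and bayesNeg(0.5, 0.05, 0.7) indeed comes out above 1.58, flagging the bad input data.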

Sunday, October 4, 2015

Bayes 2: hypotheses and events

A bayesian expert system deals with the probabilities of hypotheses. It tries to compute these probabilities based on the experimentation with the black box, and choose the most probable one (or maybe more than one).

The probability of a hypothesis H is commonly written as P(H). Naturally, it's a real number in the range between 0 ("definitely false") and 1 ("definitely true"). If you haven't dealt with the probabilities before, now you know this.

Each hypothesis H has an opposite hypothesis that I'll write as ~H (in mathematical texts it's written as H with a line above it, but that's inconvenient to type, so I chose a C-like notation). If the hypothesis Hlupus is "the patient has lupus" then the opposite ~Hlupus will be "the patient doesn't have lupus". I'll also call ~H the "negative hypothesis", as opposed to the "positive" H.

The probabilities of positive and negative hypotheses are connected:

P(H) + P(~H) = 1

Which means that whatever happens, exactly one of H and ~H is true. We might not know for sure which one is true but if we think that H has the probability 0.7 (or in other words 70%) of being true, this means that it has the probability 0.3 of being false, while the opposite ~H must have the probability 0.3 of being true and 0.7 of being false. This is a very rigid connection, and we can use it for various deductions.

An event is not just an experiment but an experiment with a particular outcome. Not just "look at the patient's throat" but "I've looked at the patient's throat and found that it's sore". Just as the hypotheses, the events have their opposites too. If the event E is "I've looked at the patient's throat and found that it's sore", its opposite ~E will be "I've looked at the patient's throat and found that it's NOT sore". The experiment is the same but its outcome is opposite for the negative event. It's really important to note that the case "I haven't looked at the patient's throat yet" is different: until we look at the throat, neither E nor ~E has happened. But once we look at the throat, we can tell that either E is true and ~E is false for this particular patient, or the other way around, E is false and ~E is true. A typical mistake (one found in Wikipedia, for example) is to mix up the case when we haven't looked at the throat with the case when we looked at the throat and found that it's not sore. If you mix them up, your calculations will all be way wrong. Beware of it.

A clever reader might say: "What if I looked at the patient's throat and can't say for sure whether it's sore or not, it's slightly reddish but not totally so?" Glad you asked, that's the department of fuzzy logic, and I'll get to it later. But for now let's assume that we can tell one from the other. If we can't tell, we'll just treat it the same way as if we haven't done the experiment at all (which is, by the way, consistent with how fuzzy logic treats it).

We can't know the outcome of an experiment before we do the experiment, whether E or ~E will be true. But we can estimate the probability of E in advance, based on the other information we have collected so far. This advance estimation is called the probability of an event, P(E). For example, if we know that 30% of patients visiting a doctor have a sore throat, we can estimate P(Esorethroat)=0.3 based on our knowledge that this person came to see a doctor.

Just as with hypotheses, the probabilities of the complementary events are connected:

P(E) + P(~E) = 1

And it's really not any different from a hypothesis: it really is a hypothesis about the outcome of an experiment we haven't done yet, which then collapses to true or false after the experiment is done and the result is observed.

Some experiments may have more than two possible outcomes. For example, suppose our black box under investigation contains some colored balls in it, and we can stick a hand in it and pull out a ball that might be of one of three colors: blue, yellow or green. This can be treated as three separate events:

Eblue: the ball is blue
Eyellow: the ball is yellow
Egreen: the ball is green

The total of all three probabilities will still be 1:

P(Eblue) + P(Eyellow) + P(Egreen) = 1

The connection between the probabilities of events means that these events are not independent. Observing any of these three events changes the probabilities of the other two events.

But the sum of probabilities in each pair will also be 1:

P(Eblue) + P(~Eblue) = 1
P(Eyellow) + P(~Eyellow) = 1
P(Egreen) + P(~Egreen) = 1

The connected events can be processed sequentially, predicting the probabilities of the not-yet-processed events after each step. We can achieve this in the real world by having an assistant pull the ball out of the box and then asking him questions with "yes" or "no" answers. Suppose we know for some reason that there is an equal probability for each color. Thus initially:

P(Eblue) = 1/3
P(~Eblue) = 2/3

P(Eyellow) = 1/3
P(~Eyellow) = 2/3

P(Egreen) = 1/3
P(~Egreen) = 2/3

Then we have the assistant pull out the ball and ask him: "Is the ball blue?" If the answer is "No" then the event probabilities change:

P(Eblue) = 0
P(~Eblue) = 1

P(Eyellow) = 1/2
P(~Eyellow) = 1/2

P(Egreen) = 1/2
P(~Egreen) = 1/2

The probabilities of Eblue and ~Eblue have collapsed to 0 and 1, and the remaining two outcomes are now equally probable between the two of them, not three.

Now we can ask the next question: "Is the ball yellow?". If the answer is again "No" then the probabilities change like this:

P(Eblue) = 0
P(~Eblue) = 1

P(Eyellow) = 0
P(~Eyellow) = 1

P(Egreen) = 1
P(~Egreen) = 0

At this point we become quite sure that the ball must be green, and we might as well skip asking about it. Not in the real world though. In the real world something might have changed since we've made our original estimations, and more colors might have become available to the box-builder. Or maybe the fourth outcome is extremely rare and our training data didn't contain any example of it. Thus it could happen that we ask "Is the ball green?" expecting to hear "Yes" and get a "No". Obviously, this would be a curve-ball that disturbs the model, but I'll show a possible way of dealing with it.

Bayes 1: basic terms

A couple of years ago, while at Google, I did a "20% project" on a Bayesian expert system. The expert system existed before me, but I was able to improve it. While doing so, I figured out a few things about the Bayesian logic. Maybe they're commonplace knowledge, but I haven't seen them on the Internet, and moreover, most of the descriptions I've seen on the Internet tend to get things somewhat wrong. Especially in Wikipedia. Obviously, I can't talk about the expert system itself, and after a couple of years I don't remember that much about it anyway. But I took notes about the general formulas I've figured out. I've recently found those notes and want to share them.

This has really nothing to do with CEP, though I guess the CEP can be used to drive the expert systems. I just want to share this knowledge somewhere, and why not here?

Some 25 years or so ago I read a British collection of articles on artificial intelligence. One of those articles was on the expert systems based on the Bayesian formulas, and it quite impressed me. Then I played a little more with this stuff in college. I vaguely remember that the author of that article had developed a system called Microexpert, but what I can find about Microexpert now doesn't seem to quite match it, so maybe that one was from another article.

When I started working on that recent expert system and went to refresh my knowledge, I thought: by now Bayes is used all over the place, and the descriptions should be all over the Internet, right? Nope. Not only is there not much, but a lot of it applies the Bayes formula somewhat wrong. So I had to restore this stuff from vague memories and logical reasoning.

First I'll describe what I've restored from my memory and then I'll go into what I've come up with while actually mucking with the realities of the expert systems. This might be well known and published in some obscure articles or maybe even in well-known books, but I don't know where (if you have pointers, you're welcome to comment). But it's not on the Internet and not in a popular form. I want to put it there.

Let's start with the basics. The basic description of the formula created by Bayes can be found all over the Internet. You can look at Wikipedia (in a crappy article), or read one of the lengthy explanations elsewhere. But I'll try to give my own short descriptions and then go further. Read on.

The probability theory talks about hypotheses and events. Both of them relate to some system being examined, some kind of a "black box" whose contents we're trying to guess based on observations. If we could just look inside the box, there would be no need for all this, but in reality we could only poke the box in certain ways and then have to deduce its contents to the best of our knowledge based on that limited experimentation. The contents of the box is usually not very random. We normally know a good deal about what could be in the black boxes we get, we just don't know what is in this particular black box. The goal of an expert system is to look at a black box, do some poking, and classify the contents of the black box to one of the known categories with a good certainty. The expert system example in that old British book was trying to diagnose the automobile breakages. A medical expert system would be trying to diagnose what sickness a patient has.

It's kind of intuitive that the hypotheses are the descriptions of what might be in the box, and thus we're trying to select which contents seem most likely based on our experimentation, which hypothesis is most probable.

The events are the experiments. We might be able to select the experiments to perform or might be limited to observing what the mechanism in the black box does on its own.

The hypotheses are usually denoted H or Hsomething. The events are usually denoted E or Esomething.

Sunday, August 9, 2015

something old

Hey, it turns out that the patent for the dynamic schema modification that I worked on back in the Aleri days actually got issued, way back in 2009:

I've totally missed it and thought that it was still mired in the process. Looks like it took only 8 months from the final filing to the issue.

Wednesday, June 24, 2015

On simplicity

Today someone brought this quote to my attention:

<<And then there is one element of my quality metric which seems to be at the heart of the controversy: I believe that computers are meant to serve people and not the other way around. That means that if something inherently simple is difficult to do, it's wrong.

Not everyone agrees with this. A surprising number of people like things to be difficult. I have never understood this, but it is a fact. Some people revel in the intellectual studliness of mastering, say, C++ template programming or x86 assembler. I don't. I think C++ is a pain in the ass. Life is short, and I would rather spend my allotted time skiing or making love to my wife than worrying about whether or not I need to define a virtual destructor in my base class in order to avoid memory leaks. And that means that when I want, say, for something to appear centered on the screen, I want to be able to do it by typing "center" rather than "margin:auto". If I can't tell a computer to center something by saying "center", then as far as I'm concerned the computer is wrong, not me.

Like I said, not everyone buys in to this worldview. I don't get it, but it is surprisingly common. I think it might have something to do with job security. I first encountered the complicated-is-beautiful mindset when I was working for NASA. I had an idea for how to simplify spacecraft sequencing and make it less error prone. I figured everyone would love it. To the contrary, the establishment fought the idea tooth-and-nail (and still does to this day). And this attitude is everywhere. It's the reason the tax code is so fucked up. It's the reason the law is so byzantine. It's the reason that languages like C++ and Perl thrive.>>

I see a major issue with this kind of thinking. Now, let me get this straight: I don't normally do the web stuff, so I don't have a whole lot of opinion about CSS as such. For what opinion I have, I agree with Ron that CSS sucks, and personally I like to see everything in one file, without artificially splitting it. Nothing is worse than dealing with lots of small files with cross-dependencies. For what it's worth, I hate JavaScript and the web sites written with it; why do they have to be so slow and annoying?

But let's return to the subject of simplicity. I think there is more than one failure on Ron's part.

First of all, he seems to miss that everyone has a different definition of simplicity. And that applies not only to him but to many other people, and not only to the simplicity but to many other subjects. There are many people who like to quote the Principle of the Least Surprise. But then they go and propose the solutions that surprise me immensely.

My favorite example is a poll where people were asked, whether they think that they're better drivers than average, and more than 80% of them answered "yes". And then the article about that poll went on to talk about how people overestimate themselves. What the article failed to appreciate though is that everyone has a different definition of what a "good driver" is. Some people think that it means driving faster, some think that it means driving slower, some think that showing the turn signals is important, some don't know about the existence of the turn signals. Sure enough, most people strive to be good drivers by their own definition, and really become the good drivers by their own definition. The remaining 20% must have been the really lousy drivers.

So in my reckoning C++ and Perl are the simple languages. They allow for simple and easy writing and reading of the code. Yeah, C++ has some strange points, like why not make these virtual destructors mandatory if you define the virtual methods? And the early C++ really was a pain. But after the addition of templates it became awesome. It's not that you want to use the templates for everything, and you specifically don't want to use the stuff from the Alexandrescu book in any actual project. And they really need the explicit loops instead of the stupid functional recursion, and a better error diagnostics. But when they're used right, they're awesome. The modern C++ really combines the simplicity aspects of C and Perl. And Perl is an awesome simple language. It's driven by how to make things easy and convenient to use.

See, there are at least two definitions of simplicity. One definition thinks that simplicity equals logic and consistency, that everything that is logical and consistent is simple. The other definition thinks that things should be easy to use. The typical problems need to be solvable with minimal effort, and if that requires inconsistency, so be it, the consistency is of not much use by itself. If we look at the classic programming languages, C is convenient while Pascal is all-consistent (and oh so much pain to use). Perl is convenient, Python is consistent (and oh so much pain to use, have you ever tried to read the code in Python?). C++ is convenient, Java is consistent (and is just incredibly painful to use, it's like raising the sun with manual labor). Shell is convenient, PowerShell is consistent (and well, you've got my drift).

Second, the wrong kind of simplicity is bad. During my time at Google, I've actually heard with my own ears someone say: "The system has got so many features to it that it became difficult to modify, so we had to throw it away and rewrite it from scratch without the extra features". This is, I think, the pinnacle of how things can go wrong. Yes, a system without many features is much easier to write. And probably keeping just 80% of the features would make it much easier to maintain. But there is a good reason for the features: they make the system easy to use. Nobody needs a system with 80% of the features because it becomes unusable. And well, if someone can't understand and maintain it, they're probably the wrong people to do the job.

That said, of course there is a limit: there are features that make the life of the users convenient and features for the sake of features. The first kind is good, the second kind is bad, and telling which is which may sometimes be pretty hard. It might not even be that someone (er, marketing) had included the features for the sake of features; they might have been a good idea at some point and then got supplanted by better ways. Things evolve, and sometimes it's time for some features to die. But not wholesale, and not by the principle of simplicity of implementation or for some dumb consistency. The best and most simple-to-use features are often difficult to implement and full of quirks. But that's because the tasks they're solving are difficult and full of quirks, and the features make these tasks easy by covering the complexity and adapting to the quirks.

Wednesday, April 22, 2015

On performance of reading the unaligned data

I've made a template that was supposed to optimize the reading of the unaligned data:

template <typename T>
T getUnaligned(const T *ptr)
{
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        return *ptr;
    }
}
And I've been wondering if it really makes a difference. So I've made a test. The test compares 5 cases:

Just read the aligned memory directly;
Read the aligned memory through memcpy;
Read the unaligned memory through memcpy;
Read the aligned memory through the template;
Read the unaligned memory through the template.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount();
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        x += apdata[i % ARRAYSZ];
    }
    double tend = now();
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n", (long long)x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}

The results on an AMD64 machine with -O3 came out like this:

   case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that this machine doesn't care about alignment, and generates memcpy() as a plain memory read; and for the condition in the template, it's also smart enough to see that both branches end up with the same code and to eliminate the condition check.

Smart, huh? So now I'm not sure whether keeping this template as it is makes sense, or whether I should just change it to a plain memcpy().

BTW, I must say it took me a few attempts to come up with the code that would not get optimized out by the compiler. I've found out that memcpy() is not suitable for reading volatile data nowadays. It defines its second argument as (const void *), so the volatility has to be cast away to pass the pointer to it, and then the compiler just happily takes the whole memory read out of the loop. And it was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting a sum. That's where the modulo operator came in handy to confuse this extreme smartness.

A separate interesting note is that each iteration of the loop executes in about one-billionth of a second. Since this is a 3GHz CPU, every iteration takes only about 3 CPU cycles. That's 6 or 10 instructions executed in 3 CPU cycles. Such are the modern miracles.
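For completeness, here is what a typical use of such a template looks like: pulling a fixed-size value out of a packed byte buffer at an arbitrary offset. This is a self-contained sketch (with the braces written out in full, and readInt32At being my own example name), not the exact code from my test:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Read a value of type T that might not be aligned in memory.
template <typename T>
T getUnaligned(const T *ptr)
{
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        // misaligned: copy to a local, letting memcpy deal with it
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        // aligned: a plain read
        return *ptr;
    }
}

// Example: extract a 32-bit integer at an arbitrary byte offset.
int32_t readInt32At(const unsigned char *buf, size_t off)
{
    return getUnaligned((const int32_t *)(buf + off));
}
```

Either way the read goes, the caller gets the same value back, which is exactly why the compiler can fold both branches into the same machine code.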

ThreadedClient improvement of timeout handling

The work on updating the tests for the new ordered index brought a nice side effect: the handling of timeouts in the Perl class X::ThreadedClient has improved. Now it returns not only the error message but also the data received up to that point. This helps a lot with diagnosing the errors in the automated tests of TQL: the error messages get returned in a clear way.

Friday, April 17, 2015

Ordered Index implemented in C++

The ordered index implemented in Perl has been pretty slow, so I've finally got around to implementing one in C++. It became much faster. Here is an excerpt from the performance test results (with an optimized build):

Table insert makeRowArray (single hashed idx, direct) 0.551577 s, 181298.48 per second.
  Excluding makeRowArray 0.311936 s, 320578.44 per second.
Table insert makeRowArray (single ordered int idx, direct) 0.598462 s, 167095.09 per second.
  Excluding makeRowArray 0.358821 s, 278690.37 per second.
Table insert makeRowArray (single ordered string idx, direct) 1.070565 s, 93408.64 per second.
  Excluding makeRowArray 0.830924 s, 120347.91 per second.
Table insert makeRowArray (single perl sorted idx, direct) 21.810374 s, 4584.97 per second.
  Excluding makeRowArray 21.570734 s, 4635.91 per second.
Table lookup (single hashed idx) 0.147418 s, 678342.02 per second.
Table lookup (single ordered int idx) 0.174585 s, 572785.77 per second.
Table lookup (single ordered string idx) 0.385963 s, 259092.38 per second.
Table lookup (single perl sorted idx) 6.840266 s, 14619.32 per second.

Here the "ordered" is the new ordered index in C++ (on an integer field and on a string field), and "perl sorted" is the ordering in Perl, on an integer field. Even though the ordered index is a little slower than the hashed index, it's about 40-70 times faster than the ordering implemented in Perl.

Naturally, ordering by a string field is slower than by an integer one: not only does it have to compare more bytes one by one, it also honors the locale-specific collation order.

In Perl, the ordered index type is created very similarly to the hashed index type:

Triceps::IndexType->newOrdered(key => [ "a", "!b" ])

The single option "key" specifies the array with the names of the key fields of the index. The "!" in front of a field name says that the order by this field is descending; otherwise the order is ascending. This is a more compact and arguably easier-to-read format than the one used by the SimpleOrderedIndex.

As usual, the NULL values in the key fields are permitted, and are considered less than any non-NULL value.
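These ordering rules are easy to express in pseudo-code. Here is a small Python sketch (not the actual Triceps implementation; the function name is made up) of a comparator for a key spec like ["a", "!b"], with None standing in for NULL:

```python
# A sketch of the ordered index comparison rules: "!" marks a
# descending field, and a NULL (None) sorts before any non-NULL value.
def make_cmp(key_spec):
    fields = [(f.lstrip("!"), f.startswith("!")) for f in key_spec]

    def cmp_rows(r1, r2):
        for name, descending in fields:
            v1, v2 = r1.get(name), r2.get(name)
            if v1 == v2:
                continue  # tie on this field, look at the next one
            if v1 is None:      # NULL is less than any non-NULL value
                less = True
            elif v2 is None:
                less = False
            else:
                less = v1 < v2
            if descending:      # "!" flips the direction
                less = not less
            return -1 if less else 1
        return 0  # all key fields are equal

    return cmp_rows
```

For example, make_cmp(["a", "!b"]) orders the rows ascending by "a" and, within equal "a", descending by "b".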

The array fields may also be used in the ordered indexes.

The ordered index does return the list of key fields with getKey(), so it can be used in joins and can be found by key fields with TableType::findIndexPathForKeys() and friends, just like the hashed index.

The getKey() for an ordered index returns the list of plain field names, without the indications of descending order. Thus the finding of the index by key fields just works out of the box, unchanged.

To get the key fields as they were specified in the index creation, including the possible "!", use the new method:

@fields = $indexType->getKeyExpr();

The idea of this method is that the contents of the returned array depend on the index type and form an "expression" that can be used to build another instance of the same index type. For the hashed index it simply returns the same data as getKey(). For the ordered index it returns the list of keys with the order indications. For the indexes with Perl conditions it still returns nothing, though in the future it might be used to store the condition.

In the C++ API the index type gets created with the constructor:

OrderedIndexType(NameSet *key = NULL);
static OrderedIndexType *make(NameSet *key = NULL);

The key can also be set after construction:

OrderedIndexType *setKey(NameSet *key);

Same as in Perl, the field names in the NameSet can be prefixed with a "!" to specify the descending order on that field.

The new method in the IndexType is:

virtual const NameSet *getKeyExpr() const;

And the new constant for the index id is IT_ORDERED, available in both C++ and Perl.

Sunday, March 22, 2015

more of performance numbers, with optimization

I've realized that the previously published performance numbers were produced by a build without optimization. So I've enabled the optimization with -O3 -fno-strict-aliasing, and the numbers improved, some of them by more than a factor of two (that's still an old Core 2 Duo 3GHz laptop):

Performance test, 100000 iterations, real time.
Empty Perl loop 0.005546 s, 18030711.03 per second.
Empty Perl function of 5 args 0.031840 s, 3140671.52 per second.
Empty Perl function of 10 args 0.027833 s, 3592828.57 per second.
Row creation from array and destruction 0.244782 s, 408526.42 per second.
Row creation from hash and destruction 0.342986 s, 291557.00 per second.
Rowop creation and destruction 0.133347 s, 749922.94 per second.
Calling a dummy label 0.047770 s, 2093352.57 per second.
Calling a chained dummy label 0.049562 s, 2017695.16 per second.
  Pure chained call 0.001791 s, 55827286.04 per second.
Calling a Perl label 0.330590 s, 302489.48 per second.
Row handle creation and destruction 0.140406 s, 712218.40 per second.
Repeated table insert (single hashed idx, direct) 0.121556 s, 822665.80 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 0.337209 s, 296551.58 per second.
  RowHandle creation overhead in Perl 0.215653 s, 463707.00 per second.
Repeated table insert (single sorted idx, direct) 1.083628 s, 92282.62 per second.
Repeated table insert (single hashed idx, call) 0.153614 s, 650981.13 per second.
Table insert makeRowArray (single hashed idx, direct) 0.553364 s, 180713.02 per second.
  Excluding makeRowArray 0.308581 s, 324063.65 per second.
Table insert makeRowArray (double hashed idx, direct) 0.638617 s, 156588.31 per second.
  Excluding makeRowArray 0.393835 s, 253913.40 per second.
  Overhead of second index 0.085254 s, 1172969.41 per second.
Table insert makeRowArray (single sorted idx, direct) 22.355793 s, 4473.11 per second.
  Excluding makeRowArray 22.111011 s, 4522.63 per second.
Table lookup (single hashed idx) 0.142762 s, 700466.78 per second.
Table lookup (single sorted idx) 6.929484 s, 14431.09 per second.
Lookup join (single hashed idx) 2.944098 s, 33966.26 per second.
Nexus pass (1 row/flush) 0.398944 s, 250661.96 per second.
Nexus pass (10 rows/flush) 0.847021 s, 1180608.79 per row per second.
  Overhead of each row 0.049786 s, 2008583.51 per second.
  Overhead of flush 0.349157 s, 286403.84 per second.

I've also tried to run the numbers on a newer laptop with a 2GHz Core i7 CPU, in a VM configured with 2 CPUs. On the build without optimization, the numbers came out very similar to the old laptop. On the build with optimization they came out up to 50% better, but not consistently so (perhaps running in a VM added variability).

Sunday, March 1, 2015

Triceps 2.0.1 released

This time I'm trying to update the docs following the code changes, and do the small releases.

The release 2.0.1 is here! It includes:

  • Fixed the version information that was left incorrect (at 0.99).
  • Used a more generic pattern in tests for Perl error messages that have changed in the more recent versions of Perl (per CPAN report #99268).
  • Added a more convenient way to wrap the error reports in Perl, Triceps::nestfess() and Triceps::wrapfess().
  • Added functions for nicer printing of auto-generated code, Triceps::alignsrc() and Triceps::numalign().
  • In the doc chapter on the templates, fixed the output of the examples: properly interleaved the inputs and outputs.

Monday, December 29, 2014

formatting the source code snippets

I've gotten to printing the snippets of the source code in the error messages when said code fails to compile, and realized that it could be printed better. So I wrote a couple of helper functions to print it better.

First of all, it could be indented better, to match the indenting of the rest of the error message.

Second (and connected), it could be left-aligned better: the code snippets tend to have extra spacing on the left.

And third, it could use the line numbers, to make the error location easier to find.

The magic is done with two functions, Triceps::Code::alignsrc() and Triceps::Code::numalign(). They work in the exact same way, have the same arguments, only numalign() adds the line numbers while alignsrc() doesn't.

Here is an example of use:

        confess "$myname: error in compilation of the generated function:\n  $@function text:\n"
        . Triceps::Code::numalign($gencode, "  ") . "\n";

It can produce an error message like this (with a deliberately introduced syntax error):

Triceps::Fields::makeTranslation: error in compilation of the generated function:
  syntax error at (eval 27) line 13, near "})
function text:
     2 sub { # (@rows)
     3   use strict;
     4   use Carp;
     5   confess "template internal error in Triceps::Fields::makeTranslation: result translation expected 1 row args, received " . ($#_+1)
     6     unless ($#_ == 0);
     7   # $result_rt comes at compile time from Triceps::Fields::makeTranslation
     8   return $result_rt->makeRowArray(
     9     $_[0]->get("one"),
    10     $_[0]->get("two"),
    11   );
    12 })
 at /home/babkin/w/triceps/trunk/perl/Triceps/blib/lib/Triceps/ line 219
    Triceps::Fields::makeTranslation('rowTypes', 'ARRAY(0x2943cb0)', 'filterPairs', 'ARRAY(0x2943b30)', '_simulateCodeError', 1) called at t/Fields.t line 205
    eval {...} called at t/Fields.t line 204

The first argument of alignsrc() and numalign() is the code snippet string to be aligned. The following will be done to it:

  • The empty lines at the front will be removed. numalign() is smart enough to take the removed lines into account and adjust the numbering. That's why the numbering in the error message shown above starts with 2. You can also get the number of the removed lines afterwards, from the global variable $Triceps::Code::align_removed_lines.
  • The \n at the end of the snippet will be chomped. But only one, the rest of the empty lines at the end will be left alone.
  • Then the "baseline" indenting of the code will be determined by looking at the first three and last two lines. The shortest non-empty indenting will be taken as the baseline. If some lines examined start with spaces and some start with tabs, the lines starting with tabs will be preferred as the baseline indenting.
  • The baseline indenting will be removed from the front of all lines. If some lines in the middle of the code have a shorter indenting, they will be left unchanged.
  • The tabs will be replaced by two spaces each. If you prefer a different replacement, you can specify it as the third argument of the function.
  • In numalign() the line numbers will be prepended to the lines.
  • The indenting from the second argument of the function will be prepended to the lines.

The second argument contains the new indenting that aligns the code nicely with the rest of the error message. Technically it's optional, and defaults to an empty string.

The third argument is optional and provides an alternative replacement for the tab characters in the code. If it's an empty string, it reverts to the default two spaces "  ". To keep the tabs unchanged, specify it as "\t".
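To summarize the steps, here is a simplified Python sketch of the numalign() logic (the real implementation is Triceps::Code::numalign in Perl; this sketch omits the tabs-preferred baseline selection and the global variable for the removed line count):

```python
def numalign(code, indent="", tabrepl="  "):
    lines = code.split("\n")
    if lines and lines[-1] == "":   # chomp a single trailing \n
        lines.pop()
    removed = 0                     # strip the empty lines at the front,
    while lines and lines[0].strip() == "":
        lines.pop(0)                # remembering how many, for numbering
        removed += 1

    def leading_ws(s):
        return s[:len(s) - len(s.lstrip())]

    # baseline indent: the shortest one among the first 3 and last 2 lines
    sample = [l for l in (lines[:3] + lines[-2:]) if l.strip()]
    baseline = min((leading_ws(l) for l in sample), key=len, default="")

    out = []
    for i, l in enumerate(lines):
        if l.startswith(baseline):  # shorter-indented lines stay unchanged
            l = l[len(baseline):]
        out.append("%s%4d %s" % (indent, i + 1 + removed, l.replace("\t", tabrepl)))
    return "\n".join(out)
```

On a snippet that starts with one empty line, the numbering starts with 2, matching the error message example above.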

Friday, December 26, 2014

error reporting in the templates

When writing the Triceps templates, it's always good to make them report any usage errors in the terms of the template (though the extra detail doesn't hurt either). That is, if a template builds a construction out of the lower-level primitives, and one of these primitives fails, the good approach is not to just pass through the error from the primitive but to wrap it into a high-level explanation.

This is easy to do if the primitives report the errors by returning them directly, as Triceps did in the version 1. Just check for the error in the result, and if an error is found, add the high-level explanation and return it further.

It becomes more difficult when the errors are reported as exceptions, which in Perl means by die() or confess(). The basic handling is easy: there is no need to do anything to let the exception propagate up. But adding the extra information becomes difficult. First, you've got to explicitly check for these errors by catching them with eval() (which is more difficult than checking for directly returned errors), and only then can you add the extra information and re-throw. And then there is the pesky problem of the stack traces: if the re-throw uses confess(), it will likely add a duplicate of at least a part of the stack trace that came with the underlying error; and if it uses die(), the stack trace might be incomplete, since the native XS code includes the stack trace only up to the nearest eval(), to prevent the same duplication when unrolling the stacks mixed between Perl and Triceps scheduling.

Because of this, some of the template error reporting got worse in Triceps 2.0.

Well, I've finally come up with the solution. The solution is not even limited to Triceps, it can be used with any kind of Perl programs. Here is a small example of how this solution is used, from Fields::makeTranslation():

    my $result_rt = Triceps::wrapfess
        "$myname: Invalid result row type specification:",
        sub { Triceps::RowType->new(@rowdef); };

The function Triceps::wrapfess() takes care of wrapping the confessions. It's very much like the try/catch, only it has the hardcoded catch logic that adds the extra error information and then re-throws the exception.

Its first argument is the error message that describes the high-level problem. This message will get prepended to the error report when the error propagates up (and the original error message will get a bit of extra indenting, to nest under that high-level explanation).

The second argument is the code that might throw an error, like the try-block. The result from that block gets passed through as the result of wrapfess().

The full error message might look like this:

Triceps::Fields::makeTranslation: Invalid result row type specification:
  Triceps::RowType::new: incorrect specification:
    duplicate field name 'f1' for fields 3 and 2
    duplicate field name 'f2' for fields 4 and 1
  Triceps::RowType::new: The specification was: {
    f2 => int32[]
    f1 => string
    f1 => string
    f2 => float64[]
  } at blib/lib/Triceps/ line 209.
    Triceps::Fields::__ANON__ called at blib/lib/ line 192
    Triceps::wrapfess('Triceps::Fields::makeTranslation: Invalid result row type spe...', 'CODE(0x1c531e0)') called at blib/lib/Triceps/ line 209
    Triceps::Fields::makeTranslation('rowTypes', 'ARRAY(0x1c533d8)', 'filterPairs', 'ARRAY(0x1c53468)') called at t/Fields.t line 186
    eval {...} called at t/Fields.t line 185

It contains both the high-level and the detailed description of the error, and the stack trace.

The stack trace doesn't get indented, no matter how many times the message gets wrapped. wrapfess() uses a slightly dirty trick for that: it assumes that the error messages are indented by the spaces while the stack trace from confess() is indented by a single tab character. So the extra spaces of indenting are added only to the lines that don't start with a tab.

Note also that even though wrapfess() uses eval(), there is no eval above it in the stack trace. That's the other part of the magic: since that eval is not meaningful, it gets cut from the stack trace, and wrapfess() also uses it to find its own place in the stack trace, the point from which a simple re-confession would dump the duplicate of the stack. So it cuts the eval and everything under it in the original stack trace, and then does its own confession, inserting the stack trace again. This works very well for the traces thrown by the XS code, which actually doesn't write anything below that eval; wrapfess() then adds the missing part of the stack.

Wrapfess() can do a bit more. Its first argument may be another code reference that generates the error message on the fly:

    my $result_rt = Triceps::wrapfess sub {
            "$myname: Invalid result row type specification:"
        },
        sub { Triceps::RowType->new(@rowdef); };

In this small example it's silly, but if the error diagnostics are complicated and require some elaborate printing of the data structures, the message-generating code will be called only if the error actually occurs, and the normal code path will avoid the extra overhead.

It gets even more flexible: the first argument of wrapfess() might also be a reference to a scalar variable that contains either a string or a code reference. I'm not sure yet if it will be very useful, but it was easy to implement. The idea is that it allows writing only one wrapfess() call and then changing the error messages from inside the second argument, providing different error reports for its different sections. Something like this:

    my $eref;
    return Triceps::wrapfess \$eref,
        sub {
          $eref = "Bad argument foo";
          ...
          $eref = sub {
            my $bardump = $argbar->dump();
            $bardump =~ s/^/    /mg;
            return "Bad argument bar:\n  bar value is:\n$bardump";
          };
          ...
        };

It might be too flexible, we'll see how it works.

Internally, wrapfess() uses the function Triceps::nestfess() to re-throw the error. Nestfess() can also be used directly, like this:

eval {
  ...
};
if ($@) {
  Triceps::nestfess("High-level error message", $@);
}

The first argument is the high-level descriptive message to prepend, the second argument is the original error caught by eval. Nestfess() is responsible for all the smart processing of the indenting and stack traces, wrapfess() is really just a bit of syntactic sugar on top of it.
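For illustration, the combined logic of wrapfess() and nestfess() can be sketched in Python terms (Python exceptions standing in for Perl's die/confess; this is a sketch of the idea, not the real implementation):

```python
def nestfess(message, error):
    # indent the nested error message but not the stack trace: by the
    # convention described above, the messages are indented with spaces
    # while the trace lines from confess() start with a tab
    nested = "\n".join(
        l if l.startswith("\t") else "  " + l
        for l in str(error).split("\n"))
    raise RuntimeError(message + "\n" + nested)

def wrapfess(message, code):
    try:
        return code()  # the "try-block"; its result is passed through
    except Exception as e:
        # the message may be a plain string or a callable that builds
        # the string lazily, only when the error actually occurs
        nestfess(message() if callable(message) else message, e)
```

On success the result of the block is passed through; on error the high-level message gets prepended and the original one nested under it with the extra indenting.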

Saturday, December 20, 2014

nested record and ORM

Some more thoughts on the subject. First of all, why should anyone care about the nested records?

Nowadays people are used to the relational model. But next to all of this sits Google with its protobufs. It has published the information about protobufs and about some of its systems that use them, but I think there is not enough appreciation of how much the protobufs mean for Google. If you look at the published information, you'll realize that everything at Google, totally everything, is based on protobufs. They're used as the arguments and return values of the RPC calls, the map-reduce is built on them, as well as all kinds of no-SQL DBMS and a few SQL DBMS. And there are good reasons why they're used so much. Well, for the no-SQL databases it's a no-brainer: if all a database can do is store key-value pairs, at least the values must be capable of representing complexity. To understand the other reasons, let's look at the difference between the nested records and the Object Relation Managers.

Entirely coincidentally, I've encountered recently a reference to the article at, on the subject of ORM. It's an interesting viewpoint though the conclusions don't seem quite right to me, as well as a few of the pre-suppositions.

The idea that the object model has the concept of identity and the relational model doesn't seems wrong to me. In reality the primary keys are used all over the real databases, and the primary keys are exactly that: identity. Even more so if the primary key is not a part of the data but an auto-generated unique value. Here we can go into a discussion of "but it's just a hack, not a part of the theory", but think about this: these auto-generated unique keys are the bread and butter of data normalization, which certainly is a part of the theory. The point of normalization is to get rid of the duplicate data in the rows and move it to a separate table, where it will exist as a single copy with multiple short references to it from the original table. But so are the references between objects: multiple objects may refer to the same subordinate object. And a reference to that object's address (its identity) is just the same thing as a reference to the unique key in the subordinate table.

So, what ORM does is map one-to-one the objects and records, and maps the references between the objects by address to these references between the records by the unique id values. The joins are then the way to navigate the references between the objects.

The nested records in protobufs (or their equivalents) have a different meaning and are used differently. The whole point of a nested record is that it represents a self-contained unit of data. This data is all stored together because it's normally used together. It's obvious to see when the protobufs are used for RPC, but it works the same when the protobufs are used in a database. Obviously, not everything in the world can be covered by this usage, and there are downsides. An obvious downside is that a nested record in a protobuf can't be modified piecemeal: if you need to change it, you have to build a whole new record and put it back in place. And obviously the data is stored by value, not by reference, so the duplicate data has to be duplicated. But for the situations where the self-containment is really important, they work great. As in all the massively-parallel processing at Google: each nested record has enough information in it to be processed separately from the others. Again, not everything in the world can be expressed like this, but what can, benefits tremendously from the nested records.

Now I think I can formulate what the difference is between the ORM and the nested records: the ORMs connect the objects by reference, while the nested records bunch up the data by value. Different uses, different trade-offs, different implementations. (Though for the in-memory implementations the boundary is more blurred.)
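The contrast is easy to show on a toy example (plain Python dicts standing in for records; the data is made up for illustration):

```python
# ORM style, by reference: one shared copy, referred to by its unique id.
addresses = {1: {"city": "Boston"}}
people_rel = [
    {"name": "Ann", "address_id": 1},
    {"name": "Bob", "address_id": 1},  # refers to the same single copy
]

# Nested-record style, by value: each record is self-contained,
# so the shared data gets duplicated in each of them.
people_nested = [
    {"name": "Ann", "address": {"city": "Boston"}},
    {"name": "Bob", "address": {"city": "Boston"}},  # a duplicate copy
]
```

The by-reference form needs a join (or a lookup by id) to read the address but can update it in one place; the by-value form reads everything in one step but duplicates the data.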

And I think I've found the explanation of why they say that Oracle's nested tables are slower than the manual joins, even though they amount to joins done the ORM-like way. Of course, it could just be a limitation of the query optimizer: if the optimizer runs first and only then the nesting is expanded, it obviously would not be as optimal. But that should be easy to fix (just expand the nesting and then run the optimizer), so this is unlikely to be the problem. A more fundamental problem seems to be that the records are sent into the database with the data by value, and then they get broken up into the flat tables and stored with connections by reference. That successfully combines the data duplication of by-value with the cost of the extra joins of by-reference. Both kinds of shortcomings rolled together.

Coming back to that point that not everything in the world can be represented as the nested records in protobufs. This means that the references between the tables (or objects, if you prefer to think about them that way) are still needed, and the nested records still need to use them. Which means fitting them into the relational model. And that's where my whitepaper comes in: how to represent the nested records in the relational model. They can be logically dis-joined into the "underlying tables" and the relational operations can be applied to them in the usual way.

Which can also solve the problem of the representation of object inheritance and polymorphism in ORMs, described in the article I mentioned above. Representing the polymorphic data as nested records is fairly straightforward, as long as the records contain an indication of the subtypes. And as I've shown in the whitepaper, mapping the nested records to the relational use is also straightforward. Put them together and you have the solution.

Thursday, December 18, 2014

more on nested records (and Oracle)

As people have pointed out to me, Oracle has the concept of "nested tables" and "variable arrays" (VARRAY). Since at least the version 9i, a very long time ago.

The nested tables differ in how they store the data: they break out the repeating sections into a separate table and automatically create the id fields to link them together, so a select just does an implicit join. It's like an embedded ORM (by the way, there is also a concept of references to rows, apparently also implemented in an ORM way). Strangely, they say that the nested tables are slower than the usual joins. Not sure why.

The VARRAYs are more like the real nested structures, they store the data directly in the row, and they preserve the order of nested elements. However the maximal size of VARRAYs has to be declared in advance.

Obviously, the nested tables (the ORM version) would be much faster on updates, they can change the nested parts very easily without changing the rest of the records. And also faster on scanning of the non-repeating parts.

The version with the repeated parts stored directly in the records, such as protobufs or VARRAYs, should be faster at reading these repeated parts.

The syntax for accessing the nested parts is like this:

select ...
from
  main_table upper,
  table(upper.nested) nested;

So you can select the nested fields as a single field of a record type, or you can flatten them as shown above. And the joins within the same top-level records should be possible to express in the same syntax, like this:

select ...
from
  main_table upper,
  table(upper.nested1) nested1,
  table(upper.nested2) nested2;

There also are the examples of deleting the nested parts, like:

delete from table (
  select nested from upper where key = 1234
) nested
where nested.field = 5678;

Or even by the whole value of the nested part.

And by the same example, you can insert a nested part into the record, or update one.

But the part that seems to be missing is the un-flattening. Can you select data from one table with nested data, re-format the nested parts, create different nested records, and either return them from select or insert them into another table? I haven't found any examples of that. It looks like the nested records aren't a first-class citizen in Oracle either.

Friday, December 12, 2014

Relational Operations on the Nested Data

I wrote this little whitepaper that I don't know where to put, so I'll just put it here. It gives some interesting ideas for the future development of Triceps. The PDF version can also be found at

It might also help to read first my explanation of why the nested data is important and how it differs from the Object Relations Managers at

Relational Operations on the Nested Data

Sergey A. Babkin


This paper shows a way to integrate the RDBMS and the nested records to their mutual benefit. The nested records dovetail with the relational model, covering the weaknesses of the latter, such as the representation of the multi-dimensional 1:M joins, and allowing an optimization of other kinds of joins.


The use of the nested data structures seems to be growing in popularity for the distributed data processing: Google has protobufs, Hadoop and Bing have their own formats, XML has been around for a long time, and so on. Among other things, they are used as records in the non-relational and kind-of-relational DBMS. While at Google, I've used the Dremel system that implements a read-only columnar database over the tables of nested records in the protobuf format, and while it was convenient in some ways, in other ways the operations in it felt unnecessarily difficult. To be honest, by now I can't even formulate why exactly it felt so: it's been a while since I've used Dremel, and I forgot the details, so I have to guess from the logical considerations in the published description in [Dremel]. The theory behind it is described very nicely in [Nested] and in the documents it refers to. It was surprising to find out that their model is based on a paper from 1988 [Report], and apparently nothing much has happened since then. Everything I've found on the subject seems to be based on [Report].

In this paper I want to show that:
1. There is more to the nested data structures than described in [Report].
2. There is a straightforward way to map the nested data to the relational operations and take advantage of its locality.
3. A common record-based (as opposed to columnar, such as Dremel) RDBMS can be adapted to take advantage of the nested records with a small set of changes.

To forestall a typical question, this is not a proposal for a way of doing ORM that translates from objects to flat tables, and is not in any way related to ORM. It is about the adoption of the nested records into the relational theory.

Nested data is multi-dimensional

The [Report] deals with the nesting but all the nesting discussed there is one-dimensional. The schema of an example from there can be written out in a pseudo-code similar to Google protobufs in Listing 1. Even though the nesting is two-deep, it goes in only one direction, growing in one dimension.

struct client {
  string name;
  string address;
  repeated struct company {
    string name;
    repeated struct share {
      float price;
      string date;
      int quantity;
    } shares;
  } companies;
};

Listing 1. One-dimensional nested data structure.

But the real-world nested records tend to have multiple repeated elements. For example, consider a structure describing a theatrical company. It might list the pieces produced by this company in the current season and the actors in the company, as shown in Listing 2. This structure has two separate repeating fields going in parallel at the same level, expanding in two dimensions.

struct company {
  string name;
  repeated struct piece { // dimension 1
    string name;
    string start_date;
    string end_date;
  } pieces;
  repeated struct actor { // dimension 2
    string name;
    string bio;
  } actors;
};

Listing 2. Two-dimensional nested data structure.

For another example, the [Dremel] paper describes a structure shown in Listing 3.

struct Document {
  int64 DocId;
  struct {
    repeated int64 Backward; // dimension 1
    repeated int64 Forward; // dimension 2
  } Links;
  repeated struct { // dimension 3
    repeated struct Language {
      string Code;
      string Country;
    } Languages;
    string Url;
  } Names;
};

Listing 3. Three-dimensional nested data structure from [Dremel] paper.

This example has four repeated elements in three dimensions, one of the dimensions nesting two-deep.

Or there might be further cross-references in the nested data; for example, the theatrical company description may have a list of actors for each piece, as shown in Listing 4. Rather than repeating the actor's biography in each piece, it can be found by name, descending down the other dimension. Note that the field piece.actor_name is also repeated, creating a nesting two levels deep.

struct company {
  string name;
  repeated struct piece { // dimension 1
    string name;
    string start_date;
    string end_date;
    repeated string actor_name; // cross-reference to dimension 2
  } pieces;
  repeated struct actor { // dimension 2
    string name;
    string bio;
  } actors;
};

Listing 4. Two-dimensional nested data with nested cross-references.

The [Report] doesn't provide a solution for the multi-dimensional data. There even are papers like [NoGood] that argue that the nested structures are not a good idea to start with, citing in particular the cross-references like the ones shown above between the pieces and the actors. But with a slightly different approach, the nested structures are quite useful.

Nested data is the result of a Join

People argue everywhere, including Wikipedia, that the nested data saves the need for joins. And it looks obvious, yet I haven't seen this fact stated before in another form, looking from a different standpoint:

A nested structure can be thought of as the result of a join.

The abovementioned theatrical company data from Listing 4 can be represented in the pure relational way as multiple joined tables, shown in Listing 5 in an SQL-like pseudo-code.

create table company (
  company_id integer,
  name varchar
);

create table pieces (
  company_id integer,
  piece_id integer,
  piece_name varchar,
  start_date varchar,
  end_date varchar
);

create table piece_actors (
  company_id integer,
  piece_id integer,
  p_actor_name varchar
);

create table actors (
  company_id integer,
  actor_name varchar,
  actor_bio varchar
);

create view company_w as
select * from
  company
  left outer join pieces // dimension 1
    on company.company_id = pieces.company_id
  left outer join piece_actors // nesting in dimension 1
    on pieces.company_id = piece_actors.company_id
    and pieces.piece_id = piece_actors.piece_id
  left outer join actors // dimension 2
    on company.company_id = actors.company_id;

Listing 5. Construction of the nested records by joining the flat tables.

This glosses over the naming of the fields in the flat “select *”, but if the fields used in the join conditions are named the same in each use, and the rest of the fields are named uniquely, the combination is straightforward and easy to understand.

Having created the join result in Listing 5, the records can be packed into the nested form according to the rules in [Report].
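As an illustration of that packing, here is a minimal Python sketch of my own (not code from [Report]), which takes the flat rows produced by the company_w join of Listing 5, represented as dicts, and groups them back into the nested shape of Listing 4. The dict-based record layout and the field names carried over from the flat tables are assumptions for the example.

```python
def pack_company(rows):
    """Pack flat rows of one company from the company_w join into one
    nested record, grouping and deduplicating along both dimensions."""
    company = {"name": rows[0]["company_name"], "pieces": [], "actors": []}
    pieces = {}  # piece_id -> nested piece, groups dimension 1
    actors = {}  # actor name -> nested actor, groups dimension 2
    for row in rows:
        if row.get("piece_id") is not None:  # dimension 1
            piece = pieces.get(row["piece_id"])
            if piece is None:
                piece = {"name": row["piece_name"],
                         "start_date": row["start_date"],
                         "end_date": row["end_date"],
                         "actor_name": []}
                pieces[row["piece_id"]] = piece
                company["pieces"].append(piece)
            if row.get("p_actor_name") is not None:  # nesting in dimension 1
                if row["p_actor_name"] not in piece["actor_name"]:
                    piece["actor_name"].append(row["p_actor_name"])
        if row.get("actor_name") is not None:  # dimension 2
            if row["actor_name"] not in actors:
                actor = {"name": row["actor_name"], "bio": row["actor_bio"]}
                actors[row["actor_name"]] = actor
                company["actors"].append(actor)
    return company
```

Note that the deduplication by key is what undoes the redundancy of the 1:M cross-product in the flat join result: a row repeating an already-seen piece or actor only fills in whatever the other dimension contributes.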

Since the join was done by two dimensions (one going towards actors, the other towards pieces and piece_actors), the packing has to be done along two dimensions as well. [Report] discusses only the one-dimensional packing, but the same principle can be easily applied to pack multiple dimensions.

The [Report] cannot tackle more than one dimension because in this situation it can’t decide on the grouping of the fields.

But looking at the nested data as join results, the dimensionality of the nested data directly translates to the dimensionality of a join, and the other way around, which is a known relational feature. Thus the join provides the guidance for how to do the grouping. Obviously, the different subsets of fields have to be grouped separately, according to the join condition that brought them in from their flat record. Figure 1 shows the diagram of the grouping.

Fields of company
+-- Fields of pieces (dimension 1)
|     +-- Fields of piece_actor
+-- Fields of actors (dimension 2)
Figure 1. A nested record as constructed from the flat records by a join.

Note that the purely relational results of the join would have been hugely redundant and next to impossible to process because both dimensions are 1:M joins, and even more so because the dimension by pieces goes two levels deep. But having been packed into a nested data structure, they are compact and easily manageable. The nested records provide a convenient and compact representation for the results of the joins of this kind.

The particular branches of the nested structures can be chosen for traversing if the query makes reference only to these branches. For example, for the query shown in Listing 6, there is obviously no need to traverse down the dimension by pieces (shown in Figure 1 as “dimension 1”), only the dimension by actors (“dimension 2”) is needed.

select company_name, actor_name, actor_bio
from company_w;

Listing 6. An example of a query from the joined view.

In terms of the nested data structures from Listing 4, the same query from Listing 6 becomes as shown in Listing 7. Or the whole packed sub-record on actors can be selected by the even simpler query shown in Listing 8.

select name,,
from company;

Listing 7. An example of a query from a nested data structure.

select name, actors
from company;
Listing 8. An example of a query selecting the whole sub-record from a nested data structure.

Optimization of other joins with the nested data

What if we want to select the bios of all the actors in a particular piece, using the cross-reference in piece_actors? Then, of course, the original join represented in the nested structure is not of much use. Another join would be needed, like the one shown in Listing 9.

select actors.actor_name, actors.actor_bio from
  company
  left outer join pieces
    on company.company_id = pieces.company_id
  left outer join piece_actors
    on pieces.company_id = piece_actors.company_id
    and pieces.piece_id = piece_actors.piece_id
  left outer join actors
    on piece_actors.company_id = actors.company_id
    and piece_actors.p_actor_name = actors.actor_name
where
  company.company_name = 'COMPANY1'
  and pieces.piece_name = 'PIECE1'
;
Listing 9. Join using the cross-reference.

The nested structure by itself is not of much help here. An index would be needed to join piece_actors and actors efficiently. But it can still provide a performance benefit. To understand it, let's look again at [Dremel].

Dremel can perform the large queries fast by partitioning the data over a great many machines. It can do this because each nested record is independent, and thus a very fine-grained partitioning is easily possible. Even an aggregation can be similarly partitioned. However, Dremel has a complication with the joins. It can easily join a large table to a small table by partitioning the large table and copying the small table to accompany each partition to its processing machine. But it just plain can't join two big tables, because of the combinatorial complexity.

However, in many cases an optimization is possible. Suppose that we have the information about a huge number of theatrical companies in a table with the nested records, and this table is big. Since the nesting cannot be used directly, the selection of actor biographies by piece would require the self-join shown in Listing 10. It's a join of a large table to another large table, so Dremel could not handle it.

select c2.actors from
  company c1
  join company c2
    on =
    and c1.piece.actor_name =
where = 'COMPANY1'
  and = 'PIECE1'
;
Listing 10. Join on nested records using the cross-reference.

But now let's look again at the nested records as representations of the results of a join in Listing 5. Note that the original join company_w already guarantees that the information about actors in a piece and the information about actors both belong to the same company. So when we look for the actors, we don't need to search the whole table. We need to look for them only in the same nested record. This brings back the locality and easy partitioning. The query can be written in the pseudo-code form shown in Listing 11. The “scope self join” there means that the table company is not joined with itself as a whole; instead, each nested record of the table company is joined with the same nested record.

select c2.actors from
  company scope self join with aliases (c1, c2)
    on c1.piece.actor_name =
where = 'COMPANY1'
  and = 'PIECE1'
;
Listing 11. Join on nested records using the cross-reference and the data locality.

It can be made even more efficient by using indexes. Dremel is a columnar DBMS, so it keeps the indexes global and can't help this situation much. But a normal record-based DBMS can keep a separate index in each nested record and use it for these self-joins.

When the nested records are not too large, this record-local index doesn’t have to be kept on disk, instead it can be built in memory when the record is loaded into memory: the actors dimension can be iterated to build the index in memory, and then the piece_actors dimension can be iterated to build the join using that index. Of course, if the records are very small then the indexes would not provide any advantage.
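For illustration, here is a minimal Python sketch of that record-local, in-memory join; the dict layout mirroring Listing 4 and the function name are assumptions of mine:

```python
def actors_of_piece(company, piece_name):
    """Resolve the actor_name cross-references of one piece to the actor
    records, using an index local to this single nested company record."""
    # Build the record-local index by iterating dimension 2 (actors).
    by_name = {actor["name"]: actor for actor in company["actors"]}
    # Iterate dimension 1 (pieces) and join through the local index.
    for piece in company["pieces"]:
        if piece["name"] == piece_name:
            return [by_name[n] for n in piece["actor_name"] if n in by_name]
    return []
```

Since the index lives only as long as the record is in memory, nothing extra needs to be stored on disk, and the cost of building it is proportional to the size of one nested record, not of the whole table.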

Even without the massive partitioning, the self-joins within the same nested record remove the need to search through the whole table and thus greatly improve the locality of access. The self-joins could also be performed within a sub-hierarchy of a nested record.

Optimization of the aggregations with the nested records

The aggregations can also often be conveniently limited to a nested record or its sub-hierarchy. This is equivalent to the key of that record or hierarchy being used as the aggregation key or its part. And it allows an extra optimization, since all the data is already grouped locally.

This condition guarantees that the nested record or its sub-hierarchy translates to one or more aggregation groups, and no other record contributes to these groups. So there is no need to keep the group state through the whole query: the group can be aggregated and its result produced by iterating through one nested record, after which the state of that group can be purged.
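A minimal Python sketch of this streaming style of aggregation, where the group state lives only for the duration of one nested record (counting the actors per company is my assumed example aggregate):

```python
def count_actors(records):
    """Stream (company_name, actor_count) pairs from a table of nested
    records, one record at a time."""
    for company in records:
        # The aggregation key starts with this record's own key, so the
        # whole group is inside this record: aggregate, emit, and let the
        # state be discarded before moving to the next record.
        count = len(company["actors"])
        yield (company["name"], count)
```

Contrast this with a general GROUP BY over flat rows, where every group's partial state must be held until the last input row has been seen.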

Related technologies

MS SQL already contains the APPLY clause, which uses a similar but only partial approach, and serves as a validation of the usefulness of the concept. The results of an APPLY can potentially be fed directly into the NEST operator as described in [Nested].

However, APPLY has a limitation, very much the same as the limitation of the nested model described in [Nested]: a SELECT with APPLY produces the flat relational records, and thus can efficiently handle only one APPLY per SELECT (unless all the APPLYs but one are 1:1). Having multiple APPLY clauses with 1:M relations would cause a combinatorial explosion of every possible combination. In some cases this might be the desired result, but probably not typically.

On the other hand, a nested record with multiple dimensions can represent this kind of result set in a compact and efficient form. This compact result can then be either transmitted out or used for the following operations. For example, when aiming for massively parallel processing, the first step can group the relevant data into nested records that contain all the relevant information, and the second step can farm out these records to thousands of machines that process them without the need for external references.

The importance of nested records for the conventional RDBMS

To reiterate, this is not a proposal for a way of doing ORM that translates from objects to flat tables, and it is not in any way related to ORM. It's not about the XPath/XSLT kind of querying either. It's a proposal for a way to extend the relational principles to the nested data structures, and thus make the nested data structures first-class citizens in an RDBMS.

As described in the previous section, a nested record with multiple dimensions can represent the result set of a multi-dimensional join in a compact and efficient form.

To see why the grouping is important for the massively parallel processing, it's useful to look again at Dremel's limitation in processing the joins: it can join two tables only if one of them is small. The reason is that a query on a large table without a join can be parallelized easily: if you have 1000 machines, you can split the source table into 1000 partitions and have each machine work on its partition in parallel, then combine the results. But if this table gets joined to another table, each of these 1000 machines must have access to the whole second table. That's easy if the second table is small (the access to it from 1000 machines can then be efficiently cached), but if the second table is big then the access becomes much more expensive. However, if the data happens to have some special locality properties, where the mutually relevant data from both tables can be combined into a single message, then these messages can again be efficiently spread out to 1000 machines.
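To make the partitioning argument concrete, here is a toy Python sketch (the per-company counting query and the function names are my assumptions). Because each nested record is self-contained, a partition can be processed with no access outside itself; on a cluster, each partition would go to its own machine instead of being processed sequentially:

```python
def partition(records, n):
    """Split a table of nested records into n roughly equal partitions."""
    return [records[i::n] for i in range(n)]

def process_partition(chunk):
    # Each nested record carries everything the query needs, so no
    # cross-partition lookups are required, unlike a join against a
    # second large table.
    return sum(len(company["actors"]) for company in chunk)

def parallel_count(records, n):
    # Stand-in for fanning the partitions out to n machines and
    # combining their partial results.
    return sum(process_partition(p) for p in partition(records, n))
```

The combining step here is a plain sum; for other aggregates it would be whatever merge operation the aggregate supports.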

Thus treating the nested records as first-class citizens in the relational databases can make this kind of operation easier, letting the relational databases include the advantages of the hierarchical databases. "First-class" means that not only would the tables be able to contain the nested records, but the SELECTs would also be able to produce the nested records as results.

It would preserve all the relational logic unchanged, since on the logical level the nested records would be expanded into the flat classic form, the operations performed, and the results packed back into the nested form. Of course, the actual execution would avoid doing the expensive intermediate expansion directly; instead, it can be optimized out, producing the same result in a more efficient way.

Adapting a conventional RDBMS for nested records

As shown above, taking advantage of the nested records in a common record-oriented RDBMS looks fairly straightforward. The following summarizes the changes:

  1. Support for the nested record format. It must be supported not only in the input and tables but also in the output, in the query results and in the intermediate values. This includes the creation of the nested record types on the fly.
  2. Support for the record-local indexes, to be stored along with the records and to be used in the in-record self-joins.
  3. The full-table indexes should probably locate the data with the precision of a nested record, with the further search done by the in-record index.
  4. Syntactic support in the query language for the scope-limited self joins (in a nested record, or in its sub-node).
  5. The scope-limited joins don't have to be self-joins with the same nested record on both sides. In some cases it might be useful to find the second nested record and then perform the cross-product of the fields A from record 1 and fields B from the record 2.
  6. Each node of a nested record can be seen as one side of a join in the formation of the nested-record-as-a-join-result. Being able to refer to all the sub-nodes growing from a node would be highly useful when writing a join. Thus a pseudo-field similar to rowid should be generated for each node, making it a unique key of this node.

From the data storage standpoint, there is probably not much use in trying to modify the nested records in place. In reality, the nested records tend to be of a reasonable size, so when a nested record is updated, the whole old record can be replaced with a new one. The same approach can be applied to the record locking, locking the whole nested record. If the nested records grow over a certain limit, they can be stored in fragments, by sub-trees.

Another possible approach is the automatic disassembly of the nested records into their flat “underlying tables” by reversing the join. Then they can be restored back by repeating the join.

Yet another approach may create the nested records automatically from the flat tables by joining them and storing the results as nested. This operation is similar to the creation of the materialized views, only with a different storage format. This can improve the efficiency of the data access in certain cases. The original tables may then be kept or become virtual, since their data can always be extracted back from the nested format.


The relational database systems can take great advantage of the nested records by treating them as join results, as well as by producing the results of the common joins in the nested format, and by “un-joining” the nested records in querying. This support should be reasonably easy to add to an existing RDBMS with reasonably small extensions.


[Dremel] Dremel: Interactive Analysis of Web-Scale Datasets
[NoGood]  The Nested Relational Data Model is Not a Good Idea
[Report] A Recursive Algebra for Nested Relations