Saturday, December 20, 2014

nested record and ORM

Some more thoughts on the subject. First of all, why should anyone care about the nested records?

Nowadays people are used to the relational model. But next to all of this sits Google with its protobufs. It has published the information about protobufs, and about some of its systems that use protobufs but I think there is not enough appreciation of how much the protobufs mean for Google. If you look at the published information, you'll realize that everything at Google, totally everything, is based on protobufs. They're used as the arguments of the RPC calls and as the returned values, the map-reduce is built on them, as well as all kinds of the no-SQL DBMS and a few of SQL DBMS. And there are good reasons for why they're used so much. Well, for the no-SQL databases it's a no-brainer: if all it can do is store the key-value pairs, at least the values must be capable of representing complexity. To understand the other reasons, let's look at the difference between the nested records and the Object Relation Managers.

Entirely coincidentally, I've encountered recently a reference to the article at http://blogs.tedneward.com/2006/06/26/The%2BVietnam%2BOf%2BComputer%2BScience.aspx, on the subject of ORM. It's an interesting viewpoint though the conclusions don't seem quite right to me, as well as a few of the pre-suppositions.

The idea that the object model has the concept of identity and the relational model doesn't, doesn't seem right. In reality the primary keys are used all over the real databases. And the primary keys are exactly that, the identity. Even more so if the primary key is not a part of data but an auto-generated unique value. Here we can go into the discussions about "but it's just a hack, not a part of the theory" but think about this: these auto-generated unique keys are the bread and butter of the data normalization, which certainly is the part of the theory. The point of normalization is to get rid of the duplicate data in the rows and move it to a separate table where it will exist as a single copy, with multiple short references to it from the original table. But so are the references between the objects, multiple objects may refer to the same subordinate object. And the reference to that object's address (its identity) is just the same thing as the references to that unique key in the subordinate table.

So, what ORM does is map one-to-one the objects and records, and maps the references between the objects by address to these references between the records by the unique id values. The joins are then the way to navigate the references between the objects.

The nested records in protobufs (or their equivalents) have a different meaning and are used differently. The whole point of a nested record is that it represents a self-contained unit of data. This data is all stored together because it's normally used together. It's obvious to see when the protobufs are used for RPC but it works the same when the protobufs are used in a database. Obviously, not everything in the world can be covered by this usage and obviously there are downsides. An obvious downside is that a nested record  in a protobuf can't be modified piecemeal: if you need to change it, you have to build a whole new record and put it back in place. And obviously the data is stored by value, not by reference, so the duplicate data has to be duplicated. But for the situations where the self-containment is real important, they work great. As in all the massively-parallel processing at Google. Each nested record has enough information in it to be processed separately from the others. Again, not everything in the world can be expressed like this, but what can, benefits tremendously from the nested records.

Now I think I can formulate, what is the difference between the ORM and the nested records: the ORMs connect the objects by reference while the nested records bunch up the data by value. Different uses, different trade-offs, different implementations. (Though for the in-memory implementations the boundary is more blurred).

And I think I've found the explanation to why they say that the Oracle's nested records are slower than the manual joins, even though they amount to joins, the the ORM-like way. Well, or course it could just be a limitation of the query optimizer: if the optimizer runs first and only then the nesting is expanded, it obviously would not be as optimal. But that should be easy to fix, just expand the nesting and then run the optimizer, so this is unlikely to be the problem. A more fundamental problem seems to be that the records are sent into the database with the data by value, and then they get broken up into the flat tables and stored with connections by reference. It successfully combines the data duplication of by-value and the cost of the extra joins from by-reference. Both kinds of shortcomings rolled together.

Coming back to that point that not everything in the world can be represented as the nested records in protobufs. This means that the references between the tables (or objects, if you prefer to think about them that way) are still needed, and the nested records still need to use them. Which means fitting them into the relational model. And that's where my whitepaper comes in: how to represent the nested records in the relational model. They can be logically dis-joined into the "underlying tables" and the relational operations can be applied to them in the usual way.

Which can also solve problem if the representation of the object inheritance and polymorphism in ORMs, described in the article I listed above. Well, representing the polymorphic data as nested records is fairly straightforward as long as the records contain the indication of subtypes.  And as I've shown in the whitepaper, mapping the nested records to the relational use is also straightforward.  Put them together and you have the solution.

No comments:

Post a Comment