Friday, December 12, 2014

Relational Operations on the Nested Data

I wrote this little whitepaper that I don't know where to put, so I'll just put it here. It gives some interesting ideas for the future development of Triceps. The PDF version can also be found at http://web.newsguy.com/sab123/papers/rel_nested.pdf

It might also help to first read my explanation of why the nested data is important and how it differs from the Object-Relational Mappers at http://babkin-cep.blogspot.com/2014/12/nested-record-and-orm.html.


Relational Operations on the Nested Data

Sergey A. Babkin

Abstract

This paper shows a way to integrate the RDBMS and the nested records to the mutual benefit. The nested records dovetail with the relational model, covering the weaknesses of the latter, such as the representation of the multi-dimensional 1:M joins, and allowing an optimization of other kinds of joins.

Summary

The use of the nested data structures seems to be growing in popularity for the distributed data processing: Google has Protobufs, Hadoop and Bing have their own formats, XML has been around for a long time, and so on. Among other things, they are used as records in the non-relational and kind-of-relational DBMS. While at Google, I've used the Dremel system that implements a read-only columnar database over the tables of nested records in the protobuf format, and while it was convenient in some ways, in other ways the operations in it have felt unnecessarily difficult. To be honest, by now I can't even formulate why exactly it felt so: it's been a while since I've used Dremel, and I've forgotten the details; I have to guess from the logical considerations in the published description in [Dremel] (http://research.google.com/pubs/pub37217.html). The theory behind it is described very nicely in [Nested] (http://peqnp.blogspot.com/2013/01/nested-relational-algebra.html) and in the documents it refers to. It was surprising to find out that their model is based on a paper from 1988 [Report] (http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR259), and apparently nothing much has happened since then. Everything I've found on the subject seems to be based on [Report].

In this paper I want to show that:
1. There is more to the nested data structures than described in [Report].
2. There is a straightforward way to map the nested data to the relational operations and take advantage of its locality.
3. A common record-based (as opposed to columnar, such as Dremel) RDBMS can be adapted to take advantage of the nested records with a small set of changes.

To forestall a typical question, this is not a proposal for a way of doing ORM that translates from objects to flat tables, and is not in any way related to ORM. It is about the adoption of the nested records into the relational theory.

Nested data is multi-dimensional

The [Report] deals with the nesting, but all the nesting discussed there is one-dimensional. The schema of an example from there can be written out in a pseudo-code similar to Google protobufs, as shown in Listing 1. Even though the nesting goes two levels deep, it goes in only one direction, growing in one dimension.

struct client {
  string name;
  string address;
  repeated struct company {
    string name;
    repeated struct share {
      float price;
      string date;
      int quantity;
    } shares;
  } companies;
};
Listing 1. One-dimensional nested data structure.

But the real-world nested records tend to have multiple repeated elements. For example, consider a structure describing a theatrical company. It might list the pieces produced by this company in the current season and the actors in the company, as shown in Listing 2. This structure has two separate repeating fields going in parallel at the same level, expanding in two dimensions.

struct company {
  string name;
  repeated struct piece { // dimension 1
    string name;
    string start_date;
    string end_date;
  } pieces;
  repeated struct actor { // dimension 2
    string name;
    string bio;
  } actors;
};
Listing 2. Two-dimensional nested data structure.

For another example, the [Dremel] paper describes a structure shown in Listing 3.

struct Document {
  int64 DocId;
  struct {
    repeated int64 Backward; // dimension 1
    repeated int64 Forward; // dimension 2
  } Links;
  repeated struct { // dimension 3
    repeated struct Language {
      string Code;
      string Country;
    };
    string Url;
  } Names;
};
Listing 3. Three-dimensional nested data structure from [Dremel] paper.

This example has four repeated elements in three dimensions, one of the dimensions nesting two-deep.

Or there might be further cross-references in the nested data. For example, the theatrical company description may have a list of actors for each piece, as shown in Listing 4. Rather than repeating the actor's biography in each piece, it can be found by name, descending down the other dimension. Note that the field piece.actor_name is also repeated, creating a nesting two levels deep.

struct company {
  string name;
  repeated struct piece { // dimension 1
    string name;
    string start_date;
    string end_date;
    repeated string actor_name; // cross-reference to dimension 2
  } pieces;
  repeated struct actor { // dimension 2
    string name;
    string bio;
  } actors;
};
Listing 4. Two-dimensional nested data with nested cross-references.

The [Report] doesn't provide a solution for the multi-dimensional data. There even are papers like [NoGood] (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.2966&rep=rep1&type=pdf) that argue that the nested structures are not a good idea to start with, citing in particular the cross-references like the ones shown above between the pieces and the actors. But with a somewhat different approach, the nested structures are quite useful.


Nested data is the result of a Join


People argue everywhere, including Wikipedia, that the nested data saves the need for joins. And it looks obvious, yet I haven't seen this fact stated before in another form, looking from a different standpoint:

A nested structure can be thought of as the result of a join.

The abovementioned theatrical company data from Listing 4 can be represented in the pure relational way as multiple joined tables, written in an SQL-like pseudo-code in Listing 5.


create table company (
  company_id integer,
  company_name varchar
);

create table pieces (
  company_id integer,
  piece_id integer,
  piece_name varchar,
  start_date varchar,
  end_date varchar
);

create table piece_actors (
  company_id integer,
  piece_id integer,
  p_actor_name varchar
);

create table actors (
  company_id integer,
  actor_name varchar,
  actor_bio varchar
);

create view company_w as
select * from
  company
  left outer join pieces // dimension 1
    on company.company_id = pieces.company_id
  left outer join piece_actors // nesting in dimension 1
    on pieces.company_id = piece_actors.company_id
    and pieces.piece_id = piece_actors.piece_id
  left outer join actors // dimension 2
    on company.company_id = actors.company_id
;
Listing 5. Construction of the nested records by joining the flat tables.

This glosses over the naming of the fields in the flat “select *”, but if the fields used in the join conditions are named the same in each use, and the rest of the fields are named uniquely, the combination is straightforward and easy to understand.

Having created the join result in Listing 5, the records can be packed into the nested form according to the rules in [Report].

Since the join was done along two dimensions (one going towards actors, the other towards pieces and piece_actors), the packing has to be done along two dimensions as well. [Report] discusses only the one-dimensional packing. However, the same principle can be easily applied to pack multiple dimensions.

The [Report] cannot tackle more than one dimension because in this situation it can’t decide on the grouping of the fields.

But looking at the nested data as the join results, the dimensionality of the nested data directly translates to the dimensionality of a join, and the other way around (the multi-dimensional joins being a known relational feature). Thus the join provides the guidance on how to do the grouping. And obviously the different subsets of fields have to be grouped separately, according to the join condition that brought them in from their flat record. Figure 1 shows the diagram of the grouping.

Fields of company
+-- Fields of pieces (dimension 1)
|     +-- Fields of piece_actors
+-- Fields of actors (dimension 2)
Figure 1. A nested record as constructed from the flat records by a join.
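
To make the packing by multiple dimensions concrete, here is a minimal C++ sketch of it (my own illustration; the structs and names are made up, and the piece_actors branch is omitted for brevity): the flat rows of company_w are grouped by the company key, and each dimension is de-duplicated by its own key within its group.

#include <map>
#include <set>
#include <string>
#include <vector>

// One flat row of the join company_w, reduced to the fields used here.
struct FlatRow {
    int company_id;
    std::string company_name;
    int piece_id; // -1 when the pieces branch didn't match
    std::string piece_name;
    std::string actor_name; // empty when the actors branch didn't match
};

// The nested record being built, as in Listing 2.
struct Piece { int id; std::string name; };
struct Actor { std::string name; };
struct Company {
    std::string name;
    std::vector<Piece> pieces; // dimension 1
    std::vector<Actor> actors; // dimension 2
};

// Group the flat rows by the company key, then de-duplicate each
// dimension by its own key within the group.
std::vector<Company> pack(const std::vector<FlatRow> &rows)
{
    std::map<int, Company> grouped;
    std::map<int, std::set<int>> seenPieces;
    std::map<int, std::set<std::string>> seenActors;
    for (const FlatRow &r : rows) {
        Company &c = grouped[r.company_id];
        c.name = r.company_name;
        if (r.piece_id >= 0 && seenPieces[r.company_id].insert(r.piece_id).second)
            c.pieces.push_back(Piece{r.piece_id, r.piece_name});
        if (!r.actor_name.empty() && seenActors[r.company_id].insert(r.actor_name).second)
            c.actors.push_back(Actor{r.actor_name});
    }
    std::vector<Company> result;
    for (auto &kv : grouped)
        result.push_back(kv.second);
    return result;
}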

Note that the purely relational results of the join would have been hugely redundant and next to impossible to process because both dimensions are 1:M joins, and even more so because the dimension by pieces goes two levels deep. But having been packed into a nested data structure, they are compact and easily manageable. The nested records provide a convenient and compact representation for the results of the joins of this kind.

The particular branches of the nested structure can be chosen for traversal if the query references only those branches. For example, for the query shown in Listing 6, there is obviously no need to traverse down the dimension by pieces (shown in Figure 1 as “dimension 1”); only the dimension by actors (“dimension 2”) is needed.


select company_name, actor_name, actor_bio
from company_w;

Listing 6. An example of a query from the joined view.

In terms of the nested data structures from Listing 4, the same query from Listing 6 becomes as shown in Listing 7. Or the whole packed sub-record on actors can be selected by an even simpler query, shown in Listing 8.

select name, actors.name, actors.bio
from company;

Listing 7. An example of a query from a nested data structure.

select name, actors
from company;
Listing 8. An example of a query selecting the whole sub-record from a nested data structure.


Optimization of other joins with the nested data

What if we want to select the bios of all the actors in a particular piece, using the cross-reference in piece_actors? Then of course the original join represented in the nested structure is not of much use. Another join would be needed, like the one shown in Listing 9.

select actors.actor_name, actors.actor_bio from
  company
  left outer join pieces
    on company.company_id = pieces.company_id
  left outer join piece_actors
    on pieces.company_id = piece_actors.company_id
    and pieces.piece_id = piece_actors.piece_id
  left outer join actors
    on piece_actors.company_id = actors.company_id
    and piece_actors.p_actor_name = actors.actor_name
where
  company.company_name = COMPANY1
  and pieces.piece_name = PIECE1
;

Listing 9. Join using the cross-reference.

The nested structure by itself is not of much help here. An index would be needed to join piece_actors and actors efficiently. But it can still provide a performance benefit. To understand it, let's look again at [Dremel].

Dremel can perform the large queries fast by partitioning the data across a great many machines. It can do this because each nested record is independent, and thus a very fine-grained partitioning is easily possible. Even an aggregation can be similarly partitioned. However, Dremel has a complication with the joins. Dremel can easily join a large table to a small table by partitioning the large table and copying the small table to accompany each partition to its processing machine. But it just plain can't join two big tables, because of the combinatorial complexity.

However, in many cases an optimization is possible. Suppose that we have the information about a huge number of theatrical companies in a table of nested records, and this table is big. Since the nesting cannot be used directly, the selection of actor biographies by piece would require the self-join shown in Listing 10. It's a join of a large table to another large table, so Dremel could not handle it.


select c2.actors from
  company c1
  join company c2
    on c1.name = c2.name
    and c1.pieces.actor_name = c2.actors.name
where
  c1.name = COMPANY1
  and c1.pieces.name = PIECE1
;

Listing 10. Join on nested records using the cross-reference.

But now let's look again at the nested records as representations of the results of the join in Listing 5. Note that the original join company_w already guarantees that the information about the actors in a piece and the information about the actors themselves belong to the same company. So when we look for the actors, we don't need to search the whole table, only the same nested record. This brings back the locality and easy partitioning. The query can be written in the pseudo-code form shown in Listing 11. The “scope self join” there means that it's not the whole table company that gets joined with itself; rather, each nested record of the table company is joined with itself.

select c2.actors from
  company scope self join with aliases (c1, c2)
    on c1.pieces.actor_name = c2.actors.name
where
  c1.name = COMPANY1
  and c1.pieces.name = PIECE1
;

Listing 11. Join on nested records using the cross-reference and the data locality.

It can even be made more efficient by using indexes. Dremel is a columnar DBMS, so it keeps the indexes global and can't help this situation much. But a normal record-based DBMS can keep a separate index in each nested record and use it for these self-joins.

When the nested records are not too large, this record-local index doesn't have to be kept on disk; instead it can be built in memory when the record is loaded: the actors dimension can be iterated to build the index in memory, and then the piece_actors dimension can be iterated to perform the join using that index. Of course, if the records are very small then the indexes would not provide any advantage.
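
Here is a minimal C++ sketch of such an in-record join (my own illustration, not an actual Triceps or Dremel API; the structs follow Listing 4): one pass over the actors dimension builds a hash index local to the record, and a pass over the pieces dimension probes it.

#include <string>
#include <unordered_map>
#include <vector>

// The nested company record from Listing 4, reduced to the needed fields.
struct Actor { std::string name; std::string bio; };
struct Piece { std::string name; std::vector<std::string> actor_names; };
struct Company {
    std::string name;
    std::vector<Piece> pieces; // dimension 1
    std::vector<Actor> actors; // dimension 2
};

// Find the actors of one piece, joining entirely within a single
// nested record, as in Listing 11.
std::vector<const Actor *> actorsOfPiece(const Company &c, const std::string &piece)
{
    // Pass 1: iterate the actors dimension, building the record-local index.
    std::unordered_map<std::string, const Actor *> index;
    for (const Actor &a : c.actors)
        index[a.name] = &a;

    // Pass 2: iterate the pieces dimension, probing the index.
    std::vector<const Actor *> result;
    for (const Piece &p : c.pieces) {
        if (p.name != piece)
            continue;
        for (const std::string &an : p.actor_names) {
            auto it = index.find(an);
            if (it != index.end()) // the cross-reference resolves locally
                result.push_back(it->second);
        }
    }
    return result;
}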

Even without the massive partitioning, the self-joins within the same nested record remove the need to search through the whole table and thus greatly improve the locality of access. The self-joins could also be performed within a sub-hierarchy of a nested record.

Optimization of the aggregations with the nested records

The aggregations can also often be conveniently limited to a nested record or its sub-hierarchy. It's equivalent to the key of that record or hierarchy being used as the aggregation key or its part. But it allows an extra optimization, since all the data is already grouped locally.

This condition guarantees that the nested record or its sub-hierarchy translates to one or more aggregation groups, and no other record contributes to these groups. So there is no need to keep the group state through the whole query: a group can be aggregated and its result produced by iterating through one nested record, after which the state of that group can be purged.
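
For instance, an aggregation keyed by company, such as counting each company's actors that have a biography, can be sketched in C++ like this (with made-up structs again); note that the group state is a single local variable that dies with the record:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Actor { std::string name; std::string bio; };
struct Company { std::string name; std::vector<Actor> actors; };

// Aggregation keyed by company: each group is confined to one nested
// record, so the group state lives only while that record is scanned.
void countBiosPerCompany(const std::vector<Company> &table)
{
    for (const Company &c : table) {
        size_t withBio = 0; // the entire group state
        for (const Actor &a : c.actors)
            if (!a.bio.empty())
                ++withBio;
        std::cout << c.name << ": " << withBio << "\n";
        // the group state goes out of scope here, nothing is carried
        // through the rest of the query
    }
}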

Related technologies

MS SQL already contains the APPLY clause that uses a similar but only partial approach, which serves as a validation of the usefulness of the concept. The results of an APPLY could potentially be fed directly into the NEST operator as described in [Nested].

However, APPLY has a limitation, very much the same as the limitation of the nested model described in [Nested]: a SELECT with APPLY produces the flat relational records, and thus can efficiently handle only one APPLY per SELECT (unless all the APPLYs but one are 1:1). Having multiple APPLY clauses with a 1:M relation would cause a combinatorial explosion, producing every possible combination. In some cases this might be the desired result, but probably not typically.

On the other hand, a nested record with multiple dimensions can represent this kind of result set in a compact and efficient form. And then this compact result can be either transmitted out or used for some following operations. For example, if aiming for a massively parallel processing, the first step can group the relevant data into the nested records that contain all the relevant information, and the second step can farm out these records to thousands of machines processing them without the need for the external references.

The importance of nested records for the conventional RDBMS

To reiterate, this is not a proposal for a way of doing ORM that translates from objects to flat tables, and is not in any way related to ORM. It's not about the XPath/XSLT kind of querying either. It's a proposal of a way to extend the relational principles to the nested data structures, and thus make the nested data structures first-class citizens in an RDBMS.

As described in the previous section, a nested record with multiple dimensions can represent the result set of a multi-dimensional join in a compact and efficient form.

To see why the grouping is important for the massively parallel processing, it's useful to look again at Dremel's limitation in processing the joins: it can join two tables only if one of them is small. The reason is that handling a query on a large table without a join can be parallelized easily: if you have 1000 machines, you can split the source table into 1000 partitions and have each machine work on its partition in parallel, then combine the results. But if this table gets joined to another table, each of these 1000 machines must have access to the whole second table. That's easy if the second table is small (the access to it from 1000 machines can then be efficiently cached), but if the second table is big then the access becomes much more expensive. However, if the data happens to have some special locality properties, where the mutually relevant data from both tables can be combined into a single message, then these messages can again be efficiently spread out to 1000 machines.

Thus treating the nested records as first-class citizens in the relational databases can make this kind of operation easier, letting the relational databases include the advantages of the hierarchical databases. "First-class" means that not only would the tables be able to contain the nested records but the SELECTs should be able to produce the nested records as results.

It would preserve all the relational logic unchanged, since on the logical level the nested records would be expanded into the flat classic form, the operations performed, and the results packed back into the nested form. Of course, the actual execution would avoid performing the expensive intermediate steps of expansion directly; instead they can be optimized out, producing the same result in a more efficient way.

Adapting a conventional RDBMS for nested records

As shown above, taking advantage of the nested records in a common record-oriented RDBMS looks fairly straightforward. The following summarizes the changes:

  1. Support for the nested record format. It must be supported not only in the input and tables but also in the output, in the query results and in the intermediate values. This includes the creation of the nested record types on the fly.
  2. Support for the record-local indexes, to be stored along with the records and to be used in the in-record self-joins.
  3. The full-table indexes should probably locate the data with the precision of a nested record, with the further search done by the in-record index.
  4. Syntactic support in the query language for the scope-limited self joins (in a nested record, or in its sub-node).
  5. The scope-limited joins don't have to be self-joins with the same nested record on both sides. In some cases it might be useful to find the second nested record and then perform the cross-product of the fields A from record 1 and fields B from the record 2.
  6. Each node of a nested record can be seen as one side of a join in the formation of the nested-record-as-a-join-result. Being able to refer to all the sub-nodes growing from a node would be very handy when writing a join. Thus a pseudo-field similar to rowid should be generated for each node, making it a unique key of that node.

From the data storage standpoint, there is probably not much use in trying to modify the nested records in place. In reality the nested records tend to be of a reasonable size, so when a nested record is updated, the whole old record can be replaced with a new one. The same approach can be applied to the record locking, locking the whole nested record. If the nested records grow over a certain limit, they can be stored in fragments, by sub-trees.

Another possible approach is the automatic disassembly of the nested records into their flat “underlying tables” by reversing the join. Then they can be restored back by repeating the join.

Yet another approach may create the nested records automatically from the flat tables by joining them and storing the results as nested. This operation is similar to the creation of the materialized views, only with a different storage format. This can improve the efficiency of the data access in certain cases. The original tables may then be kept or become virtual, since their data can always be extracted back from the nested format.

Conclusion

The relational database systems can take great advantage of the nested records by treating them as join results, as well as by producing the results of the common joins in the nested format, and by “un-joining” the nested records in querying. This support should be reasonably easy to add to the existing RDBMS with reasonably small extensions.

References

[Dremel] http://research.google.com/pubs/pub37217.html Dremel: Interactive Analysis of Web-Scale Datasets
[Nested] http://peqnp.blogspot.com/2013/01/nested-relational-algebra.html Nested Relational Algebra
[NoGood] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.2966&rep=rep1&type=pdf The Nested Relational Data Model is Not a Good Idea
[Report] http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR259 A Recursive Algebra for Nested Relations


Thursday, October 30, 2014

another blog

I've been pretty busy recently with my daily job, so not much was happening in Triceps. In the meantime, I've started a blog for the work-related things:

http://blogs.msdn.com/b/sergey_babkins_blog/


Thursday, September 11, 2014

docs on Kindle

Wow, it turns out about 5 people have spent a dollar for the convenience of getting the Triceps manual on their Kindle. When I finished the manual for version 2, I didn't bother to export it into the Kindle format because it looked like nobody got it from there anyway. I will now.

Tuesday, July 29, 2014

compact vs detailed

A few days ago I was editing the Developer's Guide chapter on TQL when an idea occurred to me:

Perhaps what makes the typical CEP models so hard to read is that they try to represent both the topology and all the details in one go. Yes, I'm a fan of reading the program text sequentially, and I've been saying that the problem with SQL is that it's written backwards: each statement specifies the results first and then the computation; if the results went after the computation, the whole flow would be much easier to read sequentially. But it still doesn't solve the problem of branching, and maybe doesn't solve the fundamental difficulty at all.

Maybe the right way to write the CEP programs is to write out the pipelines in a short notation, and then define the elements of these pipelines in all detail separately. So the main program might look something like:

inputTrades | checkTrades -> storeTrades
inputSymbols -> storeSymbols
(storeTrades, storeSymbols) | enrichTrades | outputData

and then each of these stages can be defined in detail separately. No arguments, no details at all in the pipeline specification. All the details would go into the detailed definitions. So each module would consist of these two parts: the pipeline outline and the detailed definitions of the pipeline elements. And then these modules can be combined into the higher-level modules.

Though it's still not clear how to show, say, the maintenance with the expiration of the data in this notation. Would it make sense in the more complicated cases?

Monday, July 28, 2014

Triceps 2.0.0 released

It's a grand day for Triceps: version 2.0 is finally released. There aren't many code changes since the last snapshot but the documentation has been collected and edited together. The documentation has been a huge work that took over a year.

As a reminder, the code features of 2.0 include such things as the multithreading support, the streaming functions, TQL, and a great many small improvements.

Sunday, July 20, 2014

a minor TQL clean-up

I've been working on the documentation chapter about TQL, and noticed that its single-threaded version has not been updated to use the query name as the label in the response. So I've fixed it. The previous output like

query,{read table tSymbol}
lb1read OP_INSERT symbol="AAA" name="Absolute Auto Analytics Inc" eps="0.5"
+EOD,OP_NOP,lb1read

now becomes

query,{read table tSymbol}
query OP_INSERT symbol="AAA" name="Absolute Auto Analytics Inc" eps="0.5"
+EOD,OP_NOP,query

I've also added the optional argument for name override in SimpleServer::makeServerOutLabel().

Sunday, July 13, 2014

monads

I've recently been bullshitting onna internets with some fans of the functional programming, and the said fans are all ablaze about the monads. Yet they have a very hard time explaining what the benefit of the monads is, and even what these monads are. So I've gotten around to reading up on this stuff, and bullshitted some more, and here it is: the short explanation of the monads in programming in all their glory. This is the third version of it that I've figured out, and even though it might happen that I'd have more, I think that I've got to the bottom of it this time. The first version of my understanding was gained from Wikipedia, and as it always happens with Wikipedia, it turned out to be crap. Not all crap, only half-crap, but still.

If you're interested in the benefits, let me tell it up front: they seem to be that the glorious community of functional programming can now honestly use the loops and imbibe all their goodness with break and continue statements, and even exceptions, without losing the cherry of the strict mathematical theory. For the rest of us, nothing new.

Now, to the details. What the monads are really about is the iteration on containers, and nesting these iterations. For use in a monad, a container of objects has to have a couple of properties.

First, if you have a container, you must be able to iterate over it and create another container with exactly the same structure (i.e. if you can find an element in the first container in some way, you must be able to find an element in the second container in the same way) but different data (the difference may be both in the value and the type of the data). The data is supposed to be the result of some operation on the data from the first container at the same position, but the concept of an operation is quite wide open.

This property means that if your container is a list, you need to know how to create another list of the same number of elements, if it's an associative array, you need to know how to create another associative array with the same keys, if your container is a tree, you need to know how to create another tree with the exact same structure.

Second, if you have two containers, you need to have some rule on how to "combine" them into a third container of the same type. The structure of that third container can be anything, as long as it follows some consistent rule, and the data elements represent the results of some operation on one element from the first container and one element from the second container. For example, for the lists you can pick the rule that concatenates them, or produces every possible pair of elements from the first list and the second list, or combines the elements at the same position in both lists. Just as in the first property, the data placed in the result would be the result of some operation on the data from the two incoming containers.

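To make the two properties concrete, here is a minimal C++ sketch of them on the simplest container, a vector (my own illustration, nothing standard): the first function rebuilds the same structure with new data, the second combines two containers by the every-possible-pair rule.

#include <functional>
#include <vector>

// Property 1: rebuild the same structure (for a vector, the same length)
// with the data replaced by the result of an operation on each element.
template <typename A, typename B>
std::vector<B> mapOver(const std::vector<A> &in, std::function<B(const A &)> op)
{
    std::vector<B> out;
    out.reserve(in.size());
    for (const A &a : in)
        out.push_back(op(a));
    return out;
}

// Property 2: combine two containers into a third one by some consistent
// rule; here the rule is "every possible pair of elements".
template <typename A, typename B, typename C>
std::vector<C> combine(const std::vector<A> &va, const std::vector<B> &vb,
                       std::function<C(const A &, const B &)> op)
{
    std::vector<C> out;
    for (const A &a : va)
        for (const B &b : vb)
            out.push_back(op(a, b));
    return out;
}
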
What the monad does is apply that combination rule over a sequence of containers, combining them into one. The first two get combined together, then the result is combined with the third one, and so on. The monad is really a bunch of nested loops. In the texts about monads they like to discuss the example monads on the Haskell kinds of containers, like lists, and Maybe (a value that may be NULL), and IO (a strange container that performs input/output), but for all I can tell, the most entertaining and straightforward container would be an associative array, or as its variety, a database table.

For example, the following Perl loop represents a monad (remember, the data in the result may be produced by any operation on the data extracted from the argument container).

my %result;
my %a = ...; my %b = ...; my %c = ...;

while (($ka, $va) = each %a) {
  while (($kb, $vb) = each %b) {
    while (($kc, $vc) = each %c) {
      $result{"$ka,$kb,$kc"} = sprintf("%d", $vc ? $va : $vb);
    }
  }
}

And if A, B and C were tables, that would be an unconditional join. The loops don't have to go over the whole container, they may pick only one or even zero elements, and if such is the rule, they may as well be ifs:

while (($ka, $va) = each %a) {
  if (exists $b{$ka}) {
    if (exists $c{$ka}) {
      $result{$ka} = [ $va, $b{$ka}, $c{$ka} ];
    }
  }
}

And it's also a monad! If these were tables, it would be an inner join.

The first property of the container lets you do a loop without nesting, and the second property allows to do any number of nestings by applying it recursively.


The rules as stated above say that each nesting must follow the same rule of iteration on the same container type. But there is a weaseling way around it: we can always say that the container is a union that also contains the indication of what operation needs to be done with it, and then the rule that combines two containers would look at this indication in the argument containers and act accordingly. The type Maybe from Haskell does exactly this (and a monad on it provides the Haskell version of what you can think of as exceptions or break/continue). We can also weasel further and say that oh, we know this indication hardcoded at compile time, so instead of putting it into the container and then extracting it, we'll just hardcode the loop appropriately (and yes, Haskell does this). The end result is that ANY nested loops can form a monad. And since an if is a special case of a loop, and the unconditional execution is a special case of an if, any procedural code can be a monad.


Not everything would be a monad, but guess what, you can nest monads in monads, and that would give everything. And the returned container doesn't have to be anything like any of the containers involved in the loop, it's the same weasely thing.


The only thing that is left separating the monads from the usual procedural code is that all the data defined outside the loops is read-only (the local variables in the loop blocks can be read-write, which also can be explained in the weasely ways, but the variables outside the current block can't be changed), and the only result produced is the returned container. However, the result can contain a description of how the outside data needs to be modified, and the caller of the monad is then free to apply that description. It's just the ass-backwards ways of the functional programming. So with enough nested and recursive monads (not just nested loops but nested monads) and this apply-description business, all this monkeying can properly simulate the procedural programming.

That description thing can also be thought of as the redo log of a transaction: first the redo log is collected and then the whole transaction is applied. Obviously, for the operations inside the transaction to see the state as modified by the previous operations in the transaction, they would have to dig through the redo log.


The definition of a container is also pretty wide. Containers may, among other things, be infinite: "for (;;)" or "for (i=0;; i++)" are examples of iteration over an infinite container.


This brings us to the point of how the monads are related to CEP. Each nesting level in the monad gets an item of key and data from the previous level, uses it somehow to select zero or more values from the container it has, and sends to the next nesting level a newly combined item of key and data information for every value selected. It's a pipeline! The pipeline approach was actually the second version of my understanding of the monads.


What kind of a pipeline is it? It's a straight one: no branching, nor, god forbid, joining. And it's a read-only one: each stage of the pipeline contains no modifiable state, it's all read-only. All the updates have to be carried in the data traveling through the pipeline. And each stage of the pipeline essentially sends out a copy of its incoming data with its own section added to it.


Quite boring and useless. By the way, note that the join as described before would work only on the static tables, not on streams.

Now we get a bit off the monad subject as such and get into the depths of Haskell's dealing with pipelines.

Of course there are weaseling ways around the boring limitations. First, each stage of the pipeline may contain the nested pipelines (i.e. nested monads), and may decide which of the nested pipelines to take. That gives the fork-join topology (but, as you may notice, no "fork-and-no-join"). Second, these nested pipelines can be nested further to infinity. Third, the input point itself is a data object that can be passed through the pipeline.


This third item is something that takes some time getting the mind around. Basically, the first stage in the pipeline says "I'm not going to get the input from the outside world any more, instead I'm passing this functionality to the second stage". Obviously, after that the first stage becomes dead and can as well be discarded. Well, not always: there may be MULTIPLE data input points, either within one stage (remember, they are just data, and can be held together), or sprinkled throughout the pipeline. The first stage will become dead only after it gets rid of all its inputs. But let's just say for the sake of an example that there is one input point.


So, with all this weaseling a Haskell program simulates the changing states with two pipelines (i.e. monads):


The pipeline A implements the computation of the next application state.


The pipeline B consists of two stages:
1. Read one item of data, call pipeline A, collect the result from it, and send its result along with the input point to the stage 2.
2. Instantiate a nested copy of the pipeline B,  then send the input point and the state to that pipeline, and let it do its work.


It's an infinite number of pipelines, created as long as there is more data coming. And this being tail recursion, Haskell optimizes it out by replacing the second stage of pipeline B with a new copy of pipeline B, instead of nesting it. It discards the old stage 1 after it becomes dead. This really becomes a dynamic pipeline, endlessly growing at the back and dying out at the front. Of course, when things get to the actual running, that gets optimized out too. Instead of creating a new pipeline they just turn back to the old pipeline, say "now it's the new pipeline", give it the new state and let it run along.


As far as the parallelism is concerned, the stages of the pipeline A can work in parallel, but then when the time comes to apply the result and get the next input, it hiccups and becomes single-threaded.


Is there an important lesson in this? I don't see any, at least yet. Other than an interesting mental exercise in the ass-backwardness. And from all the asking of the monad enthusiasts on the internets, what exactly is the benefit they get from the monads, I didn't get any answer other than the hand-waving about the Mathematical Foundations. For all I can tell, there isn't one.

If you're interested in learning more about Haskell and its monads, I can recommend http://learnyouahaskell.com/

Saturday, June 21, 2014

status update

In case you wonder, Triceps is still alive, and I'm getting close to finishing the documentation for 2.0. There are only about 30 posts left to fold in. But unfortunately I've been very busy recently, so this keeps dragging on and on.

Saturday, May 3, 2014

shared_ptr-2

I've figured out what my problem with shared_ptr was. The problems there start with using reset(ptr) to store a newly constructed object in an existing shared_ptr. Which provokes one into using reset(ptr) for storing any pointers. The solution is to never use reset(ptr). Instead always turn the pointers into a shared_ptr first, then assign them. Such as one of:

something->shared_ = make_shared<Object>(args);
something->shared_ = shared_ptr<Object>(new Object(args));

or even make the constructors protected and wrap them into the static methods that would return a shared_ptr.
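
A minimal sketch of that factory approach (my own illustration of the pattern, not code from any particular library):

#include <memory>

class Object {
public:
    // The only way to get an Object is already wrapped in a shared_ptr,
    // so a naked pointer to it never floats around in the user code.
    static std::shared_ptr<Object> make(int x)
    {
        return std::shared_ptr<Object>(new Object(x));
    }

protected:
    explicit Object(int x) : x_(x) { } // not constructible directly

private:
    int x_;
};

// usage: std::shared_ptr<Object> p = Object::make(42);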

Which puts shared_ptr close enough to Autoref. It's still more awkward and less efficient, but a potential advantage is that it can handle any object, without the reference counting built into the objects themselves.

A consequence of this approach is that they absolutely need weak_ptr: they can't just have the naked pointers sitting around pointing up the trees, that breaks the whole paradigm. Having the weak references is also convenient if you really need them. But as my experience with Triceps shows, they are generally overrated. Never had to use them for Triceps, and the ptwrap library I use (and wrote) has them!

If used to point up the tree, there is not that much advantage compared to a simple pointer, because it's typically easy for the code to just hold that up-the-tree object separately while dealing with it. If used between the peer objects, one of the objects can usually be split in two. I.e., if there are objects A and B, B is usually interested not in the whole A but in a part of it. So separate that part into the object C. Then A would refer to B and C, and B would refer to C.
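
A tiny C++ sketch of that split (A, B and C are from the description above, the rest is my own illustration): the reference graph stays acyclic, so the plain shared_ptr suffices and no weak_ptr is needed.

#include <memory>

struct C { // the part of A that B actually cares about
    int sharedData = 0;
};

struct B {
    std::shared_ptr<C> c; // the peer reference goes to the split-out part
    explicit B(std::shared_ptr<C> cp) : c(cp) { }
};

struct A {
    std::shared_ptr<C> c = std::make_shared<C>();
    std::shared_ptr<B> b = std::make_shared<B>(c);
    // A -> B, A -> C, B -> C: no cycle, no weak_ptr needed
};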

Friday, May 2, 2014

std::shared_ptr

I've tried recently to use the C++11 version of the reference counting, the class std::shared_ptr. And I must say, it's crap. Utter and complete crap.

Their problem is that they are trying to do the reference counting on any structures at all, without storing the reference counter in the structure itself. The consequence is that you absolutely can't mix the shared_ptrs and the naked pointers. Once you store a pointer in a shared_ptr, you have to refer to it by shared_ptr, and assign the shared_ptr values to each other. If you take a naked pointer and store it in another shared_ptr, you end up with two reference counters to the same structure, and end up with the memory corruption. And it's way too easy to make this mistake. It's really not usable.
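
Here is a minimal sketch of the mistake in question (an illustration of the bug, not something to imitate): the second shared_ptr is constructed from the naked pointer instead of being assigned from the first one, so each gets its own reference counter.

#include <memory>

struct Data { int x = 0; };

int main()
{
    std::shared_ptr<Data> p1(new Data); // counter #1, use_count == 1

    Data *naked = p1.get(); // extract the naked pointer...
    std::shared_ptr<Data> p2(naked); // BUG: counter #2, use_count also == 1

    // The correct way would have been: std::shared_ptr<Data> p2 = p1;

    return 0;
    // p2 deletes the Data, then p1 deletes it again:
    // a double free and memory corruption
}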

The end result, my Autoref template is way, way, way better and safer to use.

Thursday, May 1, 2014

auto-detecting the file handle class when passing it through the App

I've been editing the documentation for the passing of the file descriptors between threads, and I've realized that there is no need to specify the class of the Perl file handle when it gets loaded from the App. Instead the name of the class can be easily stored along with the file descriptor and then extracted back. So I went and changed the code to do that. The modifications are:

In the App class the method storeFd() takes an extra argument:

$app->storeFd($name, $fd, $className);

The $className specifies the class of the file object. The empty string can be used as a synonym for "IO::Handle", since ref() returns an empty string for the globs of the old-fashioned file handles.

The methods loadFd() and loadDupFd() now return two values:

($fd, $fclass) = $app->loadFd($name);
($fd, $fclass) = $app->loadDupFd($name);

The second returned value is the class name, as it was stored by storeFd().

And the methods App::loadDupSocket(), TrieadOwner::trackDupSocket() and TrieadOwner::trackGetSocket() have been removed. Instead App::loadDupFile(), TrieadOwner::trackDupFile() and TrieadOwner::trackGetFile() have been updated to get the stored file handle class and use it transparently, so they just work correctly for the sockets now.

The methods App::loadDupFileClass(), TrieadOwner::trackDupClass() and TrieadOwner::trackGetClass() are still present, in case you want to override the class name, but now they should be pretty much never needed, since any class names should be handled automatically without the need for overrides.

And the C++ App class got changed a little bit as well, with the extended versions of storeFd() and loadFd():

void storeFd(const string &name, int fd);
void storeFd(const string &name, int fd, const string &className);
int loadFd(const string &name, string *className = NULL) const;

The new interface is backwards-compatible with the old one but also has the provision for storing and loading the file class name.