Saturday, December 17, 2011

surveying the landscape

What do we have in the CEP area now? The scene is pretty much dominated by Sybase (combining the former Aleri and Coral8) and StreamBase.

There seem to be two major approaches to the execution model. One was used by Aleri, another by Coral8 and StreamBase. I'm not hugely familiar with StreamBase, but that's how it seems to me. Since I'm much more familiar with Coral8, I'll be calling the second model the Coral8 model. If you find StreamBase substantially different, let me know.

The Aleri idea is to collect and keep all the data. The relational operators get applied on the data, producing the derived data ("materialized views") and eventually the results. So, even though the Aleri models were usually expressed in XML (though an SQL compiler was also available), fundamentally it's a very relational and SQLy approach.

This creates a few nice properties. All steps of execution can be pipelined and executed in parallel. For persistence, it's fundamentally enough to keep only the input data (what has been called BaseStreams and then SourceStreams), and all the derived computations can be easily reprocessed on restart. It's funny, but it turns out that it's often faster to read a small state from the disk and recalculate the rest from scratch in memory than to load a large state from the disk.
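To make the persistence point concrete, here is a minimal Python sketch of the idea, not Aleri's actual implementation. The names `SourceStream` and `build_totals` and the sample rows are made up for illustration; the only assumption is that the derived view is a pure function of the stored input rows.

```python
from collections import defaultdict


class SourceStream:
    """Hypothetical input stream: holds every input row.
    In this sketch it is the only state that would be persisted."""

    def __init__(self, rows=None):
        self.rows = list(rows) if rows else []

    def insert(self, row):
        self.rows.append(row)


def build_totals(rows):
    """Derived 'materialized view': total quantity per symbol.
    It depends only on the input rows, so it can always be recomputed."""
    totals = defaultdict(int)
    for symbol, qty in rows:
        totals[symbol] += qty
    return dict(totals)


source = SourceStream()
for row in [("AAA", 10), ("BBB", 5), ("AAA", 7)]:
    source.insert(row)

view = build_totals(source.rows)

# Simulate a restart: only the input rows survive, the view is rebuilt.
restored = SourceStream(source.rows)
rebuilt_view = build_totals(restored.rows)
```

The rebuilt view matches the original one because nothing in it depends on anything but the input, which is exactly the property a procedural step can break.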

It also has issues. It doesn't allow loops, and the procedural calculations aren't always easy to express. And keeping all the state requires more memory. The issues of loops and procedural computations have been addressed by FlexStreams: modules that would perform the procedural computations instead of relational operations, written in SPLASH, a vaguely C-ish or Java-ish language. However, this tends to break the relational properties: once you add a FlexStream, you usually do it for the very reasons that prevent the derived calculations from being re-done, which creates issues with saving and restoring the state. Mind you, you can write a FlexStream that doesn't break any of them, but then it would probably be doing something that can be expressed without it in the first place.

Coral8 has grown from the opposite direction: the idea has been to process the incoming data while keeping a minimal state in variables and short-term "windows" (limited sliding recordings of the incoming data). The language (CCL) is very SQL-like. It relies on the state of variables and windows being pretty much global (module-wide), and allows the statements to be connected in loops. This means that the execution order matters a lot, which in turn means that there are some quite extensive rules determining this order. The logic ends up being very much procedural, but written in the peculiar way of SQL statements and connecting streams.
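As a rough illustration of the "minimal state plus a short window" idea, here is a small Python sketch. It is not CCL and not how Coral8 implements windows; the class name `SlidingSum` and the count-based window of 3 rows are assumptions picked for the example.

```python
from collections import deque


class SlidingSum:
    """Hypothetical count-based window: keeps only the last `size` values
    and a running total, instead of the whole input history."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0

    def push(self, value):
        # Add the new row, then expire the oldest one if the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total


s = SlidingSum(3)
results = [s.push(v) for v in [1, 2, 3, 4, 5]]
# results holds the sum over the last 3 values after each arrival
```

Note that the result depends on the order in which the rows arrive and on when the expiry happens relative to the insert, which is a tiny version of why the execution order matters so much in this model.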

The good thing is that all this allows you to control the execution order very closely and write things that are very difficult to express in pure un-ordered relational operators. That in turn allows you to aggregate the data early and creatively, keeping less data in memory.

The bad news is that it limits the execution to a single thread. If you want a separate thread, you must explicitly make a separate module and program the communications between the modules, which is not exactly easy to get right. There are lots of people who do it the easy way and then wonder why they get the occasional data corruption. Also, the ordering rules for execution inside a module are quite tricky. Even fairly simple logic requires writing a lot of code, some of which is just bulky (try enumerating 90 fields in each statement), and some of which is tricky to get right.

The summary is that everything is not what it seems: the Aleri models aren't usually written in SQL but are very declarative in their meaning, while the Coral8/StreamBase models are written in an SQL-like language but in reality are totally procedural.

Sybase is also striving for a middle ground, combining the features inherited from Aleri and Coral8 in its CEP R5 and later: use the CCL language but relax the execution order rules to the Aleri level, except for the explicit single-threaded sections where the order is important. Include the SPLASH fragments for where the outright procedural logic is easy to use. Even though it sounds hodge-podgy, it actually came together pretty nicely. Forgive me for saying so myself, since I did a fair amount of the design and the execution logic implementation for it before I left Sybase.

Still, not everything is perfect in this merged world. The SQLy syntax still requires you to drag around all your 90 fields into nearly every statement. The single-threaded order of execution is still non-obvious. It's possible to write the procedural code directly in SPLASH, but the boundary where the data passes between the SQLy and C-ish code still has a whole lot of its own kinks (fewer than in Aleri). And worst of all, there is still no modular programming. Yeah, there are "modules" but they are not really reusable. They are tied too tightly to the schema of the data. What is needed is something more like C++ templates.
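To show what a template over field names could buy, here is a small Python sketch of a code generator. The statement it produces is generic SQL-like text made up for the example, not real CCL, and the function `make_copy_statement` and the stream names `InTrades` and `OutTrades` are hypothetical.

```python
def make_copy_statement(source, target, fields):
    """Generate an SQL-like statement that copies the listed fields.
    The output syntax is illustrative only, not any product's dialect."""
    # Check the template parameters up front, so the error message
    # talks about the parameters rather than about the generated text.
    if not fields:
        raise ValueError("fields must not be empty")
    dups = sorted({f for f in fields if fields.count(f) > 1})
    if dups:
        raise ValueError("duplicate field names: " + ", ".join(dups))
    # The loop over the field list is the part that plain
    # parameter substitution cannot express.
    columns = ",\n".join(f"    {source}.{name}" for name in fields)
    return f"INSERT INTO {target}\nSELECT\n{columns}\nFROM {source};"


fields = [f"field{i}" for i in range(1, 91)]
statement = make_copy_statement("InTrades", "OutTrades", fields)
```

The 90 fields are written once, as a list, and every statement that needs them can be generated from that list instead of being typed out by hand.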

4 comments:

  1. Hello! I'm a runtime engineer over at StreamBase. I just wanted to let you know about some of the work we're doing over here to ameliorate some of the issues you've pointed out.

    Specifically: "Yeah, there are "modules" but they are not really reusable. They are tied too tightly to the schema of the data. What is needed, is more like C++ templates."

    We've come up with a way of cleanly allowing type parameterization of StreamBase modules, released in StreamBase version 7.2. The documentation is here; the research paper I presented at the last Distributed and Event Based Systems conference is here.

  2. The capture fields are definitely a step in the right direction. However they don't seem to solve the field-names-as-parameters problem. They probably could implement the first example in http://babkin-cep.blogspot.com/2011/12/little-about-templates.html, though maybe with a more cumbersome calling sequence, but I think not the second one.

  3. Checking out the "A little about templates" article; I debated posting this comment there, but I think keeping the conversation in one place might make more sense.

    For the first example:

    Yes, capture fields enable doing this in a clean way. The idiom in StreamBase would be to create a module that exports a table (as queryable as any other table) and maintains the contents of that table as desired. Capture fields are what allow the exported table's schema to be based on any input schema to the module.

    For the second example:

    This kind of "templating" is exactly what we use StreamBase "module parameters" for. They work very much like you suggested your template language might work, generating a "substituted" version of a StreamBase module that uses the parameters specified at each module call site.

    Unfortunately, module parameters do suffer from all the problems you'd expect of any kind of templating system that generates code: errors get arcane and difficult to debug, static type guarantees become a lot more difficult, compilation times become longer because less work can be shared, etc. Capture fields allow us to avoid these problems of templating for a wide range of real-world cases: whenever the modules are parameterized by type, instead of by arbitrary expression.

  4. For (1) it actually sounds like StreamBase simply doesn't need this particular pattern because it allows direct access to tables, which are the same thing as what Coral8 calls windows. Same as Aleri.

    For (2), I don't really understand how StreamBase works, but it looks like it can't really take a list of fields as a parameter value. It's the same limitation as the C++ templates have. I've been feeling for a long time that the C preprocessor and later C++ templates should have had directives for conditional compilation and loops in the macros/templates. The lack of loops leads to all kinds of recursive monstrosities in the Alexandrescu book about C++ templates.

    I hope that using Perl for the macro-language can solve both the problem with loops in code generation and with debugging difficulties: provide the flexible code generation, and also the more meaningful high-level error messages. I.e. when an error is found, the code can provide a more meaningful message of what is wrong with the template parameters rather than simply dumping the call trace. And of course, it could do the explicit checks of template parameters before using them.
