Sunday, November 10, 2013

Triceps performance

I've finally got interested enough in Triceps performance to write a little test, Perf.t. By default it runs only one thousand iterations, to be fast and not delay the run of the full test suite. But the number can be increased by setting an environment variable, like:

$ TRICEPS_PERF_COUNT=100000 perl t/Perf.t

An important caveat: the test exercises the Perl interface, so it includes all the overhead of constructing the Perl objects. I've tried to structure it so that some of the underlying performance can be deduced, but it's still approximate. I haven't done the performance testing of just the underlying C++ implementation yet; its numbers will be better.

Here are the numbers I've got on my 6-year-old laptop (dual-core Intel Core2 T7600 at 2.33GHz), with explanations. The time in seconds for each value is for the whole test loop. The "per second" number shows how many loop iterations were done per second.

The computations are done with the real elapsed time, so if the machine is not idle, the time of the other processes still gets counted against the tests, and the results come out slower than they really are.
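To give an idea of how the numbers are computed, here is a rough sketch of the wall-clock timing pattern (not the literal code from Perf.t), shown for the calibration case of an empty loop:

use Time::HiRes qw(time);

my $count = $ENV{TRICEPS_PERF_COUNT} || 1000;

# time() from Time::HiRes returns the wall-clock time with sub-second precision
my $start = time();
for (my $i = 0; $i < $count; $i++) { }
my $elapsed = time() - $start;
printf("Empty Perl loop %f s, %.2f per second.\n", $elapsed, $count / $elapsed);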

Performance test, 1000 iterations, real time.

The first thing it prints is the iteration count, to set the expectations for the run length and precision.

Empty Perl loop 0.000083 s, 11983725.71 per second. 

A calibration to see how much overhead is added by the execution of the loop itself. As it turns out, not much.

Row creation from array and destruction 0.003744 s, 267085.07 per second. 

The makeRowArray() for a row of 5 fields. Each created row gets destroyed before the next one gets created.
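For orientation, here is roughly what this kind of loop looks like; the field names and types are my guesses, not copied from Perf.t:

use Triceps;

# a hypothetical 5-field row type, a stand-in for the one in Perf.t
my $rt = Triceps::RowType->new(
    a => "int32",
    b => "int32",
    c => "int64",
    d => "float64",
    e => "string",
);

for (my $i = 0; $i < $count; $i++) {
    # $row goes out of scope at the end of the iteration,
    # so the row gets destroyed before the next one is created
    my $row = $rt->makeRowArray(1, 2, 3, 3.14, "text");
}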

Row creation from hash and destruction 0.006420 s, 155771.52 per second.

The makeRowHash() for a row of 5 fields.
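The same loop in the hash format, which also has to translate the field names on every call (continuing with the hypothetical $rt from the sketch above):

for (my $i = 0; $i < $count; $i++) {
    my $row = $rt->makeRowHash(
        a => 1, b => 2, c => 3, d => 3.14, e => "text",
    );
}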


Rowop creation and destruction 0.002067 s, 483716.30 per second.

The makeRowop() from an existing row. Same thing, each rowop gets destroyed before constructing the next one.
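A rowop needs a label to point to, so the sketch for this case would go along these lines (the unit and label names are placeholders of mine):

my $unit = Triceps::Unit->new("u");
my $lbDummy = $unit->makeDummyLabel($rt, "lbDummy");
my $row1 = $rt->makeRowArray(1, 2, 3, 3.14, "text");

for (my $i = 0; $i < $count; $i++) {
    # the row is reused, only the rowop gets created and destroyed
    my $rop = $lbDummy->makeRowop("OP_INSERT", $row1);
}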

Calling a dummy label 0.001358 s, 736488.85 per second.

Repeated calls of a dummy label with the same rowop object.
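Continuing the sketch, with the rowop now created once outside the loop:

my $rop = $lbDummy->makeRowop("OP_INSERT", $row1);

for (my $i = 0; $i < $count; $i++) {
    # the dummy label does nothing, this measures the pure call overhead
    $unit->call($rop);
}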

Calling a chained dummy label 0.001525 s, 655872.40 per second.
  Pure chained call 0.000167 s, 5991862.86 per second.



Repeated calls of a dummy label that has another dummy label chained to it. The "pure" part is the difference from the previous case, i.e. the extra cost added by the chained dummy label.
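Sketch-wise, the only difference is one more dummy label chained to the first one:

my $lbDummy2 = $unit->makeDummyLabel($rt, "lbDummy2");
$lbDummy->chain($lbDummy2);  # a call of $lbDummy now also propagates to $lbDummy2

for (my $i = 0; $i < $count; $i++) {
    $unit->call($rop);
}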


Calling a Perl label 0.006669 s, 149946.52 per second.

Repeated calls of a Perl label with the same rowop object. The Perl label has an empty sub but that empty sub still gets executed, along with all the support functionality.
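In a sketch, with the handler sub left empty:

my $lbPerl = $unit->makeLabel($rt, "lbPerl", undef, sub { });  # empty handler sub
my $ropPerl = $lbPerl->makeRowop("OP_INSERT", $row1);

for (my $i = 0; $i < $count; $i++) {
    $unit->call($ropPerl);
}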

Row handle creation and destruction 0.002603 s, 384234.52 per second.

The creation of a table's row handle from a single row, including the creation of the Perl wrapper for the row handle object.

Repeated table insert (single hashed idx, direct) 0.010403 s, 96126.88 per second.

Insert of the same row into a table. Since the row is the same, it keeps replacing the previous one, and the table size stays at 1 row. Even though the row is the same, a new row handle gets constructed for it every time by the table; the code is $tSingleHashed->insert($row1). "Single hashed idx" means that the table has a single Hashed index, on an int32 field. "Direct" means the direct insert() call, as opposed to using the table's input label.
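Such a table could be set up approximately like this (the key field "a" and the names are my placeholders, and depending on the Triceps version, makeTable() may also take an enqueueing mode argument):

my $ttSingleHashed = Triceps::TableType->new($rt)
    ->addSubIndex("primary",
        Triceps::IndexType->newHashed(key => [ "a" ])
    );
$ttSingleHashed->initialize();
# older Triceps versions take an enqueueing mode argument before the name
my $tSingleHashed = $unit->makeTable($ttSingleHashed, "tSingleHashed");

for (my $i = 0; $i < $count; $i++) {
    $tSingleHashed->insert($row1);  # the same row keeps replacing itself
}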

Repeated table insert (single hashed idx, direct & Perl construct) 0.014809 s, 67524.82 per second.
  RowHandle creation overhead in Perl 0.004406 s, 226939.94 per second.


The same, only the row handles are constructed in Perl before inserting them: $tSingleHashed->insert($tSingleHashed->makeRowHandle($row1)). And the second line shows that the overhead of wrapping the row handles for Perl is pretty noticeable (it's the difference from the previous test case).

Repeated table insert (single sorted idx, direct) 0.028623 s, 34937.39 per second.

The same thing, only for a table that uses a Sorted index that executes a Perl comparison on the same int32 field. As you can see, it gets 3 times slower.
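The Sorted index variant would be declared approximately like this, with the Perl comparison sub called for every pair of rows being ordered:

my $ttSingleSorted = Triceps::TableType->new($rt)
    ->addSubIndex("primary",
        Triceps::IndexType->newPerlSorted("by_a", undef, sub {
            # a Perl comparison on the int32 field "a"
            return $_[0]->get("a") <=> $_[1]->get("a");
        })
    );
$ttSingleSorted->initialize();
my $tSingleSorted = $unit->makeTable($ttSingleSorted, "tSingleSorted");
# the insert loop itself is the same as for the hashed table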

Repeated table insert (single hashed idx, call) 0.011656 s, 85795.90 per second.

The same thing, again the table with a single Hashed index, but this time by sending the rowops to its input label.
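Which in a sketch amounts to:

my $lbIn = $tSingleHashed->getInputLabel();

for (my $i = 0; $i < $count; $i++) {
    $unit->call($lbIn->makeRowop("OP_INSERT", $row1));
}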

Table insert makeRowArray (single hashed idx, direct) 0.015910 s, 62852.02 per second.
  Excluding makeRowArray 0.012166 s, 82194.52 per second.


Now different rows get inserted into the table, each with a different key. At the end of this test the table contains 1000 rows (or however many were requested through the environment variable). Naturally, this is slower than the repeated insertions of the same row, since the tree of the table's index becomes deeper and requires more comparisons and rebalancing. The performance will be lower still in the runs with more rows, since the index will become even deeper and create more overhead. Since the rows are all different, they are created on the fly, so this row creation overhead needs to be excluded to get the actual table's performance, which is what the "Excluding makeRowArray" line shows.
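The loop in this case builds a new row with a different key on every iteration, roughly:

for (my $i = 0; $i < $count; $i++) {
    # a different value in the key field "a" on each iteration,
    # so the table grows to $count rows; the cost of makeRowArray()
    # is measured separately and subtracted for the "excluding" line
    $tSingleHashed->insert($rt->makeRowArray($i, 2, 3, 3.14, "text"));
}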

Table insert makeRowArray (double hashed idx, direct) 0.017231 s, 58033.37 per second.
  Excluding makeRowArray 0.013487 s, 74143.61 per second.
  Overhead of second index 0.001321 s, 756957.95 per second.


Similar to the previous case but on a table that has two Hashed indexes (both on the same int32 field). The detail lines here also compute the overhead contributed by the second index.
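The two-index table type would look approximately like this, with both indexes hashed on the same placeholder field:

my $ttDoubleHashed = Triceps::TableType->new($rt)
    ->addSubIndex("primary",
        Triceps::IndexType->newHashed(key => [ "a" ]))
    ->addSubIndex("secondary",
        Triceps::IndexType->newHashed(key => [ "a" ]));
$ttDoubleHashed->initialize();
my $tDoubleHashed = $unit->makeTable($ttDoubleHashed, "tDoubleHashed");
# the inserts then proceed the same way as before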

Table insert makeRowArray (single sorted idx, direct) 0.226725 s, 4410.64 per second.
  Excluding makeRowArray 0.222980 s, 4484.70 per second.


Similar, but for a table with a Sorted index that uses a Perl comparison. As you can see, it's about 20 times slower (and it gets even worse for larger row sets).

Nexus pass 0.034009 s, 29403.79 per second.

The performance of passing the rows between threads through a Nexus. This is a highly pessimistic case, with only one row per nexus transaction. The time also includes the draining and stopping of the app.

And here are the numbers for a run with 100 thousand iterations, for comparison:

Performance test, 100000 iterations, real time.
Empty Perl loop 0.008354 s, 11970045.66 per second.
Row creation from array and destruction 0.386317 s, 258854.76 per second.
Row creation from hash and destruction 0.640852 s, 156042.16 per second.
Rowop creation and destruction 0.198766 s, 503105.38 per second.
Calling a dummy label 0.130124 s, 768497.20 per second.
Calling a chained dummy label 0.147262 s, 679062.46 per second.
  Pure chained call 0.017138 s, 5835066.29 per second.
Calling a Perl label 0.652551 s, 153244.80 per second.
Row handle creation and destruction 0.252007 s, 396813.99 per second.
Repeated table insert (single hashed idx, direct) 1.053321 s, 94937.81 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 1.465050 s, 68257.07 per second.
  RowHandle creation overhead in Perl 0.411729 s, 242878.43 per second.
Repeated table insert (single sorted idx, direct) 2.797103 s, 35751.28 per second.
Repeated table insert (single hashed idx, call) 1.161150 s, 86121.54 per second.
Table insert makeRowArray (single hashed idx, direct) 1.747032 s, 57239.94 per second.
  Excluding makeRowArray 1.360715 s, 73490.78 per second.
Table insert makeRowArray (double hashed idx, direct) 2.046829 s, 48856.07 per second.
  Excluding makeRowArray 1.660511 s, 60222.41 per second.
  Overhead of second index 0.299797 s, 333559.51 per second.
Table insert makeRowArray (single sorted idx, direct) 38.355396 s, 2607.20 per second.
  Excluding makeRowArray 37.969079 s, 2633.72 per second.
Nexus pass 1.076210 s, 92918.63 per second.

As you can see, the table insert performance got worse due to the added depth of the index trees, while the nexus performance got better because the drain overhead got spread over a larger number of rows.

Wednesday, November 6, 2013

Redundancy

Here is another entry on the general things.

Fairly often we want things to be redundant: run two or more copies of a system, and if one of them fails, another one picks up after it. Two is the minimal number of instances, and it's a pretty unstable one: besides the chance of one instance failing, there is also the chance of a network partition. In the case of a network partition you don't want the two copies to start messing with the data independently; instead you really want to keep only one copy as the master and shut down the other one. But a failure and a partition are pretty hard to tell apart, which makes the 2-instance configurations quite unstable.

Even if you have a 3-instance configuration (a good idea overall), you're still not immune. If one instance goes down for scheduled maintenance, you're back to the 2-instance situation.

How can the situation be made more stable?

One obvious but expensive option is to just create more instances. Not only is it expensive, it also adds its own problems. Suppose there are 4 instances to start with, and one instance finds that two other instances went down. Does it mean that these two instances just died (possibly from some common cause) or that a network partition has occurred?

An improvement on that would be to create the extra instances not as full ones but just as "beacons" that only respond for the purpose of detecting the partitioning and don't actually run the code. Then the load on these beacon instances will be low, and they can be combined with machines running some other systems. Or a dedicated machine can serve as a beacon for many systems in the company. Well, once you ask "what if a beacon goes down?", this starts growing into a system of redundant beacons with their own partitioning problems. And it could actually get stupid, with both instances seeing the beacon but not seeing each other, but that can be resolved by communicating through the beacon.

But a typical situation is a pipeline of systems. Each system is connected to one or more source systems and/or one or more sink systems (so it could be not just a straight pipeline but technically a tree). And each system in the pipeline would be duplicated or triplicated, with each instance cross-connected to all the duplicates of each source and sink. In this situation the master node of a source can be used as such a beacon! This would make even a 2-instance configuration stable.