
Thursday, August 7, 2025

programming by example

Everyone is excited one way or another about "vibe programming" with AI. Will it replace the professional programmers? In short, I'd say, probably not. Will it improve productivity? I think, probably yes, but maybe not in the ways that people envision now. Will it allow non-programmers to write programs without involving professionals? Definitely yes (and observing such an experience is what prompted me to write this post), but probably not in the industrial settings.

My personal experience of using AI is not really successful. But that's probably because I'm not asking it to build the simple things. I ask it things when I get stumped, as a shortcut instead of digging through piles of code or searching through (often non-existent) documentation. And the reality is that when I get stumped, the AI gets stumped too. It offers solutions but they don't work, like referring to some non-existent libraries. But it's still useful, because it shows at least semi-working examples and points in a direction for further digging, either in the code or in the documentation, and it happens to work better than a simple search.

The really important part here is the generated examples. Note that the previous version of programming-without-understanding was to search and copy-paste examples from Stack Overflow and the like. And the same works even in your own code: it's always harder to write some new thing from scratch, but once you have an example of something similar in your codebase, you can copy and modify it very quickly and easily.

Over 25 years ago I had an argument about a man page: I was saying that we should add an example of usage of the function at the end of the man page, and a more senior engineer was telling me that no, the examples don't belong in the man pages, which should only contain the descriptions. The experience of the following 25 years definitely proved me right: it's much easier to understand the usage of a function from an example and copy-paste that example than to read a lengthy description and try to figure out all the fine points and all the ambiguities in it. When we write APIs, we write them with some specific usage in mind (with some exceptions like a bunch of Enterprise Java APIs, but that's the reason why those are horrible). And it's much better and easier for everyone to provide the examples of this usage.

"Test-driven programming" is another example of the importance of examples: it tells us to start by writing the examples of usage, and only then the implementation. The name "test-driven" is actually misleading, since the examples are not tests. The point of examples is to show the typical usage; the point of tests is to exercise the corner cases in a much more detailed and elaborate way. So if you take that name at face value and think that the examples are the tests, or start by writing the actual tests, you're going to have a bad time. But if you think of it as "example-driven programming", it's great. And that's basically what the AI-generated code is: examples on steroids, fine-tuned to your own case but not necessarily fully correct.
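To make the examples-versus-tests distinction concrete, here is a tiny sketch (my own illustration with a made-up split() function, not from any real library): the example shows the typical usage you'd copy-paste, while the tests poke at the corner cases.

#include <cassert>
#include <string>
#include <vector>

// A made-up function, used only to illustrate the point.
std::vector<std::string> split(const std::string &s, char sep) {
    std::vector<std::string> parts;
    std::string cur;
    for (char c : s) {
        if (c == sep) {
            parts.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    parts.push_back(cur);
    return parts;
}

// The example: shows the typical usage, one obvious case.
void example() {
    std::vector<std::string> parts = split("a,b,c", ',');
    // parts is {"a", "b", "c"}
}

// The tests: dig into the corner cases, a different job entirely.
void tests() {
    assert(split("a,,b", ',').size() == 3); // empty field in the middle
    assert(split(",", ',').size() == 2);    // both fields empty
    assert(split("", ',').size() == 1);     // one empty field, not zero
}

int main() {
    example();
    tests();
    return 0;
}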

By the way, if you think that the programming ability comes to the AI naturally, you're mistaken. There are companies that specialize in producing the labeled datasets for AI training, and they provide the specialized datasets for programming too. Moreover, they do the fine-tuning, running sessions through corrections and modifications to analyze the errors and make the code work, with these sessions incorporated into the training data (and yes, the same kind of fine-tuning happens for the non-programming subjects too).

And, well, this is also how it works for the non-professionals: the AI supplies them with tailored examples that often work correctly for the simple problems. An important point is that the problems might be logically simple but require a wide knowledge of a huge library that would take a professional programmer a long time to learn the old-fashioned way. So that's also a great way for the professional programmers to start using an unfamiliar library and learn its important parts. But also, what is trivial for a professional isn't trivial for someone who doesn't know much about programming. This definitely brings programming to the masses. And if in some professional settings these people (say, quantitative analysts) can hire professional programmers for help, there are lots of people who could use programming casually but can't hire a specialist. The AI is a great resource for them. But does it threaten the professional labor market? Probably not, because these are mostly people who couldn't afford to hire a professional in the first place. Although it probably would affect the low end of the market, where people hire a barely competent programmer for specialized repeated tasks. However, setting the problems for the AI is also a skill that isn't equally mastered by everyone, so there will be a new market for people who specialize in managing the AI.

The problem that the non-professionals have with the AI-generated code is that even the small errors stump them. Accidentally lose or insert an extra backslash in a regular expression, and nothing works any more, and someone who doesn't know regular expressions will never be able to tell why. Perhaps the AI could be trained to do the debugging by offering things to try and asking to paste back the printouts. Like, you know, a programmer in the 1960s instructing a non-professional client over the phone. This AI-assisted debugging would also come in handy for the professionals.

How about the more general predictions for the rest of us? Will it run us out of business? My guess is that it will change the relative value of the skills similarly to how the introduction of the compilers changed them. The wider use of AI will probably devalue the specialized knowledge of various libraries, making them easier to learn by example. It would probably also create better code analysis tools, able to expand on the fly all the "auto" declarations into actual types and show where a variable came from. There probably will be new AI-oriented programming languages too. The human-oriented programming languages are built towards reducing the redundancy, making things easier to express in a concise way, manageable for the limited attention span of a human mind. The AI-oriented programming languages would probably have a lot more things spelled out explicitly as a way to catch more errors in the AI-generated code. Perhaps even the same languages will have the human and AI versions, with all the explicitly spelled-out redundant statements either hidden or expanded. And the expanded form would also be a great code analysis tool for the humans.

Saturday, April 19, 2025

BCPL, B, and C

Have you ever wondered what the earlier BCPL and B languages were like? As it turns out, Bell Labs still hosts Dennis Ritchie's historic web pages (http://cm.bell-labs.co/who/dmr/) that include the descriptions of BCPL (http://cm.bell-labs.co/who/dmr/bcpl.pdf) and B (http://cm.bell-labs.co/who/dmr/kbman.pdf), and also the source code of a couple of early versions of the C compiler (http://cm.bell-labs.co/who/dmr/primevalC.html).

So as far as I can tell, B and C aren't exactly breaking points in the language itself; instead, the language gradually evolved from BCPL (just as, according to the stories, there weren't any versions of Unix inside Bell Labs research: it was just snapshotted and packaged as versions for external consumption). BCPL is a very old language, from the times when the computers had widely differing character sets (mostly in 6-bit encodings), so the language document defined the logical lexemes that got mapped to specific characters in each implementation. So you could see how they'd say "let's use curly braces for the block start and end", "let's skip the statements that are duplications to save memory" (BCPL had separate keywords for positive and negative conditions in the loops, and also the version of "if" statement with "else" was named "test"), "some more operators would be convenient", and get a Bell Labs dialect of BCPL that became B. And then B is almost the same as an early version of C.

It looks like the breaking points for "a new language" were in fact determined by the compiler back-end, the code generation. B went from direct assembly to "threaded code" (having nothing to do with our modern idea of threads), which was apparently a popular concept in the 1970s (and is still most widely known from the Forth language). The idea there is that the program is built of a sequence of subroutine addresses, each of them essentially an instruction of a virtual machine. But these instructions don't need parsing: all you need to do is jump to the address, and you get to the machine-code subroutine that implements this virtual instruction. Except that at the end of the subroutine it doesn't do a return; instead it jumps to the address in the next virtual machine instruction. On the PDP-11 this jump was done as

jmp *(r3)+

with r3 containing the program counter of the virtual machine. That's where the "threaded" in the name comes from, r3 is the "thread" that sews together the code fragments. The advantage is that the generated code is more compact than the direct machine code (a big deal for the very small memories of the day), and also as the interpreters go, it's a very fast one, losing to the direct machine code only by a factor of 3-5 or so. The idea of threaded code can still be useful for the asynchronous programming, as described in https://babkin-cep.blogspot.com/2025/02/asynchronous-programming-9-composition.html. And then C went back from the threaded code to the direct machine code generation.
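Here is a minimal sketch (my own illustration, not code from the B compiler) of how threaded code can be emulated in portable C++. Real threaded code never returns to a dispatch loop, each routine jumps straight to the next one, but the loop below is the closest portable equivalent of the fetch-and-jump done by "jmp *(r3)+":

#include <cstdio>

struct Instr;   // a cell of threaded code: just an address to jump to
typedef void (*Fn)(const Instr *&pc, int &acc);
struct Instr { Fn fn; };

// Each virtual instruction is a plain machine-code routine.
void v_inc(const Instr *&pc, int &acc)    { acc += 1; }
void v_double(const Instr *&pc, int &acc) { acc *= 2; }
void v_print(const Instr *&pc, int &acc)  { std::printf("%d\n", acc); }
void v_halt(const Instr *&pc, int &acc)   { pc = nullptr; }

int main() {
    // The compiled "program" is nothing but a sequence of addresses.
    const Instr program[] = {
        {v_inc}, {v_double}, {v_inc}, {v_print}, {v_halt},
    };
    int acc = 0;
    const Instr *pc = program;  // pc plays the role of r3
    while (pc != nullptr) {
        Fn f = pc->fn;          // fetch the address of the routine...
        ++pc;                   // ...advance the thread...
        f(pc, acc);             // ...and jump to it; prints 3 and halts
    }
    return 0;
}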

Some interesting tidbits about BCPL and B:

* The lvalue and rvalue terms come all the way from CPL (of which BCPL is a reduced version). The words "pointer", "reference", and even "address" don't appear in the BCPL document at all, only "lvalue". It actually would be very difficult to understand without the prior knowledge of the lvalue and rvalue terms, but my guess is that it builds on the CPL definition with a better explanation.

* BCPL is a word-based language where the addresses are addresses of words, so there is no difference between integer and pointer arithmetic. When implemented on the PDP-11, a byte-addressed 16-bit machine, B still has lvalues as word addresses, shifted right by one bit from the machine addresses. So adding 1 to the address actually does add 1 to the number, and the opposite shift left by one bit is done only when dereferencing the pointer (see the sketch after this list).

* The strings had both '\0' at the end and the 1-byte string length at the front.

* The arrays (naturally, of words, as this was the only supported type) used one more word before the array to store the address (lvalue) of the first word of the array, rather than using a constant for that. And that lvalue was also modifiable!

* The multi-character character constants that still exist in C originate in BCPL, where you could specify up to one word's worth of characters (which might be six).

* The "break" statement (but not "continue") existed in BCPL, disappeared in B, and then reappeared in C.

* BCPL had both symbolic constants and a preprocessor, both of which disappeared in B.
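Here is a small simulation (my own sketch, not actual B code) of that word-addressed pointer model: an lvalue is the machine byte address shifted right by one bit, pointer arithmetic is plain integer arithmetic, and the shift left happens only on dereference.

#include <cassert>
#include <cstdint>

uint8_t core[64];  // simulated byte-addressed core memory

// The rvalue of a word, given its lvalue (i.e. its word address).
int16_t deref(uint16_t lv) {
    uint16_t byte_addr = lv << 1;  // the shift happens only here
    return (int16_t)(core[byte_addr] | (core[byte_addr + 1] << 8));
}

// Assignment through an lvalue (PDP-11 is little-endian).
void assign(uint16_t lv, int16_t v) {
    uint16_t byte_addr = lv << 1;
    core[byte_addr] = (uint8_t)(v & 0xff);
    core[byte_addr + 1] = (uint8_t)((v >> 8) & 0xff);
}

int main() {
    uint16_t v = 10;     // lvalue of a 3-word "array" at word 10
    assign(v, 111);
    assign(v + 1, 222);  // pointer + 1 is literally integer + 1
    assign(v + 2, 333);
    assert(deref(v + 1) == 222);
    return 0;
}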

Saturday, June 3, 2023

the first actual use of Triceps

I've recently got Triceps used for a real application, for the first time ever, after approximately 13 years. Not really a production usage, rather a scientific demo, but it's an outside project, not part of Triceps itself. Embedded in Matlab, of all things. No, not a full Matlab API, but an application written in C++ embedded as a native function into Matlab. Which is exactly what Triceps was designed for.

But really, Matlab and Triceps with a full Matlab API might be a match made in heaven. Matlab really sucks at two things: the data structures where you can append and search by value, and parallelism. And those are things where Triceps excels. Yeah, Matlab has the parfor loops, but the trouble there is that the parallel threads (they're more typically processes but logically still threads) are stateless. All the input data is packed up and sent to them (at a great overhead), and then the results are packed up and sent back. You can't just preserve state in a thread between two calls, it has to be sent from scratch every time. And no, there doesn't seem to be a way to keep the same constant in shared memory read in parallel by multiple threads: it actually gets copied for every thread. So parfor only works well when you send a small amount of data, process it for some long time, and then send a small result back. Not well when you want your thread to make queries to a large database. But keeping state is what a Triceps thread does. The Triceps threads are also easy to organize in pipelines. (Yeah, Matlab does have some pipelines in one of its extensions, but they seem as cumbersome as parfor.) And the read-only shared memory would work too if queried through Triceps, with only the small results of the queries returned to Matlab. It could work really, really awesomely together. The trouble of course is that I personally don't have much of an incentive to do this work.

That's the good part. What went badly? Well, Triceps in C++ feels rather dated. It's a project that was started in 2010, before C++11, and it feels like that. I didn't even aim for C++ to be the main language, but more as an API for embedding into the other languages. But now it can be done much more smoothly straight in C++. So if I ever get to it, a modernization of the C++ API is in order. Some libraries are dated too, in particular NSPR. It also proved to be a pain in building: I hadn't thought much about it before, but the definitions used in building the applications that use Triceps have to be the same as when building Triceps itself. So if Triceps is built with NSPR, and the application doesn't include the right definitions for the Triceps headers to use NSPR, it crashes in a surprising way. Fortunately, the standard C++ library now has APIs for the atomic values, so a transition to that API is in order. On the other hand, shared_ptr is a more complicated question, and keeping the Triceps Autoref might still be better for efficiency.

Wednesday, November 7, 2018

how to reinvent the bicycle 2

I've been thinking more about the problem of teaching the engineering
approach to problem solving, and I've decided to try using another
interview problem as an example. It's an older problem, and I really like
it, but I don't use it any more because the other people liked and used it
too, and the solutions are already available on the Internet.

I'm not just going to give the solution, since blindly remembering it is of
no use to anyone. I want to show how the better solutions can be born out of
the bad solutions. I'm going to start with the worst solution I can think
of, and then gradually show the better solutions. The puzzle for you, the
reader, is to use the issues in these solutions as hints towards the better
solutions, taking you as far ahead as possible.

I wrote those solutions as I would do at an interview, without actually
compiling and running the code on a computer. So they might contain bugs,
from trivial bugs to the major ones. Whenever I've noticed a major bug
after writing a solution and following through its execution manually, I've
left the buggy version in, and added the fixed version. Sometimes a solution
was so overcomplicated that I didn't notice a bug in it until I moved on to
another, simpler solution. There still might be bugs, but hopefully not too
bad ones. It just shows that the overcomplicated approaches should be
avoided.

The problem is to write a matcher for the very simple regular expressions
that include only the operators "." (any character) and "*" (any number of
repeats of the previous character). The "*" is greedy, consuming as many
matching characters as possible. There is no escape character like
backslash. The string must match completely, as if the regexp implicitly
had "^" at the front and "$" at the end.

For the Windows people who don't know regular expressions, the regexps are
a way of pattern matching, kind of like the file patterns (which are known
in the Unix world as globs) but different. Most of the characters in the
regexps have the literal meaning, just as in the file patterns. The
character "." is like "?" in the file patterns, matching any single
character. The character "*" is different from the file patterns: it
means zero or more repetitions of the previous element, so the file pattern
"*" is analogous to the regexp ".*".

The function declaration will be:

bool match(const char *pattern, const char *text);

Let's start with the analysis.

The first thing to notice about this problem is that some regular expressions
in it are impossible to match. The "a*a" will never match anything because
the greedy "a*" will consume all the "a"s, and the second "a" will never encounter
a match. The same goes for ".*" followed by anything, because ".*" will
consume everything to the end of the string.
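To make the expected behavior concrete, here are a few example calls with
the results I'd expect from the problem statement (my own reading of it,
written as asserts that can be run against any of the versions below):

#include <cassert>

void expected_behavior() {
    assert(match("abc", "abc"));    // literals match exactly
    assert(!match("abc", "abcd"));  // the whole string must match
    assert(match("a.c", "axc"));    // "." matches any one character
    assert(match("a*b", "b"));      // "*" means zero or more repeats
    assert(match("a*b", "aaab"));
    assert(!match("a*a", "aa"));    // greedy "a*" eats all the "a"s
    assert(match(".*", "anything at all"));
    assert(!match(".*a", "xya"));   // ".*" consumes the whole string
}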

The other thing to consider is, could the strings be in UTF-8? UTF-8 can
be handled transparently in many ways but this is not one of them. If
treated simple-mindedly, the "*" after a multi-byte Unicode character would
expect the repeat of its last byte, and a "." would match a single byte of
a multibyte character. Specifically for UTF-8, this problem
can be handled by keeping track of the continuation. Every byte of a
multi-byte character in UTF-8 has the high bit 0x80 set, and the first one of
them also has the bit 0x40 set, so you can parse where the character
starts. I've looked up the exact bitmasks on the Internet; it's not
something that I'd expect anyone to know off-hand at the interview, nor the
details of the exact UTF-8 encoding, but the general principle that UTF-8
is easy to parse is reasonable knowledge. However, this kind of parsing is a
bad idea anyway, because it would mess up the handling of the strings in
the single-byte 8-bit encodings. A much better approach is to use the
library functions to either convert the whole strings to wchar_t (requires
the allocation of extra memory) or to parse the strings
character-by-character as they get read (better). This is important to
mention to show that you understand the multi-byte encoding issues, and can
bring you bonus points at an interview. But the problem itself is not about
that, and the multi-byte support can be trivially added after the general
solution is found, so let's assume that we don't need to care about it
for now, and the strings are in ASCII.
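For reference, the bit rules mentioned above boil down to a two-line check
(my sketch, not part of the solution):

#include <cstdint>

bool is_continuation(uint8_t b) {
    return (b & 0xC0) == 0x80;  // 10xxxxxx: not a character start
}

bool is_char_start(uint8_t b) {
    return (b & 0xC0) != 0x80;  // ASCII (0xxxxxxx) or lead (11xxxxxx)
}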

The first solution goes in the most complicated way I can think of. You
might have attended a college course on parsing that talked about the
finite machine matcher for the regular expressions. The most unsophisticated
approach is to push this way blindly.

Each node of the finite machine graph could be a plain array of the
possible exits from that node, but to make things easier to write, let's
use a C++ map:

using namespace std;

struct Node {
    map<char, Node*> exits_;
};

The parsing of the pattern will start like this:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    Node *parsed = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        ...
    }

}

Since we're dynamically allocating the Nodes, we need to take
care of freeing them too. A simple way to do it is to use the reference
counting in the form of shared_ptr or unique_ptr. But we can't store them
right in the Nodes, because the Nodes may have circular references. The
unique_ptr will just refuse to do so (if you insist with std::move(),
you'll get NULL pointers in the unexpected places, and will probably
crash), and shared_ptr would create a link loop and leak the memory. It's
possible to arrange the entries in a plain list, and then free them at the
end manually. But the easier way is to keep the owning references
in a separate vector, and let the nodes connect in whatever ways with the
plain pointers. And these should be the real pointers, not the weak references.
The weak references are needed when you might need them to become strong at
some point. Here we have no such need, so save the overhead.

We can also use the holder to find the first and last allocated node, so we
don't need the separate parsed pointer.

What should the loop do? Since the Nodes contain the map of exit links and
not entry links, on each literal character in the pattern we'll need to
create a new node, and then update the _previous_ node with the exit link
to the new one. So we need to start by allocating one initial node before
the loop, to always have a previous node.

When we see a '.', we populate the map for every character code except '\0' to
point to the next node. If we had to deal with the wide characters, this
wouldn't work, because the number of characters would be too great. Then
we'd have to add a special "default" link. But for ASCII it's good enough.

When we see '*', we don't allocate a new node but copy the exits of the
previous node into the last node, making it point to itself (so we should
also keep a pointer to the node before last).

And when we see a '\0', we will add to the last node an exit that maps the
'\0' to a NULL pointer, this being a signal to stop. We could also add a
separate boolean field final_ for this purpose, but a null pointer is just
as good, and safer in the way that we wouldn't need a separate code path
for checking the finality, and will always check all the pointers for NULL.

Here is the loop fleshed out:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->exits_[(char)c] = holder.back().get();
                }
            } else {
                previous->exits_[*p] = holder.back().get();
            }
        }
    }
    holder.back()->exits_['\0'] = nullptr;

}

This code has to deal with some situations that "shouldn't happen", the
invalid patterns that either start with a "*" or have multiple "*" in a
row. This is something that needs to be called out explicitly. Two ways to
deal with it are to either return some kind of an error indication (such as
throwing an exception) or to just process them in some not officially defined
way. This code takes the approach of the silent handling, simply because
it's easier to do in a small code snippet. From the caller's standpoint it
might be either good or bad: the good is that the caller won't have to
handle the errors, the bad is that the author of the pattern might be
surprised by its effect. Whatever choice you make, you need to be aware
of this trade-off. The silent handling here is that the "*" at the start is
treated as a literal, and multiple "*" in a row are treated as one.

This code is also not correct in the handling of "*", in two ways! How do I
know this? I tried to run through it manually on the examples of what I see
as the corner cases: "", "a", ".", "a*", "a*b", "a*a", ".*", ".*a" with the
matching and mismatching strings, and I actually tried the more complex
cases first.

The first defect is that it doesn't handle the greediness right. A pattern
like ".*a" will not be unmatchable; instead it would mean what the full
regexps can express as "[^a]*a". Which might be not bad by itself but
isn't what the problem statement says. Worse yet, due to the same mistake
the pattern "a*a" will be equivalent to a single "a", which is outright
wrong. The fix for this problem is to check whether the map value at this
key has already been set, and if so then don't overwrite it:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (exits_.find(c) == exits_.end()) {
            exits_[c] = to;
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                }
            } else {
                previous->set_exit(*p, holder.back().get());
            }
        }
    }
    holder.back()->set_exit('\0', nullptr);

}

The second problem is even worse, "*" ends up meaning "1 or more
repetitions" instead of "0 or more repetitions", like the operator "+" in
the full regexps. To fix that, on the next character after a "*" we need to
add the new link to both the previous and pre-previous node.

The state machine graph for the expression "a*b" used to look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X

Now it will look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X
 \         /
  +-------+
     b

The mistake I've made here is that I haven't started writing out the
explicit graph examples until after finding the bug. It could have been
prevented by thinking through the graphs from the start.

The code change is:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;
    Node *before_star = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            if (before_star == nullptr) {
                holder.back()->exits_ = previous->exits_;
                before_star = previous;
            }
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                    if (before_star) {
                        before_star->set_exit((char)c, holder.back().get());
                    }
                }
            } else {
                previous->set_exit(*p, holder.back().get());
                if (before_star) {
                    before_star->set_exit(*p, holder.back().get());
                }
            }
            before_star = nullptr;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (before_star) {
        before_star->set_exit('\0', nullptr);
    }

}

This finally took care of the pattern parsing. Mostly. It still has a bug
with handling the sequences of starred characters, such as matching the
pattern "a*b*c" to the string "c". It will not accept "c" because at each
point we update only one previously starred node, not the whole sequence of
them. This is pretty tricky code, and I didn't even notice this error
until I started doing this problem the next, easier, way, and noticed it
there, and then realized that the first version had this issue too.

As it has turned out after I drew the graph and traced the code, it's even
worse than that. In the "a*b*c", it would accept any mix of "a"s and "b"s,
rather than strictly "a"s followed by "b"s, caused by the copying of all the
links from the node before star.

Well, let's fix it. The graph for "a*b*c" used to look like this:

 
      a      a,b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
 \     \     /       /
  +-----\---+       /
   b     +---------+
              c

It should look like this:

 
      a       b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
|\     \     /       /|
| +-----\---+       / |
|  b     +---------+  |
|             c       |
 \                   /
  +-----------------+
          c

Whenever there is a sequence of starred characters, there has got to be a
link from before each starred node to each following starred node, and to
the node past the stars. The only way I could figure this out was with the
graph diagram.

This is very difficult to do by just following the graph, because there is
no indication in the graph of which link goes to the following node and
which links go to the other nodes, including the node itself.

The better way to resolve it is to notice that we're appending the nodes to
the vector sequentially anyway. So we can use the vector order to go
through them in sequence. Thus the variable before_star turns from a
pointer into a vector index:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (c == '.') {
            for (int i = 1; i <= 255; i++) {
                if (exits_.find((char)i) == exits_.end()) {
                    exits_[(char)i] = to;
                }
            }
        } else {
            if (exits_.find(c) == exits_.end()) {
                exits_[c] = to;
            }
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    int first_link = 0;
    char last_char = '\0';

    enum State {
        AFTER_STAR,
        AFTER_LITERAL,
    };
    State state = AFTER_LITERAL;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && holder.size() > 1) {
            holder.back()->set_exit(last_char, holder.back().get());
            state = AFTER_STAR;
        } else {
            last_char = *p;
            if (state == AFTER_LITERAL) {
                first_link = holder.size() - 1;
            }

            holder.push_back(make_shared<Node>());
            for (int i = first_link; i < (int)holder.size() - 1; i++)
                holder[i]->set_exit(*p, holder.back().get());

            state = AFTER_LITERAL;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (state == AFTER_STAR) {
        // the nodes of a trailing starred sequence may also end the match
        for (int i = first_link; i < (int)holder.size() - 1; i++)
            holder[i]->set_exit('\0', nullptr);
    }

}

The setting of all exits on a "." became used in multiple places, so it
moved into set_exit(). The parsing of the star syntax became more
complicated, requiring us to remember the last character, and to keep the
state for detection of where the sequence of starred characters ends.
The "classic" way of state handling is to put the state processing as a
switch statement in the loop but that would have made the code way too
painful and repetitious, so I've decided to skip that example of doing
things per an unsuitable pattern and move on right to the more
straightforward version above.

Now let's do the pattern matching:

bool match(const char *pattern, const char *text) {
    // ... skipped the pattern parsing ...

    Node *cur = holder.front().get();
    for (const char *t = text; cur != nullptr; t++) {
        auto it = cur->exits_.find(*t);
        if (it == cur->exits_.end())
            return false;
        cur = it->second;
    }
    return true;
}

It's done, but it hasn't been a small feat, with all the memory management
thrown in, and all the opportunities to make the mistakes in the pattern
building. It can be powered through as a show of brute force, but I don't
think that I've done it all in under an hour, which is the usual interview
time slot.

I hope that at the very least by the time I've started talking about
building the holder vector, you've got a better idea: this vector goes in
parallel with the original pattern string, so why not just work directly
with the pattern string? Indeed, this is a much easier way. But it also has
its harder and easier versions.

Again, let's start with the harder version. The loop will be working in
very much the same way as the matching loop in the parsed-pattern version
(as some textbooks would teach), but will read the pattern directly from
the string.

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points after the last literal
            // character but possibly before the following star

            if (*p == '*') {
                if (last_pattern == '.' || *t == last_pattern) {
                    if (*t == 0)
                        return true;
                    break;
                }
                ++p;
            }

            last_pattern = *p;
            if (*p == '.' || *t == *p) {
                if (*t == 0)
                    return true;
                ++p;
                break;
            }

            if (*p == 0)
                return false;
            ++p;
            if (*p != '*')
                return false;
        }
    }

    return false; // never reached
}

The "for" loop here is interesting, without an exit condition. This is
because the matching of '\0' follows the common pattern, and is done in the
same loop code, with the only addition being the check for the end of lines
when we detect a match of the literals. Otherwise, if we'd checked the text
and pattern for EOL in the "for" loop condition, we'd have to repeat the
inner "while" loop once more after the "for" loop, to account for the case
when the pattern ends with a sequence of starred characters.

The inner loop is necessary to handle the sequences of multiple starred
characters, such as "a*b*c" matching the "c". If we don't do the loop, "c"
would get compared to "b" and the match will be considered failed. Since
this code is much smaller and simpler than the parsed-pattern version, this
becomes much easier to notice and implement - just add a loop!

Speaking of the stars, this version takes a different approach to the
improperly placed stars in the pattern, just because it was easier to do
this way. The leading star gets ignored, and the second star in a row gets
treated as a literal. Which is OK, since this is the behavior for the
undefined patterns (unless you're writing code that must be compatible with
some existing library and need to repeat its quirks).

This code also contains a bug. It happily matches a '.' in the pattern
(followed by anything) to the end-of-string '\0' in the text. To fix that
bug, the conditions of the form

if (*p == '.' || *t == *p)

need to be replaced with

if (*t == *p || *t != 0 && *p == '.')

There is another way to take the same general approach but keep the main
loop more classic, by using a function for the internal while loop, and
then this function can be easily called twice. It isn't any better or
worse, just different (and I've included the bug fix into it):

bool match_one(char &last_pattern, const char *&p, char c) {
    while (true) { // for starred sequences in pattern
        // at this point p always points after the last literal
        // character but possibly before the following star

        if (*p == '*') {
            if (c == last_pattern
            || c != 0 && last_pattern == '.') {
                break;
            }
            ++p;
        }

        last_pattern = *p;
        if (c == *p || c != 0 && *p == '.') {
            ++p;
            break;
        }

        if (*p == 0)
            return false;
        ++p;
        if (*p != '*')
            return false;
    }

    return true;
}

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; *t != 0; t++) {
        if (!match_one(last_pattern, p, *t))
            return false;
    }
    if (!match_one(last_pattern, p, '\0'))
        return false;

    return true;
}

And as yet another variation of the same, you could also look ahead for the
stars after the pattern and save this state in a separate variable:

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    bool last_starred = false;

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points before the next literal
            // character

            if (last_starred) {
                if (last_pattern == '.' || *t == last_pattern) {
                    break;
                }
            }

            last_pattern = *p++;
            last_starred = (last_pattern != 0 && *p == '*');
            if (last_starred) // skip over the star
                ++p;

            if (*t == last_pattern
            || *t != 0 && last_pattern == '.') {
                if (*t == 0)
                    return true;
                break;
            }

            if (!last_starred)
                return false;
        }
    }

    return false; // never reached
}

I think that this version is much simpler, and in the real world if I
had started writing the previous version, I would have never finished it
and would have switched to this one.

But even this version is not great. Having to keep state feels like too much
work. It would be much easier to go the other way around, iterate through
the pattern and consume the matching characters from the text. Then the
state will become embodied in the code rather than variables, and will be
simpler:

bool match(const char *pattern, const char *text) {
    const char *t = text;
    for (const char *p = pattern; *p != 0; p++) {
        if (p[1] == '*') {
            if (*p == '.') {
                while (*t)
                    ++t;
            } else {
                while (*t == *p)
                    ++t;
            }
            ++p; // adjust to consume the star
        } else if (*p == '.') {
            if (*t++ == 0)
                return false;
        } else {
            if (*t++ != *p)
                return false;
        }
    }
    return *t == 0;
}
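And here is a quick sanity check (my own addition, not from the original
post), running this final version through the corner cases listed earlier:

#include <cassert>

int main() {
    assert(match("", ""));
    assert(!match("", "x"));
    assert(match("a", "a"));
    assert(!match("a", "aa"));
    assert(match(".", "x"));
    assert(!match(".", ""));
    assert(match("a*", ""));
    assert(match("a*", "aaa"));
    assert(match("a*b", "b"));
    assert(match("a*b", "aab"));
    assert(!match("a*a", "aa"));  // greedy star makes this unmatchable
    assert(match(".*", "whatever"));
    assert(!match(".*a", "xa"));  // ".*" consumes the whole string
    assert(match("a*b*c", "c"));  // the starred-sequence corner case
    return 0;
}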

This version is much smaller and much easier to follow through. It does an
explicit selection by the type of each pattern element, so each one of them
has its own code fragment and not the logic that is spread through the code
and mixed with the logic of the other elements. And all this makes the
creation of bugs more difficult.

A slightly special thing about this version is that it looks ahead by two
characters, not just one, to detect that the current pattern character is
followed by a star. It's not something that's usually taught but it makes
the code a lot easier.

This version also has a theoretical foundation: it's a recursive-descent
LL(1) parser of the text. Except that the regular expressions define a
non-recursive language, so there is no recursion. It really is perfectly
per textbook, you've just got to pick the right textbook! It also parses
not a fixed grammar but one given in the regular expression pattern. So
it's an LL(2) parser of the pattern, with the nested LL(1) parsers of the
matching substrings in the text. The 2 in LL(2) means that we're looking
ahead by 2 characters. It can also be parsed by an LL(1) parser as has been
shown before but looking ahead by 2 characters makes it easier.

It is the version that kind of came to my mind almost right away when I
first thought about this problem. But I can't really say that it just pops
into my mind out of nowhere. I do size up the different approaches in my
mind, and try the ones that look simpler first. It doesn't mean that this
first estimation is always right. Sometimes I go pretty deep along one
approach before deciding to abandon it, and applying the lessons learned to
another approach. And sometimes this other approach ends up even worse,
but the lessons learned there help to get through the logjam on the first
approach.

So if you start with the poor approaches, you can still arrive at the
better ones by listening to the hints that the code gives to you as you
write it. When you see an easier way to go, use it. You can also power
through the difficult approaches to the successful end but that tends to be
much more difficult than switching the approach to a simpler one.

Thursday, October 25, 2018

how to reinvent the bicycle

I've been recently conducting some job interviews, and met a spate of junior engineer candidates with a similar issue: they could quickly come up with a pretty good overall idea of the solution, and they can write code as such, but they would fail to translate this overall idea into the code. They couldn't divide the overall idea into components and then step by step work through the details and interdependencies of these components and sub-components, even with intense hinting from my side.

This is something that should be done almost mechanically, with little mental effort. And yet they could not do it; they try to do it by intuition, but their intuition is not strong enough to handle this complex problem, and they don't know how to use the mechanical approach either. The hints don't help much because they don't trigger the right mechanistic associations.

The problem I ask is actually quite difficult, too difficult for a perfectly adequate junior-to-midlevel engineer, and I'm not sure if I myself would have solved it well some 20 years ago (I know that I can solve it now, it came from my real-life experience where I had to do this thing from scratch real fast). So I don't expect a good solution from this category of candidates, a so-so solution is plenty good enough for them (although some of them do very well, producing a fully completed optimal solution). But there is a difference between producing a complete solution that is not that good and getting the right overall idea, figuring out the conceptual parts that I consider difficult and important (that the first group never figures out), only to fail miserably to work out all the details necessary to write the code. Not that the code doesn't get produced at all (though sometimes it doesn't), but what gets produced has glaring holes and is much worse than the code produced by the first group. Interestingly, I see a good deal more women than men in this second group. The first group is, I think, about even between men and women; this sounds like a statistical impossibility but this second group is small compared to the first one and doesn't shift the overall averages that much. Granted, the sample size is small, so it might well be a statistical fluctuation, or maybe not.

I don't want to discuss the exact problem I use, because I don't want it to spread over the internets, making me invent another one. But I've come up with a good analogy that illustrates the issue. Some time ago I read about artists who would ask people to draw a bicycle, and then would produce a bicycle in the drawn shape as an art object. It's suitable to be only an art object because it's completely non-functional. If I were to draw a bicycle without thinking, I would also produce something like that. But spending some thought, any engineer should be able to reproduce a proper bicycle from the general logic: the function of the main components (wheels, steering, seat, pedals, chain) and the general considerations of the strength of the frame that shape it. The same logic can be used to check that none of the main components were forgotten: for example, if you forget about the chain, the pedals would be left disconnected from the rear wheel, so you'd have to recall it. Each component might be very non-trivial (the said chain took a long time to invent), but once you know the components, it should be impossible to put them into a wrong place. If an engineer figured out the chain but couldn't put it into the right place, there would be something very wrong with this engineer.

The problem, I think, is that people are not really taught to do this kind of thinking in programming. The books and college courses describe the syntax of the programming languages and the general picture but leave a void between these layers. People learn this on their own from the examples and practice. But the examples and practice tend to train the intuition, and people are left to figure out the mechanistic approach on their own, and they either figure it out or they don't. It looks like quite a few of the generally smart people either don't or take a long time to develop it. Not to say that there is anything wrong with intuition, it's my favorite thing, but the mechanistic approach allows you to stretch a good deal beyond the immediate reach of intuition, and to grow the future intuition.

I've recently seen a question on Quora, approximately "As you gain more experience, do you still write code that works but you don't know why?", and this I think is exactly the difference between the intuitive and mechanistic solutions. The intuition might give you some code that works, or possibly doesn't. The mechanistic approach lets you verify that what the intuition provided actually does what it's supposed to do, and provides the stepping stones for the intuition to go further: both to fix what is going wrong, and to do the more complex multi-leap designs.

By the way, the functional programming in Haskell and such seems to be an area where people are exhibiting this issue in a particularly exaggerated way: they write the high-level code hoping that the compiler will do the mechanistic work for them, and the compiler does but often not in the way they expect.

Programming is not the only area with this kind of a teaching problem. I think math has the same issue. The way the proofs of various theorems are taught is usually not the way the authors discovered them. These proofs get edited and adjusted a lot to make them look easier to understand. But this loses the teaching aspect of how to create the new proofs.

So, how would you teach it? Perhaps by walking through the solutions of the complex problems, showing step by step how the initial general ideas get gradually solidified, deducing the requirements for their elements, and recursively for the sub-elements, resolving the conflicts along the way, with setbacks and retries.

The bicycle example suggests that there is probably a general transferable skill too, and this skill can be trained by the puzzle games like the classic "The Incredible Machine" where the goal is to build a Rube Goldberg contraption to accomplish the particular goal from a set of components. Like in the real life, the tasks there might include the extra components that look useful but don't really work out, or provide multiple ways to reach the goal. This is of course different from programming in the way that you need to achieve only one exact goal, while in programming you have to solve a whole class of related goals that include the corner cases, but still is a good point to start.

So the better way to teach for programming would probably be a combination of the "lecture material" and of the puzzles where the students have to build the working code from a number of provided hints.

I actually know an old book that does something like this: Jon Bentley's "Programming Pearls". It's built around a somewhat different idea but maybe it's close enough.

Thursday, October 11, 2018

simplicity is in the eye of beholder

I've been recently discussing a design with someone. We both wanted the design to be simple but disagreed about the details. Eventually I realized that we wanted different kinds of simplicity.

For him, the simplicity was in the compact protocol messages. And if the endpoint logic to handle them becomes arcane, so be it.

For me, the simplicity was in the simple logic of the endpoints, and if the messages become larger and exchanged more frequently, so be it. (There are of course cases where the bandwidth matters but here the control messages represented a minor overhead either way).

Monday, March 19, 2018

Comments and names

I've read an opinion that if the names of the methods are self-explanatory, the comments describing these methods are not needed.

Well, sometimes it's so, but in general I strongly disagree with this idea. The names and comments have different purposes. The point of the name is to give the method a designation that is difficult to confuse with the other methods. The point of the comment is to explain the meaning of the method. Mixing them doesn't do well.

The attempts to make the names more self-explanatory backfire in two ways: the code becomes longer, and the names become less distinct. And the worst part, it still doesn't explain enough: you can't understand such an uncommented method until you read its implementation. A good comment must explain the various considerations that went into the development of the method, its pre-conditions and post-conditions, and the exact meaning of the parameters. There is simply no way to fit all that into a name.
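Here is a contrived sketch of the difference (my own illustration, with made-up names and types):

#include <vector>

struct Order;  // a hypothetical type, for illustration only

// The "self-explanatory" name tries to carry the whole contract,
// grows unwieldy, and still leaves most of the contract out:
std::vector<Order> getOpenOrdersForAccountSortedByTimeDescending(int accountId);

// A short, distinct name plus a comment carries much more:

// Returns the account's open orders, newest first.
// Precondition: the account exists; throws std::out_of_range otherwise.
// The result is a snapshot, not updated as the new orders arrive.
std::vector<Order> openOrders(int accountId);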

In a way, a method name is similar to an acronym in a text. Or, well, not necessarily an acronym but some short meaningful name. The first time you use this acronym or word, you explain what it means. When you define a method, you define what it means in the comment, and give it a short name. Then, as you'd use the word in the rest of the text, you use the method name in the program. And if the reader forgets what it means, he can always look up the explanation. If you used the full multi-word definition throughout your text, it would grow completely unreadable.

Saturday, February 4, 2017

The mythical 10x programmer, or the silver bullet explained

Recently I've been talking about some of the experiences from Aleri times when my interlocutor responded "so I guess you're one of those 10x programmers". This made me pause. Like any programmer per Larry Wall's recipe, I have a good amount of hubris but can I really say that I'm a 10x programmer? After some thinking, I guess sometimes I am and sometimes I'm not. This got me thinking further, about what was different between the situations when I was and when I wasn't. And I think the difference is very much about the organization and methodology, and thus can probably be applied to any programmers to improve their productivity. I've tried to distill the knowledge, and here it goes. By the way, Fred Brooks also called the goal of 10x productivity increase the "silver bullet", so here is maybe your silver bullet. Maybe. I'm not entirely sure that the recipe will work out of the box, it will probably need at least some debugging. Or maybe it won't work but it's still worth a try.

Let me start with describing a couple of cases when I could fairly be said to have been a 10x programmer. There were some other good cases too but on a couple of occasions I've actually had a benchmark.

One case was at Aleri, working on the CEP system. Here is the real short version of the events: at the time there were 3 main players in the CEP world: StreamBase, Aleri and Coral8, all having an approximate parity in their products. Around 2009 Coral8 apparently ran out of money, so Aleri used this opportunity to consume it. And then it had two approximately equal products and set about merging them. The merging didn't go that great, and Aleri ran out of money too, eventually consumed in 2010 by Sybase. See, even though the Coral8 product was about the same from the functional standpoint, it had a code base about 10x the size, and had maybe not quite 10x but a lot more people working on it. Since the Coral8 code base was so much larger and there were many more people familiar with it, it was taken as the foundation of the merged product. And this merging then crashed and burned. Another merging attempt was then done under Sybase, this time the other way around, and it worked much better. So it's an interesting example where the 10x was there, then it wasn't, and then it was again.

Another case was UnitedLinux. The full history of UnitedLinux is an interesting story of dog-eat-dog in the world of the Linux vendors but I won't go into the full funny details, just touch on the important parts. One of the conditions of the UnitedLinux consortium set by SuSE was that SuSE does all the general development. Everyone else does just their branding customizations and additional packages. So Caldera, where I worked at the time (soon to be renamed SCO, with the SCO Linux brand of UnitedLinux), had closed its Linux development office in Germany, with most of the people heading to SuSE. When UnitedLinux was released, the various consortium members (I remember Connectiva, and whoever else) held their release parties and posted pictures with great many people. We didn't have any party at SCO: see, the SCO Linux was done by only two developers and half a tester (and a bit of a graphical artist), and Ron was in California while I was in New Jersey, so having a party would have been difficult. And we had it ready earlier: the final image was sent to testing about 20 minutes after downloading the drop from SuSE. I think I still have the commemorative t-shirt somewhere; one thing Caldera was big on was the t-shirts.

So what was special in these cases? In a very short summary, I'd say that it was the compact code done with the familiar technologies, incrementally, with low interdependencies, and with a high level of script automation. Oh, and you need to have the good engineers as well. But just the good engineers aren't enough: I wouldn't say that the engineers at Coral8 were any worse than at Aleri, they just got mired in the slow methodologies.

Let's look at the details.

The compact code pretty much means the straightforward and to-the-point code. There is a classic book "The Practice of Programming" (and I liked it so much that I've patterned the name of my book after it) that shows the code like this: when you have a problem, solve it in the most straightforward way. There are lots of published systems of various code decorations, and all of them are wrong. The decorations don't make the code better, they make it worse, more difficult to read and modify. The extra layers of abstractions don't help anything either, they make the code more difficult to both understand and modify (not to mention writing it in the first place). There is an article about "worse is better" written by Richard Gabriel when he discovered that the simpler Unix code worked better than the "better" over-abstracted code of the MIT school. But this phrase is misleading: better is better, it's just that the over-abstractions are worse, not better. This absolutely doesn't mean that the code should be flat abstraction-wise. The code that is too flat is also an unmaintainable spaghetti mess. The abstraction layers are needed. But they have to be introduced for a good reason, not on a whim and not by the bureaucratic rules. The good functions are usually pretty large. The good classes are usually pretty large. The good rule of thumb for making a chunk of code into a new abstraction is to notice when you do the same thing over and over again. Then it's time to put this code into one place and call it from the other places. A good abstraction reduces the size of the code, not increases it.

The problem with the overly-abstracted code is really the same as with the  under-abstracted flat code: there are only so many things you can keep in your head at a time. In the code that is too flat you need to keep too many interdependencies of that flat level in your head. In the code that is too abstracted you need to keep too many abstractions and their APIs and calling patterns in your head. The optimal spot is in the middle.

A lot of things are easier to just do than to use a library. If you do the logging from a shell script, a 10-line shell function will not only work much better than a shell version of log4j but will also be a lot easier to call (not to mention avoiding an extra dependency). Or an iteration over a container with a plain for-loop is not only easier to write and understand but also much easier to remember than the STL templates for that.
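As a tiny illustration of that last point (my example, using nothing beyond the standard library):

#include <algorithm>
#include <map>
#include <string>

// The plain loop: written from memory, readable at a glance.
int count_long_names(const std::map<int, std::string> &users) {
    int n = 0;
    for (const auto &kv : users) {
        if (kv.second.size() > 16)
            ++n;
    }
    return n;
}

// The same thing through the library algorithm: works just as well,
// but getting the pair type and the lambda right sends most people
// to the manual.
int count_long_names2(const std::map<int, std::string> &users) {
    return (int)std::count_if(users.begin(), users.end(),
        [](const std::pair<const int, std::string> &kv) {
            return kv.second.size() > 16;
        });
}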

An important thing is to keep the documentation close to the code. This means both the good comments in the code and the user documentation.

The basic rule behind the good comments is very simple: the comments must be one (or sometimes more) step more abstract than the code. A comment should be explaining either why something is done (or not done), especially if the reasoning is not obvious, or telling what is done as a short summary in the higher-level terms. As long as this rule is followed, the more comments the better (but there is only so much you can write without breaking this rule).

Keeping the user documentation together with the code, in the same versioning system and in the same tree hierarchy, is very important. This way the documentation can be updated in parallel with the code and never becomes obsolete. And at the end of the release cycle all the changes can be found and re-edited. This principle implies that the documentation must be kept in a text-based format that is easy to diff, just as the code is, and that the tools used on it must not change that exact text format at random. Which generally means that the editing should be done with a normal text editor, and the WYSIWYG tools should go into the garbage bucket. The pain inflicted by them is much greater than the gain: the ease of versioning and diffing is much more important than the ease of visual formatting.

While on the subject of documentation, the technical writers need to either really be technical or be kept in check. The precise meaning of documentation is important, and I've seen too often an edit that makes a sentence smoother also make it ambiguous or entirely meaningless. And "smoother" is not always "better" either. I've bought a book of Richard Feynman's letters, more for the collection purposes since letters are usually fairly boring, but unexpectedly it also turned out to be a great read. Except for the part where it included the magazine articles about his Nobel prize. The style of those articles was completely different from Feynman's, it was the usual unreadable journalist drivel. Don't turn your documentation into the journalist drivel.

To give another example of where I think smooth is bad, the sentence above "An important thing is to keep the documentation close to the code" is better in that place than "Keeping the documentation close to the code is important". The complex sentence slows you down and gives you a breather before the text switches to a new subject, emphasizing that switch. Also, it brings the word "important" closer to the start of the sentence where it's more noticeable.

The next principle, the familiar technologies, affects the speed of development a lot. It's kind of obvious, but the counter-point is "if we use that latest and greatest new tool, we'll be done much faster". And sometimes the latest and greatest tool really does that. But most of the time the familiarity is much more important. If you have an idea of how the goal can be straightforwardly achieved with the familiar tools, that's usually the best bet by a wide margin. If not, learn the new tool. Sometimes this would also give you an idea of how the goal can be achieved more easily with your old tools. And a lot of technologies are just crap anyway. So the overall priority should be not "the goal for the tool" but "the tool for the goal".

This doesn't mean that the new tools shouldn't be learned, or that everyone has to be familiar with every tool to be used. Usually one person familiar with a tool is enough to frame the code around it, giving everyone else a big head start in learning it. It's much easier to add code to an existing unfamiliar framework than to build a new framework with an unfamiliar tool. And there are times to use the familiar tools and times to learn the new ones. There is a cadence to development: things start amorphous and slow, and then, as the solutions for various items fall together, the project starts crystallizing and picking up the pace until it rolls at full steam to the goal. Then the next cycle starts. The slow part of the cycle is a good time to learn the new tools. But things tend to go a lot faster when only one thing is new: a new tool for an old purpose, or an old tool for a new purpose. Of course sometimes a bullet has to be bitten and a new tool learned together with the new purpose, but that goes slower.


On to the incrementality. It's nothing new: Eric Raymond wrote about it in "The Cathedral and the Bazaar". But it still gets lost surprisingly often, even in the "agile methodologies". Instead of "we'll make this great new thing to replace that dingy old thing", the right approach is "we'll transform this dingy old thing into this great new thing". The programs are not physical objects, they're more like blueprints for the physical objects. It's not like you have to raze a bridge to build a new bridge in its place. Instead, a program's run is more like the physical object: every time you restart a program, an old bridge gets razed and a new one gets built for you according to the new plan. Another good example from the world of the physical things is the automobile manufacturing: the car models stay around but their internal mechanics are updated pretty regularly. At some point the cars would still look the same but from that point on would get, say, a better alternator. The Chevy small block V8 engine has been around for more than 50 years. The only thing that has stayed the same for all this time is the dimensions of the cylinders. Everything else has been replaced. But it hasn't been replaced in one go, it has been the product of a long evolution with many gradual small changes.

Every time you hear about an order-of-magnitude budget and time overrun, it's when someone tried to re-do a program from scratch. Fred Brooks talks about this in his "Mythical Man-Month", so the effect has been known since at least the 1970s, and yet it keeps being repeated. And guess what, it's not limited to software either. Have you ever heard of Tucker? It was a new automobile company in the 1940s that decided to design its car completely from scratch, with all the newest and greatest technology. They never got the cars working; they kept falling apart until the company ran out of money, and its founder got sued by the government over the dubious financial activities he'd used to delay running out of money. Here someone might ask: "but what about the iPhone, what about Tesla?" They are actually examples of the evolutionary development. The iPhone was built around the clever idea of a big-screen phone used as an application platform, but it wasn't developed all at once, and Apple didn't try to reinvent the cell communications chipsets at the same time. Some of the evolutionary moments of the iPhone are widely known: they started with a stylus, as was then typical, until Steve Jobs said "screw this, use the fingers instead". And the last-moment decision to replace the plastic screens with glass has been the subject of many moralizing articles. Tesla is best seen in contrast to GM's EV1. GM tried to develop the whole technology chain from scratch, spent a huge amount of time and money, and failed. Tesla piggybacked on the already existing technologies and packaged them together into a car (a car that they also took off Lotus's shelf). Even then Tesla's path wasn't easy: they had their Tuckeresque moments with the failing transmissions (which they eventually simply got rid of) and with the battery cooling for a few years before they were able to release their first car.

One of the worst statements I've heard about why a system needed to be rewritten from scratch went like this: "it has accumulated all these little features that make making changes without breaking them difficult, so we have to re-write everything to get rid of them and make the changes simple again". It completely misses the point: it's these little features that make the product worthwhile. Doing 80% of the functionality is easy, but nobody needs only 80% of the functionality (and in reality that particular project overran by years). The code is a living document that preserves the knowledge of the subject area. Discarding the code means discarding this knowledge, which then has to be re-learned. This doesn't mean that the best solution isn't sometimes to throw away the old code and re-write it from scratch. But it's usually done by replacing the code in small chunks and checking that the functionality is preserved after the move of every chunk. To give an analogy from biology, the human (and animal) organs develop by growing the new cells within a scaffolding of the neighboring cells that shapes them and makes them properly differentiate for their function. The old code provides such a scaffolding for the new chunks.

I want to illustrate this last point with a couple of examples. I'll leave the company and all the details in the first example unnamed, since the example is kind of embarrassing. This company decided to make an appliance with its software. It had all the software bits and pieces that it had been selling to the customers, so the only thing that needed doing was to put it all together, like the customers do. Simple, right? Turns out, no: the bits and pieces would not work together, and no customer had actually put them all together, because they didn't work. Each piece worked in a way, so each responsible team was happy, but not together. And getting them to work together took a large amount of time and work. Another similar example comes from Microsoft, where I've seen a very similar issue with the PowerShell support in the Windows clusters: there are many components involved, and each one has its own PowerShell cmdlets for management, but they don't mesh well with each other, need to be combined in very non-obvious ways with no big goal-oriented cmdlets, and some things turn out to be not really doable at all.

The obvious objection is "but how do we test all these changes?" This is where the automated testing comes in, and it's such a big subject that I wrote a whole separate post about it. But it's really a bogus question. When you write a completely new system, you still test each chunk of code as you write it, and if you don't have the automated testing then you just test it manually. The difference is that in an incomplete system your chunk will have only limited interactions, so you will be able to test only some of its effects. In a complete system you get a much more complete coverage. And if the old system doesn't have any automated tests, then this is a good way to start them: for each chunk, write the tests before replacing it, and then use them to check that the replaced chunk still works in the same way (a small sketch of this follows). Another important point: a chunk doesn't have to be all in one place, it might be distributed, such as a one-line change in a thousand files.
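
A minimal sketch of that pin-down-then-replace approach, with a made-up function as the chunk being replaced: the test captures the current behavior, including the odd corner cases, before anything gets touched.

#include <cassert>
#include <string>

// The old chunk that is about to be replaced (a hypothetical example).
std::string joinPathOld(const std::string &dir, const std::string &file)
{
    if (!dir.empty() && dir[dir.size() - 1] == '/')
        return dir + file;
    return dir + "/" + file;
}

// Step 1: pin down the current behavior before touching anything.
void testJoinPath()
{
    assert(joinPathOld("a", "b") == "a/b");
    assert(joinPathOld("a/", "b") == "a/b");
    assert(joinPathOld("", "b") == "/b"); // odd, but it's the current behavior: preserve it
}

// Step 2: write joinPathNew(), run the same test against it,
// and only then delete the old version.

int main() { testJoinPath(); return 0; }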

The small changes don't have to go together with the frequent releases. But every time you do a release, one way or another you get the issue of "we have to complete all the changes before the release!", and it gets worse if you want to do more frequent releases. Well, you definitely do want to make sure that things don't get accidentally broken at the last moment, so the commits have to slow down just before the release. But that doesn't mean that you can't include the changes at all. Any change is fine to include as long as it doesn't break the existing functionality. The new functionality doesn't even have to work: as long as it doesn't get called in the production environment, nobody cares, it only has to not break the old functionality. This way you can integrate the changes a lot earlier, and then spend a lot less effort merging them together.
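
A minimal sketch of committing the unfinished work early (all the names here are made up): the new path is in the tree and merges cheaply, but nothing calls it in production, so it can't break anything.

#include <cstdlib>

int computeOld(int x) { return x * 2; }

// The half-done replacement: committed early so that it integrates cheaply,
// but reachable only through an explicit opt-in.
int computeNew(int x) { return x << 1; /* TODO: handle overflow */ }

int compute(int x)
{
    // The opt-in is off by default; production always takes the old path.
    if (std::getenv("USE_NEW_COMPUTE") != NULL)
        return computeNew(x);
    return computeOld(x);
}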

In the post on testing I've already said that an important part of the incrementality is the incremental tests: the ability to easily run the subset of tests affected by a change. But a prerequisite for it is the incrementality of the build: the ability to quickly build all the components dependent on the change, all the way to the final package, while skipping the build of the unchanged parts of the tree. "Quickly" here means right away, not in some kind of a nightly build. This does wonders for the turnaround time. Obviously, in some cases, like a fundamental library, pretty much all of the tree will depend on it. But then the incrementality should be interpreted as the ability to build a few representative projects based on it, to test that they didn't break. If the full nightly test shows that some dependent components got broken, the ability to rebuild them quickly also does wonders for the turnaround of the investigation.

Another thing about the build: I think the purely rules-based builds are bad. It's not that having the common rules is bad, they're actually quite convenient. But the build system must also allow quick one-offs. These one-offs are very important for the automation of the build. They let you quickly write, say, a small Perl script that does some kind of transformation or code generation. A build system without such one-offs turns the hour-long tasks into the week-long ones. So please do support the full proper make and the scripting environment, not some crap like CMake. The counter-argument is of course "but we need to build on multiple platforms, including Windows". Yeah, building on Windows sucks, and the only decent solution I see is to bring a decent set of portable tools to each of your build platforms.

Another very simple but important thing about the build is that it must detect the errors. The Java-based tools in particular are notorious for ignoring the status codes, so when something breaks, the only way you know is by the result files being absent. Or not even by them being absent but by them failing in the tests. This is bad. All the errors have to be detected as early as possible.
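
The fail-fast principle, in a minimal made-up sketch: a helper that runs a build step and stops right away on a non-zero status instead of letting the breakage surface later in the tests.

#include <cstdio>
#include <cstdlib>

// Runs one build step and fails fast: the error is detected here,
// not discovered later through the absent result files.
void runStep(const char *cmd)
{
    int status = std::system(cmd);
    if (status != 0) {
        fprintf(stderr, "build step '%s' failed with status %d\n", cmd, status);
        exit(1);
    }
}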

This looks like a good place for another related rant, on reliability. Basically, when something works, it should work reliably. The quote attributed to Einstein goes: "The definition of insanity is doing the same thing over and over again and expecting different results." But in reality we don't control everything, and when the hidden changes cause a different result of the same actions, it's just maddening and time-consuming. The good code, including the tests and the build, should work reliably. If the underlying systems are known to have outages, it should retry, still work reliably on top of them, and not propagate the unreliability up. Every step the unreliability propagates makes it more difficult to fix, so it's best nipped in the bud. There are reasonable limits too: the calls should not get stuck retrying forever, and probably not even for an hour. Although retrying for an hour might be fine for a scheduled nightly run, for a manual run it's way too long. And when a program is retrying, it should not be silent, it should be telling what it is doing.
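
A minimal sketch of nipping the unreliability in the bud: retry a flaky call with a hard limit, back off a little, and never do it silently. flakyFetch() is a made-up stand-in for a call into an underlying system with known outages; here it's stubbed to succeed on the third try. sleep() assumes a POSIX system.

#include <cstdio>
#include <unistd.h> // for sleep(); assumes a POSIX system

static int fetchAttempts = 0;
bool flakyFetch() { return ++fetchAttempts >= 3; } // stub of the flaky underlying call

bool reliableFetch()
{
    const int maxAttempts = 5; // bounded: no retrying forever
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        if (flakyFetch())
            return true; // the unreliability stops here instead of propagating up
        // Never retry silently: tell what is being done.
        fprintf(stderr, "fetch attempt %d of %d failed, retrying\n",
            attempt, maxAttempts);
        sleep(attempt); // a short backoff, nowhere near an hour
    }
    fprintf(stderr, "fetch gave up after %d attempts\n", maxAttempts);
    return false;
}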

The next item is the low interdependencies. Basically, it means that the people should be able to work without stepping too much on each other. You might ask: but doesn't the principle of the small code base go against it? More code should allow more people to work on its parts, right? No, it doesn't work like that. The catch is that if you have more code solving the same problem, any change to it means changing about the same percentage of the code. This is why a sprawling code base is slow in development: any change requires changing a lot of code. But this is also why it doesn't allow more people to work on it in parallel. So the corollary is that if you add too many people to a project, they will start stepping on each other a lot, and the productivity will drop.

The other major kind of interaction is between the people in general. If you need to do a 5-minute task yourself, it takes you 5 minutes. But if you need someone else to do a 5-minute task, it's likely to take you a day of interaction. Which still might be faster than learning how to do it yourself, but if the task is repetitive then this becomes a major drag. To work fast, this drag has to be reduced and eliminated whenever possible. One of the consequences is that the possessive code ownership is bad. Now, having someone knowledgeable available to talk to about a part of the code is good, and it also helps to discuss the general directions of the code, to avoid breaking things for other people. But there should not be a single-person or single-team bottleneck for the changes to a certain part of the code, nor any single person controlling what is allowed to go in. There are other reasons to avoid these things too, but even the speed of development alone should be a sufficient reason. Another corollary is that the strict code style requirements are bad. The small style variations just don't matter, and the enforcement is a lot of pain with no gain whatsoever.

Another case where the excessive interaction shows up is planning. Have you seen these beautiful Gantt charts and critical path diagrams, with the carefully planned moments of interdependencies? Well, they're all garbage. They might be an interesting exercise on the way to a better understanding of the project, but things never work out like this in reality. In reality all the interdependency points end up as a mess. So there is no point in overplanning. Instead I really like the saying of my past manager: "We can plan an aggressive schedule because we don't mind it slipping once in a while". When combined with the idea of the incrementality, this points to the following concept: the level of detail in the plans must be tied to their distance. A detailed planning of the far-off items doesn't make sense because they are sure to change before we get there. These items should be on the list, so that we can keep track of the direction, but only in the general terms. Only the very near-term items need to be planned in detail, and the farther in the future, the less detail is needed. Though the very near-term items have their Heisenbergian uncertainty as well. In my experience the minimal unit of time for planning is a week. If you have five items that you expect to take one day each, they will never work out like this: some of them will take an hour and some will take three days. But if you lump all five together, they will take a week with a decent amount of certainty.

Since I've already talked about the scripted automation throughout the text, this is a wrap for this post. The things I've described look kind of small, but they really are important, and I believe that taken together they do amount to the 10x difference in productivity. What do you think?

P.S. Someone else's thoughts on the subject of 10x: http://antirez.com/news/112


Sunday, August 9, 2015

something old

Hey, it turns out that the patent for the dynamic schema modification that I worked on back in the Aleri days actually got issued, way back in 2009: http://www.google.com/patents/US20090037769

I've totally missed it and thought that it was still mired in the process. Looks like it took only 8 months from the final filing to the issue.

Wednesday, June 24, 2015

On simplicity

Today someone brought to my attention the quote from http://blog.rongarret.info/2009/02/css-and-meaning-of-life.html :

<<And then there is one element of my quality metric which seems to be at the heart of the controversy: I believe that computers are meant to serve people and not the other way around. That means that if something inherently simple is difficult to do, it's wrong.

Not everyone agrees with this. A surprising number of people like things to be difficult. I have never understood this, but it is a fact. Some people revel in the intellectual studliness of mastering, say, C++ template programming or x86 assembler. I don't. I think C++ is a pain in the ass. Life is short, and I would rather spend my allotted time skiing or making love to my wife than worrying about whether or not I need to define a virtual destructor in my base class in order to avoid memory leaks. And that means that when I want, say, for something to appear centered on the screen, I want to be able to do it by typing "center" rather than "margin:auto". If I can't tell a computer to center something by saying "center", then as far as I'm concerned the computer is wrong, not me.

Like I said, not everyone buys in to this worldview. I don't get it, but it is surprisingly common. I think it might have something to do with job security. I first encountered the complicated-is-beautiful mindset when I was working for NASA. I had an idea for how to simplify spacecraft sequencing and make it less error prone. I figured everyone would love it. To the contrary, the establishment fought the idea tooth-and-nail (and still does to this day). And this attitude is everywhere. It's the reason the tax code is so fucked up. It's the reason the law is so byzantine. It's the reason that languages like C++ and Perl thrive.>>

I see a major issue with this kind of thinking. Now, let me get this straight: I don't normally do the web stuff, so I don't have a whole lot of opinion about CSS as such. For what opinion I have, I agree with Ron that CSS sucks, and personally I like to see everything in one file without the artificial splitting. Nothing is worse than dealing with lots of small files with cross-dependencies. For what it's worth, I hate JavaScript and the web sites written with it: why do they have to be so slow and annoying?

But let's return to the subject, about the simplicity. I think there is more than one failure on Ron's part.

First of all, he seems to miss that everyone has a different definition of simplicity. And that applies not only to him but to many other people, and not only to the simplicity but to many other subjects. There are many people who like to quote the Principle of the Least Surprise. But then they go and propose the solutions that surprise me immensely.

My favorite example is a poll where people were asked whether they think that they're better drivers than average, and more than 80% of them answered "yes". And then the article about the poll went on to talk about how people overestimate themselves. What the article failed to appreciate, though, is that everyone has a different definition of what a "good driver" is. Some people think that it means driving faster, some think that it means driving slower, some think that using the turn signals is important, some don't know about the existence of the turn signals. Sure enough, most people strive to be good drivers by their own definition, and really become the good drivers by their own definition. The remaining 20% must have been the really lousy drivers.

So in my reckoning, C++ and Perl are the simple languages. They allow for simple and easy writing and reading of the code. Yeah, C++ has some strange points, like: why not make the virtual destructors mandatory whenever a class defines the virtual methods? And the early C++ really was a pain. But after the addition of templates it became awesome. It's not that you want to use the templates for everything, and you specifically don't want to use the stuff from the Alexandrescu book in any actual project. And the templates really need explicit loops instead of the stupid functional recursion, and better error diagnostics. But when they're used right, they're awesome. The modern C++ really combines the simplicity aspects of C and Perl. And Perl is an awesome simple language. It's driven by how to make things easy and convenient to use.
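
To spell out the virtual destructor pitfall mentioned here (and in Ron's quote above), a minimal made-up example: deleting a derived object through a base pointer without a virtual destructor skips the derived destructor (formally it's undefined behavior), leaking whatever the derived class owned.

#include <cstdio>

struct Base {
    ~Base() {} // should have been: virtual ~Base() {}
};

struct Derived : Base {
    char *buf;
    Derived() : buf(new char[1024]) {}
    ~Derived() { delete[] buf; }
};

int main()
{
    Base *p = new Derived;
    delete p; // ~Derived() never runs, buf leaks
    return 0;
}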

See, there are at least two definitions of simplicity. One definition holds that simplicity equals logic and consistency, that everything logical and consistent is simple. The other holds that things should be easy to use: the typical problems need to be solvable with minimal effort, and if that requires inconsistency, so be it, the consistency is of not much use by itself. If we look at the classic programming languages, C is convenient while Pascal is all-consistent (and oh so much pain to use). Perl is convenient, Python is consistent (and oh so much pain to use, have you ever tried to read the code in Python?). C++ is convenient, Java is consistent (and is just incredibly painful to use, it's like raising the sun with manual labor). Shell is convenient, PowerShell is consistent (and well, you get my drift).

Second, the wrong kind of simplicity is bad. During my time at Google, I've actually heard with my own ears someone say: "The system has got so many features to it that it became difficult to modify, so we had to throw it away and rewrite it from scratch without the extra features". This is, I think, the pinnacle of how things can go wrong. Yes, a system without many features is much easier to write. And probably keeping just 80% of the features would make it much easier to maintain. But there is a good reason for the features: they make the system easy to use. Nobody needs a system with 80% of the features, because it becomes unusable. And well, if someone can't understand and maintain it, they're probably the wrong people for the job.

That said, of course there is a limit: there are the features that make the life of the users convenient, and the features for the sake of features. The first kind is good, the second kind is bad, and telling which is which may sometimes be pretty hard. It might not even be that someone (er, marketing) included the features for the sake of features: they might have been a good idea at some point and then got supplanted by the better ways. Things evolve, and sometimes it's time for some features to die. But not wholesale, and not by the principle of simplicity of implementation or for some dumb consistency. The best and most simple-to-use features are often difficult to implement and full of quirks. But that's because the tasks they're solving are difficult and full of quirks, and the features make these tasks easy by covering the complexity and adapting to the quirks.

Wednesday, April 22, 2015

On performance of reading the unaligned data

I've made a template that was supposed to optimize the reading of the unaligned data:

#include <stdint.h> // for intptr_t
#include <string.h> // for memcpy()

template <typename T>
T getUnaligned(const T *ptr)
{
    // The mask check assumes that sizeof(T) is a power of 2.
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        // Misaligned: read through memcpy() to avoid a potential fault
        // on the architectures that care about alignment.
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        // Properly aligned: read directly.
        return *ptr;
    }
}

And I've been wondering if it really makes a difference. So I've made a test. The test compares 5 cases:

Just read the aligned memory directly;
Read the aligned memory through memcpy;
Read the unaligned memory through memcpy;
Read the aligned memory through the template;
Read the unaligned memory through the template.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount(); // the number of iterations to time
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        // The modulo indexing keeps the compiler from collapsing the loop.
        x += apdata[i % ARRAYSZ];
    }
    double tend = now();
    // Printing x keeps the sum alive, so the loop can't be optimized out.
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n", (long long) x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}


The results on an AMD64 machine with -O3 came out like this:

   case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

.L46:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

.L34:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that the machine doesn't care about alignment, and generates memcpy() as a plain memory read; and then, for the condition in the template, it's also smart enough to see that both branches end up with the same code, and eliminates the condition check.

Smart, huh? So now I'm not sure whether keeping this template as it is makes sense, or whether I should just change it to a plain memcpy().
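
If it does get changed, the plain-memcpy variant would be a sketch like this: the same template, minus the branch that the compiler eliminates anyway.

#include <string.h> // for memcpy()

// On the machines that don't care about alignment, the compiler turns
// this memcpy() into a direct load, as the generated code above shows,
// so the explicit alignment check buys nothing.
template <typename T>
T getUnaligned(const T *ptr)
{
    T value;
    memcpy(&value, ptr, sizeof(T));
    return value;
}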

BTW, I must say it took me a few attempts to come up with the code that would not get optimized out by the compiler. I've found out that memcpy() is not suitable for reading the volatile data nowadays. It defines its second argument as (const void *), so the volatility has to be cast away to pass the pointer to it, and then the compiler just happily hoists the whole memory read out of the loop. It was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting the sum. That's where the modulo operator came in handy to confuse this extreme smartness.

A separate interesting note is that each iteration of the loop executes in about one-billionth of a second. Since this is a 3GHz CPU, every iteration takes only about 3 CPU cycles. That's 6 or 10 instructions executed in 3 CPU cycles. Such are the modern miracles.

Monday, May 27, 2013

snapshot 1.0.92 and the release plans

I've released the new snapshot, 1.0.92, that contains all the most recent multithreading code. There is one more big example to go, and then the docs are to be collected from the blog into a proper manual (and a bunch of them are yet to be written), but other than that it's pretty much a preview of the next release.

Speaking of which, it has been clear for a while that the next release has outgrown the 1.1 designation. I should have done a couple of intermediate releases, but the multithreading looks like a very important feature, and I've been pushing towards it.

Hereby I officially proclaim that the next release will be 2.0, and in honor of that all the future blog posts about it will be tagged 2_0.

Thursday, April 4, 2013

the Triead lifecycle

Each Triead goes through a few stages in its life:
  • declared
  • defined
  • constructed
  • ready
  • waited ready
  • requested dead
  • dead

Note by the way that these are the stages of the Triead object; the OS-level thread as such doesn't know much about them, even though the stages do have some connections to its state.

These stages always go in order and cannot be skipped. However, for convenience you can move directly to a later stage, which will automatically pass through all the intermediate stages. Although, well, there is one exception: the "waited ready" and "requested dead" stages can get skipped on the way to "dead". Other than that, the sequence always holds, so if you find out that a Triead is dead, you can be sure that it's also declared, defined, constructed and ready. The attempts to go back to a previous stage are silently ignored.

Now, what do these stages mean?

Declared: The App knows the name of the thread and that this thread will eventually exist. When the App is asked to find the resources from this thread (such as Nexuses; by the way, the nexuses are associated with the threads that created them), it will know to wait until this thread becomes constructed, and then look for the resources. This closes an important race condition: the code that defines the Triead normally runs in a new OS thread, but there is no way to tell when exactly it will run and do its work. If you spawned a new thread and then attempted to get a nexus from it before it actually runs, the App would tell you that there is no such thread, and fail. To get around this, you declare the thread first and then start it. Most of the time there is no need to declare explicitly: the library code that wraps the thread creation does it for you.

Defined: The Triead object has been created and connected to the App. Since this is normally done from the new OS thread, it also implies that the thread is running and busy constructing the nexuses and whatever internal resources of its own.

Constructed: The Triead has constructed and exported all the nexuses it planned to. This means that these nexuses can now be imported by the other threads (i.e. connected to them). After this point the thread can not construct any more nexuses, but it can keep importing the nexuses from the other threads. It's actually a good idea to do all your exports, mark the thread constructed, and only then start importing. This order guarantees the absence of deadlocks (a deadlock would be detected and would cause the App to be aborted). There are some special cases when you need to import a nexus from a thread that is not fully constructed yet; it's possible but requires more attention and a special override. I'll talk about it in more detail later.

Ready: The thread has imported all the nexuses it wanted and fully initialized all its internals (for example, if it needs to load data from a file, it would do that before declaring itself ready). After this point no more nexuses can be imported. A fine point is that the other threads may still be created, and they may do their exporting and importing, but once a thread is marked as ready, it's cast in bronze. In the simple cases you don't need to worry about separating the constructed and ready stages: just initialize everything and mark the thread as ready.

Waited ready: Before proceeding further, the thread has to wait for all the threads in the App to become ready, or it would lose data when it tries to communicate with them. It's essentially a barrier (sketched in code after this list of stages). Normally both stages, "ready" and "waited ready", are advanced to with a single call readyReady(): the thread says "I'm ready, and let me continue when everyone is ready". After that the actual work can begin. It's still possible to create more threads after that (normally as parts of the transient fragments), and until they all become ready, the App may temporarily become unready again, but that's a whole separate advanced topic.

Requested dead: This is a way to request a thread to exit. Normally some control thread will decide that the App needs to exit and will request all its threads to die. The threads will get these requests, perform their last rites and exit. The threads don't have to wait for this request though, they can always decide to exit on their own. When a thread is requested to die, all the data communication with it stops. No more data will get to it through the nexuses, and any data it sends will be discarded. It might churn a little through the data in its input buffers, but any results produced will be discarded. The good practice is to make sure that all the data is drained before requesting a thread to die. Note that the nexuses created by this thread aren't affected at all: they keep working as usual. It's the data connections between this thread and any nexuses that get broken.

Dead: The thread has completed its execution and exited. Normally you don't need to mark this explicitly: when the thread's main function exits, the library does it for you. Marking the thread dead also drives the harvesting of the OS threads: the harvesting logic will perform a join() of the thread (not to be confused with an SQL join) and thus free the OS resources. The dead Trieads are still visible in the App (except for some special cases with the fragments), and their nexuses continue working as usual (even including those special cases with the fragments); the other threads can keep communicating through them for as long as they want.
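
To illustrate the "ready"/"waited ready" barrier mentioned in the list above: this is not the Triceps implementation, just a minimal C++11 sketch of the concept, with the method named after the real readyReady() call.

#include <condition_variable>
#include <mutex>

// A concept sketch, not Triceps code: each thread announces that it's
// ready, then blocks until every other thread has done the same.
class ReadyBarrier {
public:
    explicit ReadyBarrier(int nthreads) : left_(nthreads) {}

    // "I'm ready, and let me continue when everyone is ready".
    void readyReady()
    {
        std::unique_lock<std::mutex> lk(m_);
        if (--left_ == 0)
            cv_.notify_all(); // the last thread releases everybody
        else
            cv_.wait(lk, [this]() { return left_ == 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int left_; // how many threads haven't reported as ready yet
};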

Thursday, March 28, 2013

Triceps Multithreading Concepts

The multithreading support has solidified to the point where I can start documenting it. The full description will have to wait until I finalize the Perl API but the concepts are pretty much settled.

The idea of the multithreading support in Triceps is to make writing the multithreaded models easier: to make writing the good code easy and writing the bad code hard. But of course you don't have to use it, you can always roll your own if you wish (just as you could before now).

Without further ado, here is the diagram of a multithreaded Triceps application:

Fig. 1. Triceps application.

The Triceps application is embodied in the class App. It's possible to have multiple Apps in one program.

Each thread has multiple parts to it. First, of course, there is the OS-level (or, technically, library-level, or Perl-level) thread where the code executes. And then there is a class that represents this thread and its place in the App. To reduce the naming conflict, this class is creatively named Triead (pronounced still "thread"). In the discussion I use the word "thread" for both concepts, the OS-level thread and the Triead, and it's usually clear from the context which one I mean. But sometimes it's particularly important to make the distinction, and then I name one or the other explicitly.

The class Triead itself is largely opaque, allowing only a few methods for introspection. But there is a control interface to it, called TrieadOwner. The Triead is visible from the outside, the TrieadOwner object is visible only in the OS thread that owns the Triead. The TrieadOwner manages the thread state and acts as the intermediary in the thread's communications with the App.

The data is passed between the threads through the Nexuses. A Nexus is unidirectional, with data going only one way, however it may have multiple writers and multiple readers. All the readers see the exact same data, with rowops going in the exact same order (well, there will be other policies in the future as well, but for now there is only one policy).

A Nexus passes through the data for multiple labels, very much like an FnReturn does (and indeed there is a special connection between them). A Nexus also allows exporting the row types and table types from one thread to another.

A Nexus gets connected to the Trieads through the Facets. A Facet is a connection point between the Nexus and the Triead. Each Facet is for either reading or writing. And there may be only one Facet between a given Nexus and a given Triead; you can't make multiple connections between them. As a consequence, a thread can't both write to and read from the same Nexus; it can do only one or the other. This might actually be an overly restrictive limitation and might change in the future, but that's how things work now.

Each Nexus also has a direction: either direct ("downwards") or reverse ("upwards"). And yes, the reverse Nexuses allow building the models with loops. However, the loops consisting of only the direct Nexuses are not allowed, nor of only the reverse ones: they would mess up the flow control. A proper loop must contain a mix of direct and reverse Nexuses.

The direct Nexuses have a limited queue size and stop the writers when the queue fills up, until the data gets consumed, thus providing the flow control. The reverse Nexuses have an unlimited queue size, which allows avoiding the circular deadlocks.
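
Again a concept sketch rather than the Triceps API: the flow control of a direct Nexus behaves like a bounded queue that blocks the writers when full, while a reverse Nexus simply has no bound.

#include <condition_variable>
#include <deque>
#include <mutex>

// A concept sketch, not the Triceps API: a queue with a size limit
// that provides the flow control by blocking the writers when full.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t limit) : limit_(limit) {}

    void write(const T &item)
    {
        std::unique_lock<std::mutex> lk(m_);
        // This wait is the flow control of a direct Nexus; a reverse
        // Nexus would skip it, having an unlimited queue.
        notFull_.wait(lk, [this]() { return q_.size() < limit_; });
        q_.push_back(item);
        notEmpty_.notify_one();
    }

    T read()
    {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [this]() { return !q_.empty(); });
        T item = q_.front();
        q_.pop_front();
        notFull_.notify_one(); // unblock a waiting writer
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<T> q_;
    size_t limit_;
};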

Normally an App is built once and keeps running in that configuration until it stops. But there is a strong need to add and delete threads dynamically too. For example, if the App runs as a server and the clients connect to it, each client needs to have its thread(s) added on connection and deleted when the client disconnects. This is handled through the concept of fragments. There is no Fragment class, but when you create a Triead, you can specify a fragment name for it. Then it becomes possible to shut down and dispose of the threads in a fragment after the fragment's work is done.