After a long break, Triceps 2.1.0 is finally released: http://triceps.sourceforge.net/

It includes two important features:

1. The fast compiled Ordered index. The code for it has been sitting in SVN for three years, and I have finally found time to update the documentation to replace the old ordered index implemented in Perl with the new one.

2. Triceps now builds and works with the modern versions of C++ and Perl. This was harder to do than it sounds, since the C++ standard has changed in an incompatible and unpleasant way. But the good news is that the code produced by the new compiler is a good deal faster than that produced by the old one.

# Sergey Babkin on CEP

My thoughts on the field of Complex Event Processing. Since I'm a technical guy, they are mostly on the technical side, and mostly about my OpenSource project Triceps. To read the Triceps documentation, please follow from the oldest posts to the newest ones.

## Tuesday, December 25, 2018

## Friday, December 7, 2018

### Principal component analysis

Today I've learned about the existence of the Principal Component Analysis. It's supposed to find the orthogonal dimensions that best describe the variance of the data set. Perhaps this is the real answer to removing the duplication in the dimensions that I've been thinking about some time ago. And it was invented back in 1901! I haven't understood it yet and will need to read up on it. For now I'm just making a note so that I won't forget about it.

## Saturday, December 1, 2018

### Triceps in flux

I've finally updated my machine to a modern version of Linux (Mint 19), and now I'm going through, so to say, bringing Triceps into the Century of the Fruit Bat. Both the C++ and Perl languages have changed (and, as mentioned in a previous post, Ghostscript has had functionality removed).

Most annoyingly, C++ doesn't allow calling even the non-virtual methods on the NULL pointers any more. Well, it does allow calling them, but auto-removes any checks for (this == NULL) inside the methods. And I've been using this in multiple places. But now it's all worked around by using the special static objects, and the code builds again.

The tests still fail though, but now it's hopefully all trivial: some messages got changed because of the different hashing, and the Perl tests also fail because the syntax of regexps got changed in Perl 5.26: it doesn't accept a plain "{" in the patterns any more. But I'm working on it.


## Friday, November 23, 2018

### converting EPS to SVG

I've tried to build my book on a recent Linux Mint and failed: I've been using Ghostscript to convert the figures from EPS to the SVG format, for inclusion into PDF, and it turns out that Ghostscript doesn't support SVG any more. I've tried to download and build the source, but the source didn't have it either. I remember it got broken a long time ago; it was crashing on my previous attempt to use a newer Linux. I've found a bug report from 2013 on Ghostscript's web site. Apparently, instead of fixing the SVG driver, they've just dropped it.

But I was able to find another way, with Inkscape. The conversion command looks like this:

inkscape -D "$SRC_EPS" --export-plain-svg="$DST_SVG"

After this change the build worked like a charm.

The Triceps build will need a similar change. I'm about to try it.


## Wednesday, November 7, 2018

### how to reinvent the bicycle 2

I've been thinking more about the problem of teaching the engineering approach to problem solving, and I've decided to try using another interview problem as an example. It's an older problem, and I really like it, but I don't use it any more because other people liked and used it too, and the solutions are already available on the Internet.

I'm not just going to give the solution; blindly remembering it is of no use to anyone. I want to show how the better solutions can be born out of the bad solutions. I'm going to start with the worst solution I can think of, and then gradually show the better solutions. The puzzle for you, the reader, is to use the issues in these solutions as hints towards the better solutions, ones that would take you as far ahead as possible.

I wrote those solutions as I would do at an interview, without actually compiling and running the code on a computer. So they might contain bugs, from trivial bugs to the major ones. Whenever I've noticed a major bug after writing a solution and following through its execution manually, I've left the buggy version in, and added the fixed version. Sometimes a solution was so overcomplicated that I didn't notice a bug in it until I moved on to another, simpler solution. There still might be bugs, but hopefully not too bad ones. It just shows that the overcomplicated approaches should be avoided.

The problem is to write a matcher for the very simple regular expressions, which include only the operators "." (any character) and "*" (any number of repeats of the previous character). The "*" is greedy, consuming as many matching characters as possible. There is no escape character like backslash. The string must match completely, as if the regexp implicitly had "^" at the front and "$" at the end.

For the Windows people who don't know regular expressions: the regexps are a way of pattern matching, kind of like the file patterns (which are known in the Unix world as globs) but different. Most of the characters in the regexps have the literal meaning, just as in the file patterns. The character "." is like "?" in the file patterns, matching exactly one arbitrary character. The character "*" is different from the file patterns: it means zero or more repetitions of the previous element, so the file pattern "*" is analogous to the regexp ".*".

The function declaration will be:

bool match(const char *pattern, const char *text);

Let's start with the analysis.

The first thing to notice about this problem is that some regular expressions in it are impossible to match. The "a*a" will never match anything because the greedy "a*" will consume all the "a"s, and the second "a" will never find a match. The same goes for ".*" followed by anything, because ".*" will consume everything to the end of the string.

The other thing to consider is, could the strings be in UTF-8? UTF-8 can be handled transparently in many ways, but this is not one of them. If treated simple-mindedly, the "*" after a multi-byte Unicode character would expect the repeat of its last byte, and a "." would match a single byte of a multi-byte character. Specifically for UTF-8, this problem can be handled by keeping track of the continuation. Every byte of a multi-byte character in UTF-8 has the high bit 0x80 set, and the first one of them also has the bit 0x40 set, so you can parse where the character starts. I've looked up the exact bitmasks on the Internet; it's not something that I'd expect anyone to know off-hand at an interview, nor the details of the exact UTF-8 encoding, but the general principle that UTF-8 is easy to parse is reasonable knowledge. However, this kind of parsing is a bad idea anyway, because it would mess up the handling of the strings in the single-byte 8-bit encodings. A much better approach is to use the library functions to either convert the whole strings to wchar_t (which requires the allocation of extra memory) or to parse the strings character by character as they get read (better). This is important to mention to show that you understand the multi-byte encoding issues, and can bring you bonus points at an interview. But the problem itself is not about that, and the multi-byte support can be trivially added after the general solution is found, so let's assume that we don't need to care about it for now, and the strings are in ASCII.

The first solution goes in the most complicated way I can think of. You might have attended a college course on parsing that talked about the finite machine matcher for the regular expressions. The most unsophisticated approach is to push this way blindly.

Each node of the finite machine graph could be a plain array of the possible exits from that node, but to make things easier to write, let's use a C++ map:

using namespace std;

struct Node {
    map<char, Node*> exits_;
};

The parsing of the pattern will start like this:

bool match(const char *pattern, const char *text)
{
    vector<shared_ptr<Node> > holder;
    Node *parsed = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        ...
    }
}

Since we're dynamically allocating the Nodes, we need to take care of freeing them too. A simple way to do it is to use reference counting in the form of shared_ptr or unique_ptr. But we can't store them right in the Node, because the Nodes may have circular references. The unique_ptr will just refuse to do so (if you insist with std::move(), you'll get NULL pointers in unexpected places, and will probably crash), and shared_ptr would create a link loop and leak the memory. It's possible to arrange the entries in a plain list, and then free them at the end manually. But the easier way is to keep the reference counts in a separate vector, and let the nodes connect in whatever ways with the plain pointers. And these should be the real pointers, not the weak references. The weak references are needed when you might need them to become strong at some point. Here we have no such need, so save the overhead.

We can also use the holder to find the first and last allocated node, so we don't need the separate parsed pointer.

What should the loop do? Since the Nodes contain the map of exit links and not entry links, on each literal character in the pattern we'll need to create a new node, and then update the _previous_ node with the exit link to the new one. So we need to start by allocating one initial node before the loop, to always have a previous node.

When we see a '.', we populate the map for every character code except '\0' to point to the next node. If we had to deal with the wide characters, this wouldn't work, because the number of characters would be too great. Then we'd have to add a special "default" link. But for ASCII it's good enough.

When we see '*', we don't allocate a new node but make the previous node a copy of its previous node, pointing to itself (so we should also keep a pointer to the node before last).

And when we see a '\0', we will add to the last node an exit that maps the '\0' to a NULL pointer, this being a signal to stop. We could also add a separate boolean field final_ for this purpose, but a null pointer is just as good, and safer in the way that we wouldn't need a separate code path for checking the finality, and will always check all the pointers for NULL.

Here is the loop fleshed out:

bool match(const char *pattern, const char *text)
{
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());
            if (*p == '.') {
                for (int c = 1; c <= 255; c++) {
                    previous->exits_[(char)c] = holder.back().get();
                }
            } else {
                previous->exits_[*p] = holder.back().get();
            }
        }
    }
    holder.back()->exits_['\0'] = nullptr;
}

This code has to deal with some situations that "shouldn't happen": the invalid patterns that either start with a "*" or have multiple "*" in a row. This is something that needs to be called out explicitly. Two ways to deal with it are to either return some kind of an error indication (such as throw an exception) or just process them in some not officially defined way. This code takes the approach of the silent handling, simply because it's easier to do in a small code snippet. From the caller's standpoint it might be either good or bad: the good is that the caller won't have to handle the errors, the bad is that the author of the pattern might be surprised by its effect. Whatever choice you make, you need to be aware of this trade-off. The silent handling here is that the "*" at the start is treated as a literal, and multiple "*" in a row are treated as one.

This code is also not correct in the handling of "*", in two ways! How do I know this? I tried to run through it manually on the examples of what I see as the corner cases: "", "a", ".", "a*", "a*b", "a*a", ".*", ".*a", with both the matching and mismatching strings, and I actually tried the more complex cases first.

The first defect is that it doesn't handle the greediness right. A pattern like ".*a" will not be unmatchable; instead it would mean what in the full regexps can be expressed as "[^a]*a". Which might be not bad by itself, but isn't what the problem statement says. Worse yet, due to the same mistake the pattern "a*a" will be equivalent to a single "a", which is outright wrong. The fix for this problem is to check whether the map value at this key has been already set, and if so then don't overwrite it:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (exits_.find(c) == exits_.end()) {
            exits_[c] = to;
        }
    }
};

bool match(const char *pattern, const char *text)
{
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());
            if (*p == '.') {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                }
            } else {
                previous->set_exit(*p, holder.back().get());
            }
        }
    }
    holder.back()->set_exit('\0', nullptr);
}

The second problem is even worse: "*" ends up meaning "1 or more repetitions" instead of "0 or more repetitions", like the operator "+" in the full regexps. To fix that, on the next character after a "*" we need to add the new link to both the previous and pre-previous node.

The state machine graph for the expression "a*b" used to look like this:

      a
     +-+
   a \ / b    \0
O-----O-----O-----X

Now it will look like this:

      a
     +-+
   a \ / b    \0
O-----O-----O-----X
 \         /
  +-------+
      b

The mistake I've made here is that I haven't started writing out the explicit graph examples until after finding the bug. It could have been prevented by thinking through the graphs from the start.

The code change is:

bool match(const char *pattern, const char *text)
{
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;
    Node *before_star = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            if (before_star == nullptr) {
                holder.back()->exits_ = previous->exits_;
                before_star = previous;
            }
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());
            if (*p == '.') {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                    if (before_star) {
                        before_star->set_exit((char)c, holder.back().get());
                    }
                }
            } else {
                previous->set_exit(*p, holder.back().get());
                if (before_star) {
                    before_star->set_exit(*p, holder.back().get());
                }
            }
            before_star = nullptr;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (before_star) {
        before_star->set_exit('\0', nullptr);
    }
}

This finally took care of the pattern parsing. Mostly. It still has a bug with handling the sequences of starred characters, such as matching the pattern "a*b*c" to the string "c". It will not accept "c" because at each point we update only one previously starred node, not the whole sequence of them. This is pretty tricky code, and I didn't even notice this error until I started doing this problem the next, easier, way, and noticed it there, and then realized that the first version had this issue too.

As it has turned out after I drew the graph and traced the code, it's even worse than that. In the "a*b*c", it would accept any mix of "a"s and "b"s, rather than strictly "a"s followed by "b"s, caused by the copying of all the links from the node before the star.

Well, let's fix it. The graph for "a*b*c" used to look like this:

It should look like this:

Whenever there is a sequence of starred characters, there has got to be a link from before each starred node to each following starred node, and to the node past the stars. The only way I could figure this out was with the graph diagram.

This is very difficult to do by just following the graph, because there is no indication in the graph of which link is going to the following node and which go to the other nodes, including itself.

The better way to resolve it is to notice that we're appending the nodes to the vector sequentially anyway. So we can use the vector order to go through them in sequence. Thus the variable before_star turns from a pointer into a vector index:

The setting of all the exits on a "." became used in multiple places, so it moved into set_exit(). The parsing of the star syntax became more complicated, requiring us to remember the last character, and to keep the state for the detection of where the sequence of starred characters ends.

The "classic" way of state handling is to put the state processing as a switch statement in the loop, but that would have made the code way too painful and repetitious, so I've decided to skip that example of doing things per an unsuitable pattern and move right on to the more straightforward version above.

Now let's do the pattern matching:

It's done, but it hasn't been a small feat, with all the memory management thrown in, and all the opportunities to make mistakes in the pattern building. It can be powered through as a show of brute force, but I don't think that I've done it all in under an hour, which is the usual interview time slot.

I hope that at the very least by the time I've started talking about building the holder vector, you've got a better idea: this vector goes in parallel with the original pattern string, so why not just work directly with the pattern string? Indeed, this is a much easier way. But it also has a harder and an easier version.

Again, let's start with the harder version. The loop will be working in very much the same way as the matching loop in the parsed-pattern version (as some textbooks would teach), but will read the pattern directly from the string.

The "for" loop here is interesting, without an exit condition. This is because the matching of '\0' follows the common pattern, and is done in the same loop code, with the only addition being the check for the end of line when we detect a match of the literals. Otherwise, if we'd checked the text and pattern for EOL in the "for" loop condition, we'd have to repeat the inner "while" loop once more after the "for" loop, to account for the case when the pattern ends with a sequence of starred characters.

The inner loop is necessary to handle the sequences of multiple starred characters, such as "a*b*c" matching the "c". If we don't do the loop, "c" would get compared to "b" and the match would be considered failed. Since this code is much smaller and simpler than the parsed-pattern version, this becomes much easier to notice and implement - just add a loop!

Speaking of the stars, this version takes a different approach to the improperly placed stars in the pattern, just because it was easier to do this way. The leading star gets ignored, and the second star in a row gets treated as a literal. Which is OK, since this is the behavior for the undefined patterns (unless you're writing code that must be compatible with some existing library and needs to repeat its quirks).

This code also contains a bug: it happily matches a '.' (and thus anything following it) in the pattern to the end of line in the text. To fix that bug, the conditions that compare a pattern character to a text character need an extra check that the text character is not the terminating '\0'.

There is another way to take the same general approach but keep the main loop more classic, by using a function for the internal while loop; then this function can easily be called twice. It isn't any better or worse, just different (and I've included the bug fix into it):

And as yet another variation of the same, you could also look ahead for the stars after the pattern and save this state in a separate variable:

I think that this version is much simpler, and in the real world, if I had started writing the previous version, I would have never finished it, and would have switched to this one.

But even this version is not great. Having to keep the state feels like too much work. It would be much easier to go the other way around: iterate through the pattern and consume the matching characters from the text. Then the state will become embodied in the code rather than the variables, and will be simpler:

This version is much smaller and much easier to follow through. It does an explicit selection by the type of each pattern element, so each one of them has its own code fragment, and not the logic that is spread through the code and mixed with the logic of the other elements. And all this makes the creation of bugs more difficult.

A slightly special thing about this version is that it looks ahead by two characters, not just one, to detect that the current pattern character is followed by a star. It's not something that's usually taught, but it makes the code a lot easier.

This version also has a theoretical foundation: it's a recursive-descent LL(1) parser of the text. Except that the regular expressions define a non-recursive language, so there is no recursion. It really is perfectly per textbook, you've just got to pick the right textbook! It also parses not a fixed grammar but one given in the regular expression pattern. So it's an LL(2) parser of the pattern, with the nested LL(1) parsers of the matching substrings in the text. The 2 in LL(2) means that we're looking ahead by 2 characters. The pattern can also be parsed by an LL(1) parser, as has been shown before, but looking ahead by 2 characters makes it easier.

It is the version that kind of came to my mind almost right away when I first thought about this problem. But I can't really say that it just popped into my mind out of nowhere. I do size up the different approaches in my mind, and try the ones that look simpler first. It doesn't mean that this first estimation is always right. Sometimes I go pretty deep along one approach before deciding to abandon it, and applying the lessons learned to another approach. And sometimes that other approach ends up even worse, but the lessons learned there help to get through the logjam on the first approach.

So if you start with the poor approaches, you can still arrive at the better ones by listening to the hints that the code gives to you as you write it. When you see an easier way to go, use it. You can also power through the difficult approaches to the successful end, but that tends to be much more difficult than switching the approach to a simpler one.

approach to problem solving, and I've decided to try using another

interview problem as an example. It's an older problem, and I really like

it, but I don't use it any more because the other people liked and used it

too, and the solutions are already available on the Internet.

I'm not just going to give the solution, blindly remembering that is of no

use to anyone. I want to show how the better solutions can be born out of

the bad solutions. I'm going to start with the worst solution I can think

of, and then gradually show the better solutions. The puzzle for you, the

reader, is to use the issues in these solutions as hints towards

the better solutions, that would take you as far ahead as possible.

I wrote those solutions as I would do at an interview, without actually

compiling and running the code on a computer. So they might contain bugs,

from trivial bugs to the major ones. Whenever I've noticed a major bug

after writing a solution and following through its execution manually, I've

left the buggy version in, and added the fixed version. Sometimes a solution

was so overcomplicated that I didn't notice a bug in it until I moved on to

another, simpler solution. There still might be bugs, but hopefully not too

bad ones. It just shows that the overcomplicated approaches should be

avoided.

The problem is to write a matcher for the very simple regular expressions,

that include only the operators "." (any character) and "*" (any number of

repeats of the previous character). The "*" is greedy, consuming as many

matching characters as possible. There is no escape character like

backslash. The string must match completely, as if the regexp implicitly

had "^" at the font and "$" at the end.

For the Windows people who don't know regular expressions, the regexps are

a way of pattern matching, kind of like the file patterns (which are known

in the Unix world as globs) but different. Most of the characters in the

regexps have the literal meaning, just as in the file patterns. The

character "." is like "?" in the file patterns, matching exactly one any single

character. The character "*" is different than in the file patterns. It

means zero or more repetitions of the previous element, so the file pattern

"*" is analogous to the regexp ".*".

The function declaration will be:

bool match(const char *pattern, const char *text);

Let's start with the analysis.

The first thing to notice about this problem is that some regular expressions

in it are impossible to match. The "a*a" will never match anything because

the greedy "a*" will consume all the "a"s, and the second "a" will never encounter

a match. The same goes for "*." followed by anything, because ".*" will

consume everything to the end of the string.

The other thing to consider is, could the strings be in UTF-8? UTF-8 can

be handled transparently in many ways but this is not one of them. If

treated simple-mindedly, the "*" after a multi-byte Unicode character would

expect the repeat of its last byte. And a "." would match a single byte of

a multibyte character. Specifically for UTF-8, this problem

can be handled by keeping track of the continuation. Every byte of a

multi-byte character in UTF has the high bit 0x80 set, and the first one of

them has also the bit 0x40 set, so you can parse where the character

starts. I've looked up the exact bitmasks on the Internet, it's not

something that I'd expect anyone to know off-hand at the interview, nor the

details of the exact UTF-8 encoding, but the general principle that UTF-8

is easy to parse is a reasonable knowledge. However this kind of parsing is a

bad idea anyway, because it would mess up the handling of the strings in

the single-byte 8-bit encodings. A much better approach is to use the

library functions to either convert the whole strings to wchar_t (requires

the allocation of extra memory) or to parse the strings

character-by-character as they get read (better). This is important to

mention to show that you understand the multi-byte encoding issues, and can

bring you bonus points at an interview. But the problem itself is not about

that, the multi-byte support can be trivially added after the general

solution is found, so let's assume that we don't need to care about that

for now, and the strings are in ASCII.

The first solution goes in the most complicated way I can think of. You

might have attended a college course on parsing that talked about the

finite machine matcher for the regular expressions. The most unsophisticated

approach is to push this way blindly.

Each node of the finite machine graph could be a plain array of the

possible exits from that node, but to make things easier to write, let's

use a C++ map:

using namespace std; struct Node { map<char, Node*> exits_; }; The parsing of the pattern will start like this: bool match(const char *pattern, const char *text) { vector<shared_ptr<Node> > holder; Node *parsed = nullptr; for (const char *p = pattern; *p != 0; p++) { ... } }

Since we're dynamically allocating the Nodes, we need to take

care of freeing them too. A simple way to do it is to use the reference

counting in the form of shared_ptr or unique_ptr. But we can't store them

right in the Node, because the Nodes may have circular references. The

unique_ptr just will refuse to do so (if you insist with std::move(),

you'll get NULL pointers in the unexpected places, and will probably

crash), and shared_ptr would create a link loop and leak the memory. It's

possible to arrange the entries in a plain list, and then free them at the

end manually. But the easier way to do it is to keep the reference counts

in a separate vector, and let the nodes connect in whatever ways with the

pointers. And these should be the real pointers, not the weak references.

The weak references are needed when you might need them to become strong at

some point. Here we have no such need, so save the overhead.

We can also use the holder to find the first and last allocated node, so we

don't need the separate parsed pointer.

What should the loop do? Since the Nodes contain the map of exit links and

not entry links, on each literal character in the pattern we'll need to

create a new node, and then update the _previous_ node with the exit link

to the new one. So we need to start by allocating one initial node before

the loop, to always have a previous node.

When we see a '.', we populate the map for every character code except '\0' to

point to the next node. If we had to deal with the wide characters, this

wouldn't work, because the number of characters would be too great. Then

we'd have to add a special "default" link. But for ASCII it's good enough.

When we see '*', we don't allocate a new node but make the previous node a

copy of its previous node, pointing to itself (so we should also keep a

pointer to the node before last).

And when we see a '\0', we will add to the last node an exit that maps the

'\0' to a NULL pointer, this being a signal to stop. We could also add a

separate boolean field final_ for this purpose, but a null pointer is just

as good, and safer in the way that we wouldn't need a separate code path

for checking the finality, and will always check all the pointers for NULL.

Here is the loop fleshed out:

bool match(const char *pattern, const char *text) { vector<shared_ptr<Node> > holder; holder.push_back(make_shared<Node>()); Node *previous = nullptr; for (const char *p = pattern; *p != 0; p++) { if (*p == '*' && previous != nullptr) { holder.back()->exits_ = previous->exits_; } else { previous = holder.back().get(); holder.push_back(make_shared<Node>()); if (*p == '.') { for (int c = 1; c <= 255; c++) { previous->exits_[(char)c] = holder.back().get(); } } else { previous->exits_[*p] = holder.back().get(); } } } holder.back()->exits_['\0'] = nullptr; }

This code has to deal with some situations that "shouldn't happen", the

invalid patterns that either start with a "*" or have multiple "*" in a

row. This is something that needs to be called out explicitly. Two ways to

deal with it is to either return some kind of an error indication (such as

throw an exception) or just process them in some not officially defined

way. This code takes the approach of the silent handling, simply because

it's easier to do in a small code snippet. From the caller's standpoint it

might be either good or bad: the good is that the caller won't have to

handle the errors, the bad is that the author of the pattern might be

surprised by its effect. Whatever choice you make, you need to be aware

of this trade-off. The silent handling here is that the "*" at the start is

treated as a literal, and multiple "*" are treated as one.

This code is also not correct in the handling of "*", in two ways! How do I

know this? I tried to run through it manually on the examples of what I see

like corner cases: "", "a", ".", "a*", "a*b", "a*a", ".*", ".*a" with the

matching and mismatching strings, and I actually tried the more complex

cases first.

The first defect is that it doesn't handle the greediness right. A pattern

like ".*a" will not be unmatchable, instead it would mean what in the full

regexps can be expressed as "[^a]*a". Which might be not bad by itself but

isn't what the problem statement says. Worse yet, due to the same mistake

the pattern "a*a" will be equivalent to a single "a", which is outright

wrong. The fix for this problem is to check whether the map value at this

key has been already set, and if so then don't overwrite it:

struct Node { map<char, Node*> exits_; void set_exit(char c, Node *to) { if (exits_.find(c) == exits_.end()) { exits_[c] = to; } } }; bool match(const char *pattern, const char *text) { vector<shared_ptr<Node> > holder; holder.push_back(make_shared<Node>()); Node *previous = nullptr; for (const char *p = pattern; *p != 0; p++) { if (*p == '*' && previous != nullptr) { holder.back()->exits_ = previous->exits_; } else { previous = holder.back().get(); holder.push_back(make_shared<Node>()); if (*p == '.') { for (int c = 1; c <= 255; c++) { previous->set_exit((char)c, holder.back().get()); } } else { previous->set_exit(*p, holder.back().get()); } } } holder.back()->set_exit('\0', nullptr); }

The second problem is even worse, "*" ends up meaning "1 or more

repetitions" instead of "0 or more repetitions", like the operator "+" in

the full regexps. To fix that, on the next character after a "*" we need to

add the new link to both the previous and pre-previous node.

The state machine graph for the expression "a*b" used to look like this:

a +-+ a \ / b \0 O-----O-----O-----X

Now it will look like this:

```
       a
      +-+
   a  \ /  b    \0
 O-----O-----O-----X
  \         /
   +-------+
       b
```

The mistake I've made here is that I didn't start writing out the explicit graph examples until after finding the bug. It could have been prevented by thinking through the graphs from the start.

The code change is:

```cpp
bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;
    Node *before_star = nullptr;
    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            if (before_star == nullptr) {
                holder.back()->exits_ = previous->exits_;
                before_star = previous;
            }
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());
            if (*p == '.') {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                    if (before_star) {
                        before_star->set_exit((char)c, holder.back().get());
                    }
                }
            } else {
                previous->set_exit(*p, holder.back().get());
                if (before_star) {
                    before_star->set_exit(*p, holder.back().get());
                }
            }
            before_star = nullptr;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (before_star) {
        before_star->set_exit('\0', nullptr);
    }
    // ... the matching loop is shown further below ...
}
```

This finally took care of the pattern parsing. Mostly. It still has a bug with handling the sequences of starred characters, such as matching the pattern "a*b*c" to the string "c". It will not accept "c" because at each point we update only one previously starred node, not the whole sequence of them. This is pretty tricky code, and I didn't even notice this error until I started doing this problem the next, easier, way, and noticed it there, and then realized that the first version had this issue too.

As it has turned out after I drew the graph and traced the code, it's even worse than that. In the "a*b*c", it would accept any mix of "a"s and "b"s, rather than strictly "a"s followed by "b"s, caused by the copying of all the links from the node before the star.

Well, let's fix it. The graph for "a*b*c" used to look like this:

```
       a       a,b
      +-+      +-+
   a  \ /   b  \ /   c     \0
 O-----O-------O-------O-------X
  \     \     /       /
   +-----\---+       /
     b    +---------+
               c
```

It should look like this:

```
       a        b
      +-+      +-+
   a  \ /   b  \ /   c     \0
 O-----O-------O-------O-------X
 |\     \     /       /|
 | +-----\---+       / |
 |    b   +---------+  |
 |             c       |
  \                   /
   +-----------------+
            c
```

Whenever there is a sequence of starred characters, there has got to be a link from before each starred node to each following starred node, and to the node past the stars. The only way I could figure this out was with the graph diagram.

This is very difficult to do by just following the graph, because there is no indication in the graph of which link goes to the following node and which links go to the other nodes, including itself.

The better way to resolve it is to notice that we're appending the nodes to the vector sequentially anyway. So we can use the vector order to go through them in sequence. Thus the variable before_star turns from a pointer into a vector index:

```cpp
struct Node {
    map<char, Node*> exits_;
    void set_exit(char c, Node *to) {
        if (c == '.') {
            for (int i = 1; i <= 255; i++) {
                if (exits_.find((char)i) == exits_.end()) {
                    exits_[(char)i] = to;
                }
            }
        } else {
            if (exits_.find(c) == exits_.end()) {
                exits_[c] = to;
            }
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    int first_link = 0;
    char last_char = '\0';
    enum State {
        AFTER_STAR,
        AFTER_LITERAL,
    };
    State state = AFTER_LITERAL;
    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && holder.size() > 1) {
            holder.back()->set_exit(last_char, holder.back().get());
            state = AFTER_STAR;
        } else {
            last_char = *p;
            if (state == AFTER_LITERAL) {
                first_link = holder.size() - 1;
            }
            holder.push_back(make_shared<Node>());
            for (int i = first_link; i < (int)holder.size() - 1; i++)
                holder[i]->set_exit(*p, holder.back().get());
            state = AFTER_LITERAL;
        }
    }
    // the '\0' exit goes to every node that may be final: only the last
    // node after a literal, or the whole starred tail after a star
    if (state == AFTER_LITERAL)
        first_link = holder.size() - 1;
    for (int i = first_link; i < (int)holder.size(); i++)
        holder[i]->set_exit('\0', nullptr);
    // ... the matching loop is shown below ...
}
```

The setting of all the exits on a "." became used in multiple places, so it moved into set_exit(). The parsing of the star syntax became more complicated, requiring to remember the last character, and to keep the state for detecting where the sequence of starred characters ends.

The "classic" way of state handling is to put the state processing as a switch statement in the loop, but that would have made the code way too painful and repetitive, so I've decided to skip that example of doing things per an unsuitable pattern and move right on to the more straightforward version above.

Now let's do the pattern matching:

```cpp
bool match(const char *pattern, const char *text) {
    // ... skipped the pattern parsing ...
    Node *cur = holder.front().get();
    for (const char *t = text; cur != nullptr; t++) {
        auto it = cur->exits_.find(*t);
        if (it == cur->exits_.end())
            return false;
        cur = it->second;
    }
    return true;
}
```
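For reference, the parsing and the matching can be assembled into one self-contained sketch. One adjustment here: at the end, the '\0' exit is set on every node from first_link through the last one inclusive (advancing first_link past a trailing literal), so that a pattern ending in a literal accepts only at the last node while a pattern ending in stars accepts anywhere in the starred tail; the rest follows the version above:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

using namespace std;

// one state of the pattern state machine, as in the post
struct Node {
    map<char, Node*> exits_;
    void set_exit(char c, Node *to) {
        if (c == '.') {
            // '.' stands for every character, without overwriting
            // the exits that are already set
            for (int i = 1; i <= 255; i++) {
                if (exits_.find((char)i) == exits_.end())
                    exits_[(char)i] = to;
            }
        } else if (exits_.find(c) == exits_.end()) {
            exits_[c] = to;
        }
    }
};

bool match(const char *pattern, const char *text) {
    // --- build the state machine from the pattern ---
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    int first_link = 0;      // first node that links to the next new node
    char last_char = '\0';   // last literal seen, for the starred self-loop
    bool after_star = false; // are we inside a starred sequence?
    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && holder.size() > 1) {
            holder.back()->set_exit(last_char, holder.back().get());
            after_star = true;
        } else {
            last_char = *p;
            if (!after_star)
                first_link = holder.size() - 1;
            holder.push_back(make_shared<Node>());
            for (int i = first_link; i < (int)holder.size() - 1; i++)
                holder[i]->set_exit(*p, holder.back().get());
            after_star = false;
        }
    }
    // the end-of-string exit goes to every node that may be final
    if (!after_star)
        first_link = holder.size() - 1;
    for (int i = first_link; i < (int)holder.size(); i++)
        holder[i]->set_exit('\0', nullptr);

    // --- run the text through the state machine ---
    Node *cur = holder.front().get();
    for (const char *t = text; cur != nullptr; t++) {
        auto it = cur->exits_.find(*t);
        if (it == cur->exits_.end())
            return false;
        cur = it->second;
    }
    return true;
}
```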

It's done but it hadn't been a small feat, with all the memory management thrown in, and all the opportunities to make mistakes in the pattern building. It can be powered through as a show of brute force, but I don't think that I've done it all in under an hour, which is the usual interview time slot.

I hope that at the very least by the time I've started talking about building the holder vector, you've got a better idea: this vector goes in parallel with the original pattern string, so why not just work directly with the pattern string? Indeed, this is a much easier way. But it also has the harder and easier versions.

Again, let's start with the harder version. The loop will be working in very much the same way as the matching loop in the parsed-pattern version (as some textbooks would teach), but will read the pattern directly from the string.

```cpp
bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points after the last literal
            // character but possibly before the following star
            if (*p == '*') {
                if (last_pattern == '.' || *t == last_pattern) {
                    if (*t == 0)
                        return true;
                    break;
                }
                ++p;
            }
            last_pattern = *p;
            if (*p == '.' || *t == *p) {
                if (*t == 0)
                    return true;
                ++p;
                break;
            }
            if (*p == 0)
                return false;
            ++p;
            if (*p != '*')
                return false;
        }
    }
    return false; // never reached
}
```

The "for" loop here is interesting, without an exit condition. This is because the matching of '\0' follows the common pattern, and is done in the same loop code, with the only addition being the check for the end of lines when we detect a match of the literals. Otherwise, if we'd checked the text and pattern for EOL in the "for" loop condition, we'd have to repeat the inner "while" loop once more after the "for" loop, to account for the case when the pattern ends with a sequence of starred characters.

The inner loop is necessary to handle the sequences of multiple starred characters, such as "a*b*c" matching the "c". If we don't do the loop, "c" would get compared to "b" and the match will be considered failed. Since this code is much smaller and simpler than the parsed-pattern version, this becomes much easier to notice and implement - just add a loop!

Speaking of the stars, this version takes a different approach to the improperly placed stars in the pattern, just because it was easier to do this way. The leading star gets ignored, and the second star in a row gets treated as a literal. Which is OK, since this is the behavior for the undefined patterns (unless you're writing code that must be compatible with some existing library and need to repeat its quirks).

This code also contains a bug. It happily matches the '.' followed by anything in the pattern to an end of line in the text. To fix that bug, the conditions of the form

```cpp
if (*p == '.' || *t == *p)
```

need to be replaced with

```cpp
if (*t == *p || *t != 0 && *p == '.')
```

There is another way to take the same general approach but keep the main loop more classic, by using a function for the internal while loop, and then this function can be easily called twice. It isn't any better or worse, just different (and I've included the bug fix into it):

```cpp
bool match_one(char &last_pattern, const char *&p, char c) {
    while (true) { // for starred sequences in pattern
        // at this point p always points after the last literal
        // character but possibly before the following star
        if (*p == '*') {
            if (c == last_pattern || c != 0 && last_pattern == '.') {
                break;
            }
            ++p;
        }
        last_pattern = *p;
        if (c == *p || c != 0 && *p == '.') {
            ++p;
            break;
        }
        if (*p == 0)
            return false;
        ++p;
        if (*p != '*')
            return false;
    }
    return true;
}

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    const char *p = pattern;
    for (const char *t = text; *t != 0; t++) {
        if (!match_one(last_pattern, p, *t))
            return false;
    }
    if (!match_one(last_pattern, p, '\0'))
        return false;
    return true;
}
```

And as yet another variation of the same, you could also look ahead for the stars in the pattern and save this state in a separate variable:

```cpp
bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    bool last_starred = false;
    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points before the next literal
            // character
            if (last_starred) {
                // the same end-of-text check as in the literal match below
                if (*t == last_pattern || *t != 0 && last_pattern == '.') {
                    break;
                }
            }
            last_pattern = *p++;
            last_starred = (last_pattern != 0 && *p == '*');
            if (last_starred) // skip over the star
                ++p;
            if (*t == last_pattern || *t != 0 && last_pattern == '.') {
                if (*t == 0)
                    return true;
                break;
            }
            if (!last_starred)
                return false;
        }
    }
    return false; // never reached
}
```

I think that this version is much simpler, and in the real world, if I started writing the previous version, I would have never finished it and switched to this one.

But even this version is not great. Having to keep state feels like too much work. It would be much easier to go the other way around, iterate through the pattern and consume the matching characters from the text. Then the state will become embodied in the code rather than variables, and will be simpler:

```cpp
bool match(const char *pattern, const char *text) {
    const char *t = text;
    for (const char *p = pattern; *p != 0; p++) {
        if (p[1] == '*') {
            if (*p == '.') {
                while (*t)
                    ++t;
            } else {
                while (*t == *p)
                    ++t;
            }
            ++p; // adjust to consume the star
        } else if (*p == '.') {
            if (*t++ == 0)
                return false;
        } else {
            if (*t++ != *p)
                return false;
        }
    }
    return *t == 0;
}
```

This version is much smaller and much easier to follow through. It does an explicit selection by the type of each pattern element, so each one of them has its own code fragment and not the logic that is spread through the code and mixed with the logic of the other elements. And all this makes the creation of bugs more difficult.

A slightly special thing about this version is that it looks ahead by two characters, not just one, to detect that the current pattern character is followed by a star. It's not something that's usually taught but it makes the code a lot easier.

This version also has a theoretical foundation: it's a recursive-descent LL(1) parser of the text. Except that the regular expressions define a non-recursive language, so there is no recursion. It really is perfectly per textbook, you've just got to pick the right textbook! It also parses not a fixed grammar but one given in the regular expression pattern. So it's an LL(2) parser of the pattern, with the nested LL(1) parsers of the matching substrings in the text. The 2 in LL(2) means that we're looking ahead by 2 characters. It can also be parsed by an LL(1) parser as has been shown before but looking ahead by 2 characters makes it easier.

It is the version that kind of came to my mind almost right away when I first thought about this problem. But I can't really say that it just popped into my mind out of nowhere. I do size up the different approaches in my mind, and try the ones that look simpler first. It doesn't mean that this first estimation is always right. Sometimes I go pretty deep along one approach before deciding to abandon it, and applying the lessons learned to another approach. And sometimes this other approach ends up even worse, but the lessons learned there help to get through the logjam on the first approach.

So if you start with the poor approaches, you can still arrive at the better ones by listening to the hints that the code gives to you as you write it. When you see an easier way to go, use it. You can also power through a difficult approach to the successful end but that tends to be much harder than switching to a simpler one.

## Thursday, October 25, 2018

### how to reinvent the bicycle

I've been recently conducting some job interviews, and met a spate of junior engineer candidates with a similar issue: they could quickly come up with a pretty good overall idea of the solution, and they can write the code as such, but they would fail to translate this overall idea into the code. They couldn't divide the overall idea into components and then step by step work through the details and interdependencies of these components and sub-components, even with intense hinting from my side.

This is something that should be done almost mechanically, with little mental effort. And yet they could not do it, they try to do it by intuition, but their intuition is not strong enough to handle this complex problem, and they don't know how to use the mechanical approach either. The hints don't help much because they don't cause the right mechanistic associations.

The problem I ask is actually quite difficult, too difficult for a perfectly adequate junior-to-midlevel engineer, and I'm not sure if I myself would have solved it well some 20 years ago (I know that I can solve it now, it came from my real-life experience where I had to do this thing from scratch real fast). So I don't expect a good solution from this category of candidates, a so-so solution is plenty good enough for them (although some of them do very well, producing a fully completed optimal solution). But there is a difference between producing a complete solution that is not that good and getting the right overall idea, figuring out the conceptual parts that I consider difficult and important (that the first group never figures out), only to fail miserably to work out all the details necessary to write the code. Not that the code doesn't get produced at all (though sometimes it doesn't), but what gets produced has glaring holes and is much worse than the code produced by the first group. Interestingly, I see a good deal more women than men in this second group. The first group is, I think, about even between men and women; this sounds like a statistical impossibility but this second group is small compared to the first one and doesn't shift the overall averages that much. Granted, the sample size is small, so it might well be a statistical fluctuation, or maybe not.

I don't want to discuss the exact problem I use, because I don't want it to spread over the internets, making me invent another one. But I've come up with a good analogy that illustrates the issue. Some time ago I've read about the artists that would ask people to draw a bicycle, and then would produce a bicycle in this drawn shape as an art object. It's suitable to be only an art object because it's completely non-functional. If I were to draw a bicycle without thinking, I would also produce something like that. But spending some thought, any engineer should be able to reproduce a proper bicycle from the general logic: the function of the main components (wheels, steering, seat, pedals, chain) and the general considerations of the strength of the frame that shape it. The same logic can be used to check that none of the main components were forgotten: for example, if you forget about the chain, the pedals would be left disconnected from the rear wheel, so you'd have to recall it. Each component might be very non-trivial (the said chain took a long time to invent), but once you know the components, it should be impossible to put them into a wrong place. If an engineer figured out the chain but couldn't put it into the right place, there is something very wrong with this engineer.

The problem, I think, is that people are not really taught to do this kind of thinking in programming. The books and college courses describe the syntax of the programming languages and the general picture but leave a void between these layers. People learn this on their own from the examples and practice. But the examples and practice tend to train the intuition, and people are left to figure out the mechanistic approach on their own, and they either figure it out or they don't. It looks like quite a few of the generally smart people either don't or take a long time to develop it. Not to say that there is anything wrong with intuition, it's my favorite thing, but the mechanistic approach allows to stretch a good deal beyond the immediate reach of intuition, and to grow the future intuition.

I've recently seen a question on Quora, approximately "As you gain more experience, do you still write code that works but you don't know why?", and this I think is exactly the difference between the intuitive and mechanistic solutions. The intuition might give you some code that works, or possibly doesn't. The mechanistic approach lets you verify that what the intuition provided actually does what it's supposed to do, and provides the stepping stones for the intuition to go further: both to fix what is going wrong, and to do the more complex multi-leap designs.

By the way, the functional programming in Haskell and such seems to be an area where people are exhibiting this issue in a particularly exaggerated way: they write the high-level code hoping that the compiler will do the mechanistic work for them, and the compiler does but often not in the way they expect.

Programming is not the only area with this kind of a teaching problem. I think math has the same issue. The way the proofs of various theorems are taught is usually not the way how the authors had discovered them. These proofs get edited and adjusted a lot to make them look easier to understand. But this loses the teaching aspect of how to create the new proofs.

So, how would you teach it? Perhaps by walking through the solutions of the complex problems, showing step by step, how the initial general ideas get gradually solidified, deducing the requirements for their elements, and recursively for the sub-elements, resolving the conflicts along the way, with setbacks and retries.

The bicycle example suggests that there is probably a general transferable skill too, and this skill can be trained by the puzzle games like the classic "The Incredible Machine" where the goal is to build a Rube Goldberg contraption to accomplish the particular goal from a set of components. Like in the real life, the tasks there might include the extra components that look useful but don't really work out, or provide multiple ways to reach the goal. This is of course different from programming in the way that you need to achieve only one exact goal, while in programming you have to solve a whole class of related goals that include the corner cases, but still is a good point to start.

So the better way to teach for programming would probably be a combination of the "lecture material" and of the puzzles where the students have to build the working code from a number of provided hints.

I actually know an old book that does something like this: Jon Bentley's "Programming Pearls". It's built around a somewhat different idea but maybe it's close enough.


## Thursday, October 11, 2018

### simplicity is in the eye of beholder

I've been recently discussing a design with someone. We both wanted the design to be simple but disagreed about the details. Eventually I've realized that we wanted the different kinds of simplicity.

For him, the simplicity was in the compact protocol messages. And if the endpoint logic to handle them becomes arcane, so be it.

For me, the simplicity was in the simple logic of the endpoints, and if the messages become larger and exchanged more frequently, so be it. (There are of course cases where the bandwidth matters but here the control messages represented a minor overhead either way).


## Thursday, September 6, 2018

### COW filesystems

I've recently learned the term "Copy-On-Write filesystem". I guess I've been living under a rock. Not that it's a new concept, NetApp was doing this kind of a filesystem 20 years ago, but I've never heard before (or maybe heard and forgot) the term "COW filesystem".

I've also realized an interesting thing: the COW filesystems seem to be essentially the log-structured filesystems but done better. A major problem with the log-structured filesystems is the periodic cleaning of the log, when the dead data gets discarded and the live data gets compacted. If the filesystem contains a large amount of data, with only a small fraction of it changing (which is fairly typical), the compaction ends up moving vast amounts of data around to fill the small gaps left by the changed data.

A COW filesystem doesn't have a contiguous log, instead it collects the unreferenced blocks into a free list and can reuse them without moving the rest of the data. Yes, the writes become non-sequential on the device, but with the modern devices, especially SSDs, this doesn't matter much any more.
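The contrast can be illustrated with a toy model of the block reuse (the names and the structure here are made up for the sketch): an updated block gets written to a location taken from the free list, and the block holding the old version goes onto the free list, with no compaction pass at all:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// a toy model of a COW block update: a block is never
// overwritten in place
struct ToyCow {
    std::vector<std::vector<uint8_t> > blocks; // the block storage
    std::vector<size_t> freeList;              // reusable block numbers

    size_t allocate() {
        if (!freeList.empty()) {
            size_t b = freeList.back(); // reuse a freed block as-is,
            freeList.pop_back();        // without any compaction
            return b;
        }
        blocks.push_back(std::vector<uint8_t>());
        return blocks.size() - 1;
    }

    // "update" a block: write a new copy, free the old one,
    // return the new block number
    size_t update(size_t oldBlock, const std::vector<uint8_t> &data) {
        size_t b = allocate();
        blocks[b] = data;             // the write goes to a new location
        freeList.push_back(oldBlock); // the old version becomes free
        return b;
    }
};
```

(A real filesystem would also update the parent pointers, copy-on-write, all the way up to the superblock, but the free-list mechanics are the point here.)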


## Sunday, July 1, 2018

### graph equivalence 6: optimizations

<<Prev

I've already mentioned some possible optimizations but let's have another look at them.


The **first optimization** is that whenever a node gets assigned an ID, this ID can also be used as its hash H(0), or more exactly, since the IDs start with 0, H(0) = ID+1, to avoid having the hash value of 0. It's not a very good hash but it sidesteps the whole problem of making sure that H(0) of each node with an assigned ID is unique. It also makes it easy to check that the hashes of the other nodes don't collide with the hashes of the nodes with the assigned IDs, just by checking that their value is greater than N.

The **second optimization** is something I've already talked about: the simplified computation of H(D+1) that goes like this:

1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).

2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.

3. Include the hash H(0) at distance 0 of the node itself into H(D+1).

4. Iterate through the links connected to this node in the order of their tags (pre-sort them at the start of the algorithm). Within the links of the same tag, combine the previous hashes H(D) of the nodes at the other end of the links in a commutative way, such as by adding them together. Then include the tag of the link and the combined hashes of the nodes into H(D+1).

Skipping the sorting of the nodes on the links with the same tag reduces the time to O(N). Then the main algorithm can be re-shuffled a bit by moving the search for the unique nodes into the outer loop:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.2. Order the nodes in ascending order by (ID, H(D)); the unassigned IDs go after all the assigned IDs.

1.3. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D).

1.4. If there still are any nodes with unassigned ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.

This way the computation of the loop in 1.1 takes O(N^2). The sorting in 1.2 takes O(N*log(N)), the iteration in 1.3 takes O(N), and 1.4 is O(1). The largest of these is O(N^2). Then we multiply it by the outer loop executing N times and get O(N^3). So this optimization saves a factor of log(N) at run time.
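As an illustration, here is a minimal Python sketch of this loop. The graph representation, the `assign_ids` name, and the stand-in hash are my own inventions, not Triceps code; the sketch incorporates the first and second optimizations, ignores hash collisions, and runs the inner loop a fixed N+1 times instead of checking the S(D) stopping condition.

```python
import hashlib

def _h(*parts):
    # Stand-in hash combinator; a real implementation would use a fast
    # non-cryptographic hash.
    s = "|".join(str(p) for p in parts)
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def assign_ids(nodes, tags, links):
    """nodes: list of node names; tags[v]: tag of node v;
    links[v]: list of (link_tag, other_node) pairs."""
    n = len(nodes)
    ids = {}  # node -> assigned canonical ID
    while len(ids) < n:
        # H(0): ID+1 for assigned nodes (first optimization); otherwise a
        # tag-based hash shifted above N so it can't collide with any ID+1
        h = {v: ids[v] + 1 if v in ids else n + 1 + _h(tags[v]) for v in nodes}
        for _ in range(n + 1):  # D = 1 ... N+1
            nh = {}
            for v in nodes:
                if v in ids:  # second optimization: H(D+1) = H(0), unchanged
                    nh[v] = h[v]
                    continue
                # combine neighbor hashes per link tag, commutatively (by sum)
                by_tag = {}
                for ltag, w in links[v]:
                    by_tag[ltag] = by_tag.get(ltag, 0) + h[w]
                nh[v] = _h(h[v], *sorted(by_tag.items()))
            h = nh
        # step 1.3: nodes with a unique H(D) get IDs, in the order of H(D)
        unassigned = [v for v in nodes if v not in ids]
        counts = {}
        for v in unassigned:
            counts[h[v]] = counts.get(h[v], 0) + 1
        for v in sorted(unassigned, key=lambda v: h[v]):
            if counts[h[v]] == 1:
                ids[v] = len(ids)
        # step 1.4: if any ties remain, break one arbitrarily (lowest H(D))
        remaining = [v for v in unassigned if v not in ids]
        if remaining:
            ids[min(remaining, key=lambda v: h[v])] = len(ids)
    return ids
```

On a symmetric graph such as the small cross from part 3 (one center linked to four identical leaves), the center is the only node with a unique hash and gets ID 0 on the first pass; the equivalent leaves are then split off one per pass.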

The **third optimization** is kind of straightforward: when sorting the nodes by H(D), there is really no need to include the nodes with the assigned IDs into the sort, they will stay first anyway. It's not a very mighty optimization but it saves a little.

The **fourth optimization** is that the signatures for the graphs can be computed in parallel before comparison. So if you have a million graphs, that allows for some impressive parallelism.

<<Prev

## Saturday, June 16, 2018

### graph equivalence 5: resolving the hash collisions

<<Prev Next>>

Here is the version with the hash collision resolution:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. Order the nodes in ascending order by (ID, H(D)); the unassigned IDs go after all the assigned IDs.

1.1.3. If there are multiple nodes with the same H(D), compare their topology:

1.1.3.1. Pick the first node with this H(D): if there is a node with an assigned ID and this H(D), pick it, otherwise pick a random one.

1.1.3.2. Compare the second and following nodes with the first node for equality: first the assigned ID, then the tags on the nodes themselves, then the number of links, then the tags on the links and hashes H(D-1) of the nodes they lead to (using the ordering of the links from the hash computation).

1.1.3.3. If the comparison has shown a difference, rehash H(D) of that node, check whether any other node has that new hash, and repeat the comparison and rehashing until this node is either unique or finds a match.

1.1.4. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.1.5. See if any of the nodes with unassigned ID have a unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID and going in the order of their H(D). Re-compute their H(0...D) based on the newly assigned ID.

1.1.5.1. Compare the H(D) of the nodes with the newly assigned IDs with the H(D) of all the other nodes. If a collision is detected, rehash and compare again, until the collision disappears.

1.2. If there are still any nodes with unassigned ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.

Why this works: On each iteration of the inner loop we guarantee that by the end of it the nodes with different topologies have different hashes. So at the next iteration we can use this property in the comparisons of the nodes: it's enough to compare only one link deep to find all the differences.
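For illustration, the one-link-deep comparison of step 1.1.3.2 might look like this in Python (the data layout and names are mine, not from Triceps):

```python
def same_at_depth(v, w, ids, tags, links, h_prev):
    """Compare nodes v and w one link deep: assigned IDs first, then the
    nodes' own tags, then the number of links, then the links themselves
    as (link_tag, H(D-1) of the target node), ordered the same way as in
    the hash computation. h_prev[x] is H(D-1) of node x; links[x] is a
    list of (link_tag, other_node) pairs."""
    if ids.get(v) != ids.get(w):   # both unassigned, or the same ID
        return False
    if tags[v] != tags[w]:
        return False
    if len(links[v]) != len(links[w]):
        return False
    key = lambda x: sorted((t, h_prev[o]) for t, o in links[x])
    return key(v) == key(w)
```

Since the previous iteration already guaranteed that different topologies got different values of H(D-1), this shallow comparison is enough to find all the differences.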

Let's estimate the run time: The two loops outside the main logic will still be O(N^2). Now let's look inside the loops. Since the hash collisions should be quite rare, we can compute the complexity of the collision resolution based on the case when the collisions are not detected. The worst case would be when all the nodes have the same hash, so up to (N-1) nodes will be compared with the first node. The complexity of comparing two nodes is proportional to the number of links on them, EperN. So we'd get O(N*EperN). We can assume a bit pessimistically that EperN is proportional to N, and then it will become O(N^2). The collision resolution of the nodes with the newly assigned IDs still requires comparison with all the other nodes but the comparison is cheap, only the hashes are compared. The total complexity of that would be O(N) per node, and if all nodes get assigned in one go then up to O(N^2) for all of them. But hey, it turns out that the computation of the hashes in the first place is still more expensive, taking O(N^2 * log(N)). So the totals would not change from the algorithm without the hash conflict resolution! It will still be O(N^4 * log(N)). Though of course the proportionality constant would change and the algorithm will become slower, and the optimization that skips the sorting to reduce the power of N won't be available any more.

When collating a large number of graphs, it's possible to optimize by running the version without the collision resolution first, and doing the first collation based on it. What if a collision happens? A different node might be picked to be assigned the same ID, so two equivalent graphs will end up with different signatures. However, since the signature of the graph includes its full topology, two different graphs will never get the same signature. The worst that can happen is that the equivalent graphs will be split between multiple buckets. So then the second pass can be run, computing the signatures with the hash collision resolution. This second run will combine the buckets that have been improperly split by the first pass. The benefit here is that instead of computing the more complex signature for each original graph, we would be computing it only once per bucket from the first pass. And the number of buckets can easily be a few orders of magnitude lower than the number of graphs.

If we're interested only in the top few buckets, we could even skip the whole long tail. But then we have to guard against the case of a popular graph being split into multiple buckets by the collisions. This can be resolved with 2-level bucketing. Along with the collision-agnostic signature, compute the "simple hash" of the graph by taking H(0) of all its nodes, ordering them, and combining them. This simple hash has the properties opposite to the collision-agnostic signature: it would never split the equivalent graphs but it might combine the different graphs. So the simple hash is "optimistic" while the collision-agnostic signature is "pessimistic". Make the first level of bucketing optimistic and the second level pessimistic. Then we can start by throwing away the long tail based on the optimistic bucketing, then compute the exact signatures with collision resolution for each pessimistic class, combining the buckets when necessary, and pick the final top elements.
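A sketch of such an optimistic hash (my own naming, not from any real implementation):

```python
import hashlib

def simple_hash(h0_values):
    """Optimistic graph hash: combine the H(0) values of all the nodes in
    sorted order. Equivalent graphs always get the same value; different
    graphs may collide (the opposite trade-off from the pessimistic,
    collision-agnostic signature)."""
    joined = ",".join(str(v) for v in sorted(h0_values))
    return hashlib.sha256(joined.encode()).hexdigest()
```

Because the per-node values are sorted before combining, the result doesn't depend on the order in which the nodes are listed.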

<<Prev Next>>


## Monday, June 11, 2018

### graph equivalence 4: why it works

<<Prev Next>>


The general graphs are worse, the "redundant links" in them that aren't in the tree present a major problem. Take a look at the "bowtie" graph from the part 3:

```
.
A   D
|\ /|
| C |
|/ \|
B   E
```

If we try to follow the algorithm for the tree, start from the root C and go into the subtree rooted at A, then we would have to go to the 2nd-level subtree rooted at B, then to the 3rd-level subtree rooted at C, and get stuck in an endless recursion. This is not good.

So my first successful attempt at solving this was:

- For each node, find a spanning tree with this node as root. Build the tree from the shortest paths from the root to each node (not only in terms of the number of links in the path but also in terms of the sorting of the tags on the path).
- Build the root node's signature from this spanning tree.
- Then append the information about the redundant links that are not included in the tree:
- Each redundant link can be described as the tag on it plus the paths from both of its ends to the root. For two redundant links to be equivalent, their tags must be equal and all the subtrees rooted at every node included in these paths from the end to the root have to be equivalent.
- So build the lists of such subtrees for paths from both sides of the link, and order these two lists by the usual subtree comparison criteria.
- Then build the signature of the redundant link from its tag and two lists of subtrees, lists ordered between themselves by the subtree order.
- Then order the redundant links according to their signatures.
- Then add the signatures of the redundant links in this order to the end of root node's signature.
- And then the nodes can be compared by these signatures as before, and the rest of the algorithm is the same.

I believe this would work, but in a kind of slow and painful way, even slower than the algorithm for the trees (although for all I can tell, it's still better than the factorial complexity, just the power of the polynomial is kind of high).

Then I've come up with the hash-based version that I described first, and then I've found the explanation of how it works.

Basically, the problem with the algorithm for the tree is that it recurses in depth, and with loops present this depth goes to infinity. The alternative is to limit the depth and to build the signatures of all the nodes in parallel.

We start with the signatures of depth 0 for each node, formed of the tag on the node and of the sorted tags of the links connected to it.

The signatures of depth 1 are formed from the tag on the node, followed by the sorted links *and* signatures of depth 0 of the nodes from the other ends of these links.

The signatures of depth 2 are formed from the tag on the node, followed by the sorted links *and* signatures of depth 1 of the nodes from the other ends of these links.

And so on. We stop when all the signatures include all the nodes and all the links. But when we include all the nodes and do one more step, this guarantees that all the remaining links get included too, so this can be used as an equivalent stopping condition.
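The parallel signature building can be sketched like this: a toy Python version where nested tuples stand in for the signature strings (the representation and names are mine).

```python
def signatures(nodes, tags, links, depth):
    """links[v]: list of (link_tag, other_node) pairs. Returns for each
    node a nested tuple playing the role of its signature at the given
    depth; equal tuples mean topologically equivalent nodes."""
    # depth 0: the node's tag plus the sorted tags of its links
    sig = {v: (tags[v], tuple(sorted(t for t, _ in links[v]))) for v in nodes}
    for _ in range(depth):
        # depth D+1: the tag plus the sorted pairs of (link tag,
        # depth-D signature of the node at the other end of the link)
        sig = {v: (tags[v], tuple(sorted((t, sig[w]) for t, w in links[v])))
               for v in nodes}
    return sig
```

On the bowtie graph above, two levels are enough to separate C from the four corner nodes while keeping A, B, D, E equivalent to each other.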

The signatures we get from this fully represent the topology of the graph, and are dependent only on the topology. So when we order the nodes according to these signatures, we can use the rest of the algorithm from the tree unchanged, assigning the IDs to the unique nodes in this order, and then picking one node from the first equivalency class.

Well, the hashes are the short and efficient way to represent these node signature strings, and ignoring the hash collisions, these algorithms are the same.

I'll describe the solution for the hash collisions in the next part.

<<Prev Next>>

### graph equivalence 3: why it works, simplified

<<Prev Next>>


To understand why the algorithm in part 1 works, let's take a step back and think, what would the topology of a node in a graph mean? If we have 2 nodes in the same graph, how can we tell that their topology is the same? And if it's not the same, how can we order them by topology? When this problem is solved, the signature generation becomes obvious: just find the topology of every node in the graph and order the nodes by topology.

Let's start with the simpler case, when the graph is a tree. Then the problem of the topological equivalency, or more exactly, of the ordering based on the topology, of two nodes is kind of easy. Obviously, first order them by the tags of the nodes themselves. Then for each node take all their subtrees and order them: first by the tag on the link that leads to the root of the subtree, then applying the same ordering logic recursively in the subtrees (to make the comparison faster, we can also compare the number of nodes in the subtrees at some early point). After all the subtrees of each node are ordered, we can compare these lists of subtrees using the same order.
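This recursive subtree ordering can be expressed compactly; here is a Python sketch with my own naming that returns a sort key for each subtree rather than doing pairwise comparison, which amounts to the same ordering (the node-count shortcut is omitted for brevity):

```python
def tree_key(node, parent, tags, links):
    """Canonical sort key of the subtree rooted at node, in a tree given
    as links[v] = list of (link_tag, other_node). Two subtrees are
    topologically equivalent exactly when their keys are equal, and the
    keys sort consistently regardless of the node naming."""
    subs = sorted((t, tree_key(w, node, tags, links))
                  for t, w in links[node] if w != parent)
    return (tags[node], tuple(subs))
```

Sorting the nodes of a tree by the keys of the subtrees hanging off them gives exactly the topological order described above.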

Then we can start writing the nodes into the signature based on this order. If all the nodes have a unique topology, the problem is solved: no matter in which order the nodes came in, they will be always sorted to the same order.

If some nodes are equivalent but there is only one class of equivalent nodes with multiple members, the problem is still solved: this means that any node that links to one of them must be linked to all of them. So these nodes can be listed sequentially in any order, and the signature checking logic will order the links to them according to that order. To give an example, let's look at a small version of the cross-shaped graph:

```
.
   E
   |
D--A--B
   |
   C
```

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, C, D, E) are equivalent. So we can order them in multiple ways:

```
0 1 2 3 4
A B C D E
A E D B C
A C D E B
```

and so on.

Either way the ordered list of links for the node A will be (1, 2, 3, 4) and the ordered lists of links for nodes B, C, D E will be (0), so the signature is the same in any case.

But if there are multiple classes with multiple elements, we can't just pick all of them in any order. Let's have another look at the full cross-shaped graph from the part 1:

```
.
      I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G
```

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, D, F, H) are equivalent, and nodes (C, E, G, I) are equivalent. If we order the nodes

```
0 1 2 3 4 5 6 7 8
A B D F H C E G I
```

then the signature will be:

```
0: (1, 2, 3, 4)
1: (0, 5)
2: (0, 6)
3: (0, 7)
4: (0, 8)
5: (1)
6: (2)
7: (3)
8: (4)
```

If we order the nodes differently:

```
0 1 2 3 4 5 6 7 8
A B D F H I G E C
```

then the signature will be:

```
0: (1, 2, 3, 4)
1: (0, 8)
2: (0, 7)
3: (0, 6)
4: (0, 5)
5: (4)
6: (3)
7: (2)
8: (1)
```

Which is a different signature, so something went wrong. The root of the problem is that as soon as we assign an index to a node, it becomes different from the nodes with the yet undefined indexes. The index sets it apart. So we can pick any node from the first class and assign an index to it, but after that we've got to re-compute the ordering of all the subtrees. The comparison function then must differentiate between the nodes with indexes assigned at the previous pass. So when comparing the (sub)trees, even before comparing the tags on the root nodes, it must compare the indexes of the root nodes. The node with the lowest index would go first, and the nodes without assigned indexes would go after all the nodes with the assigned indexes.

So after we pick the node B for index 1 (and thus place it into its own equivalency class), our equivalency classes change: (A), (B), (C), (D, F, H), (E, G, I). C falls into its own class because it's the only node whose direct subtrees include only the one rooted at B. After we follow through all of the equivalence classes, this produces

```
0 1 2 3 4 5 6 7 8
A B C D E F G H I
```

with the signature

```
0: (1, 2, 3, 4)
1: (0, 2)
2: (1)
3: (0, 4)
4: (3)
5: (0, 6)
6: (5)
7: (0, 8)
8: (7)
```

If we picked the node D instead of B from the first class, the node E would become special, and the resulting indexes might be:

```
0 1 2 3 4 5 6 7 8
A D E F G H I B C
```

But the signature will still be the same!

So, to summarize:

- Ordering the nodes based on the comparison of their subtrees imposes a partial order of the nodes that is based only on the topology.
- We can then assign the indexes to all the unique nodes, in this order.
- Whenever there are equivalency classes of more than one node, an arbitrary node from the first such class can be separated by assigning the next index to it. Any of the nodes that are already unique would have links to either none or all of the nodes of this equivalence class, this is guaranteed by the equivalence, so the choice of the node from this class would not affect the links in the already-indexed nodes in any way.
- If there are other multi-way equivalence classes left, repeat the computation, now taking the previously assigned indexes into account when comparing the trees. If the nodes in these equivalence classes have an asymmetrical relation to the members of the split equivalence class, these nodes will become unique (or at least members of smaller equivalence classes). If their relations are symmetrical to all the nodes of the previously broken up class, they would stay together. But any remaining equivalence classes will be broken up on the following iterations.
- The resulting order of the nodes in the signature is only affected by the topological properties of the nodes, so the equivalent graphs are guaranteed to have the same signature.

It also had been occurring to me repeatedly that not only the first node from the first equivalence class can be assigned an index, but all of them can be assigned indexes at the same time. And as I just as repeatedly discover, this idea would work for a tree-shaped graph but not for a general graph. The reason is that although all the unique (and thus previously indexed) nodes are guaranteed to have links to either all or none of the nodes of this class, the nodes inside this class aren't. They might be connected to only a subset of the other nodes in the same class. To the same number of them, to be sure, but to different subsets. So the indexes can't be assigned to them in a random order, they have to take into account the cross-links within this class. That is, for a general graph. If the graph is a tree, the absence of loops in it means that the only way for the equivalent nodes to be connected between themselves is if there are only two of them, and then they both are either connected to each other or not.

For an example of a graph with loops that would have this problem, consider this one (for now sidestepping the problem of how the ordering of the nodes is computed, it can be seen in such a small graph in an intuitive way):

.  A   D
   |\ /|
   | C |
   |/ \|
   B   E

The equivalence classes after the first iteration are (C), (A, B, D, E). But if we choose the order

0 1 2 3 4
C A B D E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

And with the order

0 1 2 3 4
C A D B E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 3)
2: (0, 4)
3: (0, 1)
4: (0, 2)

Which is wrong. But if we choose to assign the index 1 to A, then on the next iteration the node B becomes distinct from D and E, with the classes (C), (A), (B), (D, E). And after splitting the (D, E), eventually the signature is guaranteed to become

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

no matter which of the 4 nodes we picked on the first split-off, and which of the 2 remaining nodes we picked on the second split-off.
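This example can also be verified mechanically. Here is a little Python sketch (just an illustration for this post, not any real implementation), taking the signature to be the list of each node's sorted neighbor indexes under a chosen order of the nodes:

```python
# The example graph: C in the middle, with the pairs A-B and D-E.
edges = [("C", "A"), ("C", "B"), ("C", "D"), ("C", "E"), ("A", "B"), ("D", "E")]

def signature(order):
    # Assign indexes by position in the order, then list each node's
    # sorted neighbor indexes.
    idx = {node: i for i, node in enumerate(order)}
    adj = {i: [] for i in range(len(order))}
    for x, y in edges:
        adj[idx[x]].append(idx[y])
        adj[idx[y]].append(idx[x])
    return [sorted(adj[i]) for i in range(len(order))]

# The two orders from the text produce different signatures...
assert signature(["C", "A", "B", "D", "E"]) != signature(["C", "A", "D", "B", "E"])
# ...but any order that keeps the connected pairs together produces the same one.
assert signature(["C", "A", "B", "D", "E"]) == signature(["C", "D", "E", "A", "B"])
```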

This part describes the algorithm and its informal proof of how to handle the tree-shaped graphs. The next installment will generalize to the arbitrary graphs.

<<Prev Next>>

## Sunday, June 10, 2018

### graph equivalence part 2: comparing the graph signatures

<<Prev Next>>

In the part 1 I've told how to build the signatures but haven't told how to compare them.

So, a signature of the graph is essentially the list of its nodes ordered in a specific way, by the algorithm from the part 1. The basic idea of comparing the signatures is to compare the first node, if they are equal then the next node, and so on. But how do we compare the nodes?

Obviously, first we compare the tags of the nodes. Then we've got to compare the links: we've got to check that the links with the same tags lead to the nodes at the same indexes in the list. The easy way to do this is to order the links by their tags in advance. And this happens to be the same order that was used during the signature building, so it can be just kept from there. But there still is the problem of what to do when the links have the same tags. In this case we've got to check that these links lead to the sets of the same nodes in both graphs. The easy way to do it is to sort these links by the indexes of the nodes they point to, which can also be done in advance.

So after these pre-computations the comparisons can be done pretty fast, maybe even as a simple byte-string comparison. And the computed node hashes can be checked up front to make the comparison faster. And these hashes can be further combined into one value to produce a hash for the hash map.
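As an illustration of this pre-computation idea (the representation below is my own assumption, not a prescribed format), a signature whose links are already pre-sorted can be flattened into a plain string, reducing the whole comparison to a string comparison:

```python
# A signature here is a list of (node_tag, [(link_tag, target_index), ...]),
# one entry per node, in signature order.

def flatten(signature):
    parts = []
    for node_tag, links in signature:
        # Pre-sort by link tag, then by target index, as described above.
        links = sorted(links)
        parts.append(node_tag + "(" + ",".join(f"{t}:{i}" for t, i in links) + ")")
    return ";".join(parts)

# Two equal signatures compare equal as strings, regardless of the
# original order of the same-tagged links.
sig_a = [("n", [("x", 1)]), ("n", [("x", 2), ("x", 0)])]
sig_b = [("n", [("x", 1)]), ("n", [("x", 0), ("x", 2)])]
assert flatten(sig_a) == flatten(sig_b)
```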

<<Prev Next>>


## Saturday, June 9, 2018

### graph equivalence 1: the overall algorithm

Next>>


I want to talk about some interesting graph processing I've done recently. As a part of a bigger problem, I needed to collate a few millions of (not that large) graphs, replacing every set of equivalent graphs with a single graph and a count. I haven't found much on the internets about the graph equivalency, all I've found is people asking about it and other people whining that it's a hard problem. Well, yeah, if you brute-force it, it's obviously a factorial-scale problem. But I believe I've found a polynomial solution for it, with not such a high power, and I want to share it. It's kind of simple in the hindsight but it took me four versions of the algorithm and a couple of months of background thinking to get there, so I'm kind of proud of it.

But first, the kind of obvious thing that didn't take me any time to think about at all: if you're going to collate a large number of values that have a large degree of repeatability among them, you don't want to compare each one to each one. You want to do at least a logarithmic search, or even better a hash table. Which means that you need to not just compare for equality or inequality but to impose an order, and/or generate a value that can be used as a hash. If you do the factorial-complexity comparison of two graphs, you get neither, all you know is whether they are equal but you don't have any order between the unequal graphs, nor anything you can use as a hash. Instead, you need a signature: something that would serve as a "standard representation" of an equivalency class of graphs and that can be compared in a linear time (you could even represent the signatures as strings, and compare them as strings if you want). And then you can easily impose order and compute hashes on these signatures.

The real problem then is not to compare the graphs but to find such signatures of the graphs. What should these signatures be? If we can order the nodes of a graph based purely on their place in the topology of the graph, that would work as a signature. A graph might have multiple nodes with the same topology (i.e. belonging to the same equivalence class), then any reordering of these nodes would provide the same signature. In a graph with N nodes (vertices) and L links (edges), this signature could be represented as a string of N+L elements, and thus compared in a linear time.

The complication is that if there are multiple equivalence classes, there might be interdependencies between these classes. For example, consider an untagged graph in the shape of a cross (A...I here are not the proper tags on the nodes but some temporary random names that we assign for the sake of discussion):

.     I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G

The nodes (B, D, F, H) are equivalent among themselves, and the nodes (C, E, G, I) are also equivalent among themselves. So we could interchange their positions within each equivalence class. But once we pick the positions for nodes of one class, the positions of the other class become fixed. Once we assign a certain position in the signature to the node B, we have to assign a certain related position to the node C. We could swap the positions of nodes B and D, but then we have to also swap the positions of nodes C and E accordingly.

First let me tell the final algorithm I came up with, and then I'll explain why it works. The algorithm uses hashes, so let's also initially pretend that there are no such things as hash collisions, and then I'll show how such collisions can be resolved.

The algorithm works on the graphs that may have tags (or labels, whatever term you prefer) on the nodes and links. The links may be directional, and this can also be expressed as a tag on them. The nodes might have some tags on the "ports" where the links attach, such tags can also be "moved" to the links. The algorithm uses the "node-centric" view of the tags on the links, i.e. the specific tag used by it depends on which node is used as a viewpoint. For example, suppose we have nodes A and B and a directional link with tag X that goes from port 1 on A to port 2 on B:

.        X
A[1] -------->[2]B

When looking from the node A, such a link can be seen as having the tag "O, 1, X, 2". "O" stands for "outgoing", then the local port, tag of the link itself, the remote port. When looking from the node B, the same link will have the tag "I, 2, X, 1".
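For illustration, the node-centric view can be sketched in Python like this (the tuple representation is my own assumption):

```python
# A directional link with tag X, from port 1 on A to port 2 on B.
link = {"tag": "X", "from": ("A", 1), "to": ("B", 2)}

def view(link, node):
    # Node-centric tag: direction, local port, tag of the link, remote port.
    if link["from"][0] == node:
        return ("O", link["from"][1], link["tag"], link["to"][1])
    else:
        return ("I", link["to"][1], link["tag"], link["from"][1])

assert view(link, "A") == ("O", 1, "X", 2)  # as seen from A
assert view(link, "B") == ("I", 2, "X", 1)  # the same link, as seen from B
```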

The nodes get ordered into the signature by assigning them the sequential IDs starting from 0. The algorithm starts with no IDs assigned, and assigns them as it works.

The algorithm computes the hashes of the graph's topology around each node, up to the distance (you can also think of it as radius) D. The distance 0 includes the tag on the node itself and the tags on the links directly connected to it. The distance 1 also includes the immediately neighboring nodes and links connected to them, and so on. For each node, we keep growing the distance by 1 and computing the new hashes. It's easy to see that when we get to the distance D=N, we're guaranteed to include the topology of the whole graph into the hash of every node. But a general graph is not a tree, it also contains the "redundant" links, and to include the nodes on the other side of all the redundant links, one more step would be necessary, to D=N+1. The inclusion of these nodes for the second time is needed to include the full topology of these redundant links.

However for most graphs the maximal radius will be smaller. If each node is connected to each node, the whole graph will be included at D=1. So we don't have to iterate all the way to N, instead we can keep the set of the included nodes for each hash, and we can stop growing when this set includes all the nodes in the graph.

Let's call these hashes H, and the node sets S.

The hash H(0) and set of nodes S(0) for the distance 0 around a node is computed as follows:

1. Include this node itself into the set of nodes S(0).

2. If the node has an ID assigned, include this ID into the hash and stop. Otherwise continue with the next steps.

3. Include the tag of the node itself into the hash H(0).

4. Order the links connected to this node in some fixed way (for example, by comparing the components of their tags in the order shown in the example above: the direction, local port, tag of the link itself, remote port).

5. Include the tags of the links in this order into the hash H(0).

The hash H(D+1) and set of nodes S(D+1) for the distance D+1 around a node is computed by induction as follows:

1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).

2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.

3. Include the hash H(0) at distance 0 of the node itself into H(D+1).

4. Order the links connected to this node in the same way by the tags, but with an addition of the information about the nodes at the other end of the link: if some links have equal tags, order them according to the previous hash H(D) of the nodes at the other ends of these links.

5. Include the tags of the links and H(D) of the nodes at the other ends of the links into H(D+1).
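These induction rules can be sketched in Python as follows (an illustration of mine, with hashes as plain Python ints, unassigned IDs as None, and links as (tag, neighbor) pairs):

```python
def h0(graph, node, ids):
    # graph: {node: (tag, [(link_tag, neighbor), ...])}; ids: {node: int or None}
    if ids[node] is not None:
        return hash(("id", ids[node]))
    tag, links = graph[node]
    return hash((tag, tuple(sorted(lt for lt, _ in links))))

def h_next(graph, node, ids, h_prev):
    # One induction step: H(D+1) from the neighbors' H(D).
    if ids[node] is not None:
        return h0(graph, node, ids)
    _, links = graph[node]
    # Order the links by tag, breaking ties by the neighbors' previous hash.
    ordered = tuple(sorted((lt, h_prev[nb]) for lt, nb in links))
    return hash((h0(graph, node, ids), ordered))

# A path A--B--C with untagged links: the two ends hash alike at every distance.
g = {"A": ("n", [("e", "B")]),
     "B": ("n", [("e", "A"), ("e", "C")]),
     "C": ("n", [("e", "B")])}
ids = {n: None for n in g}
h = {n: h0(g, n, ids) for n in g}
for _ in range(len(g)):  # grow the distance D
    h = {n: h_next(g, n, ids, h) for n in g}
assert h["A"] == h["C"]  # topologically equivalent nodes
assert h["A"] != h["B"]
```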

The rest of the algorithm then is:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. Order nodes in the ascending order by (ID, H(D)), the unassigned IDs go after all the assigned IDs.

1.1.3. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.1.4. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assigned ID.

1.2. If there are still any nodes with unassigned ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.
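The whole loop can be condensed into a toy Python sketch (my own illustration: untagged undirected graphs only, ignoring the sets S, the early exit, and hash collisions):

```python
def canonical_order(adj):
    # adj: {node: set of neighbors}; untagged, undirected.
    nodes = sorted(adj)
    ids = {n: None for n in nodes}

    def h0(n):
        # Distance-0 hash: the assigned ID, or here simply the degree.
        return hash(("id", ids[n])) if ids[n] is not None else hash(("deg", len(adj[n])))

    def next_id():
        return sum(1 for m in nodes if ids[m] is not None)

    while any(ids[n] is None for n in nodes):          # step 1
        h = {n: h0(n) for n in nodes}
        for _ in range(len(nodes) + 1):                # step 1.1, D = 1..N+1
            h = {n: h0(n) if ids[n] is not None
                 else hash((h0(n), tuple(sorted(h[m] for m in adj[n]))))
                 for n in nodes}                       # step 1.1.1
            free = [n for n in nodes if ids[n] is None]
            for n in sorted(free, key=lambda x: h[x]): # step 1.1.4
                if sum(1 for m in free if h[m] == h[n]) == 1:
                    ids[n] = next_id()
        free = [n for n in nodes if ids[n] is None]
        if free:                                       # step 1.2: tie-break
            ids[min(free, key=lambda x: h[x])] = next_id()
    return sorted(nodes, key=lambda n: ids[n])

def sig(adj):
    # Signature: each node's sorted neighbor indexes, in canonical order.
    order = canonical_order(adj)
    idx = {n: i for i, n in enumerate(order)}
    return [sorted(idx[m] for m in adj[n]) for n in order]

# Two differently-labeled copies of the same graph (a hub with two
# connected pairs hanging off it) produce identical signatures.
g1 = {"A": {"C", "B"}, "B": {"C", "A"}, "C": {"A", "B", "D", "E"},
      "D": {"C", "E"}, "E": {"C", "D"}}
g2 = {"X": {"P", "Q", "R", "S"}, "P": {"X", "Q"}, "Q": {"X", "P"},
      "R": {"X", "S"}, "S": {"X", "R"}}
assert sig(g1) == sig(g2)
```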

Let's estimate the run time:

At least one ID gets assigned on each iteration of the outer loop, so there would be at most N iterations. The inner loop would also have at most N+1 iterations. The most expensive operations in it are the computation of the next H(D) and the sorting. The computation of H(D) would go through N nodes and would sort the links connected to each node, taking O(EperN*log(EperN)) steps. The sorting of the nodes would take O(N*log(N)) steps. So let's take O(N*N*log(N)) as the worst case that covers both. The total then comes to O(N^4 * log(N)), which is much better than the factorial.

The resolution of the hash collisions would add to this but on the other hand, some optimizations are possible.

When we compute H(D+1), the links can be sorted by their tags only once, with the further sorting only among the links with the same tags. This sorting is needed to make sure that these links are treated as a set, that no matter in which order the remote nodes are originally seen, the set of remote nodes with the matching H(D) would produce the same H(D+1). But there is another, more hacky but faster way to achieve that: combine the hashes of nodes at equivalent links in a commutative way, such as by addition, before including them into H(D+1). This would remove one power of N, and the total would be O(N^3 * log(N)).
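A tiny Python illustration of the commutative trick: adding the neighbor hashes before mixing them in makes the result independent of the order in which the neighbors are seen, with no sorting needed:

```python
def combine(own_hash, neighbor_hashes):
    # sum() is commutative, so the order of the neighbors doesn't matter.
    return hash((own_hash, sum(neighbor_hashes)))

# The same set of neighbors in a different order gives the same result.
a = combine(42, [hash("x"), hash("y"), hash("z")])
b = combine(42, [hash("z"), hash("x"), hash("y")])
assert a == b
```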

Also, as soon as a node gets its ID, its hash becomes fixed, so there is no need to compute it any more. And we can skip computing the set S for them too. This means that the other nodes would not necessarily include the whole graph into their set S, since the propagation would stop at the nodes with the assigned IDs. But that's fine, because the inner loop would stop anyway when all the sets S stop changing. So the inner loop has to work only on the nodes that have no ID assigned yet.

More on the hash collision resolution, and on why this works at all, in the next installment(s).

P.S. The description of how to compare signatures is also in the next installment.

P.P.S. An actual implementation example can be found in https://github.com/tensorflow/tensorflow/blob/ab8e195d2e0978c21234a5632d4fabf47535eda1/tensorflow/core/grappler/graph_analyzer/sig_node.h (and the nearby files).

Next>>

## Tuesday, May 29, 2018

### SQL "With" statement

I've recently learned about the WITH statement in SQL that had apparently been there since the 1999 standard. It pretty much allows you to define the "temporary variables" (or "temporary views") in the SQL query for common sub-expressions, like this:


WITH name1 AS (SELECT ...),
     name2 AS (SELECT ...)
SELECT ...
FROM name1, name2, whatever ...

This solves a major problem with the SQL statements, especially for self-joins on such pre-computed data, and lets you do fork-and-join in a single statement. I'm surprised that it wasn't used all over the place for the CEP systems.
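For a runnable illustration (here through Python's sqlite3 module, with a made-up trades table; any engine supporting the 1999 standard would do), the WITH clause pre-computes a result once and then joins back to it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (sym TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("A", 10.0), ("A", 12.0), ("B", 5.0)])

# Pre-compute the per-symbol averages once, then join the raw rows
# back to them: fork-and-join in a single statement.
rows = conn.execute("""
    WITH avgs AS (SELECT sym, AVG(price) AS ap FROM trades GROUP BY sym)
    SELECT t.sym, t.price, a.ap
    FROM trades t JOIN avgs a ON t.sym = a.sym
    WHERE t.price > a.ap
""").fetchall()
print(rows)  # [('A', 12.0, 11.0)]
```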

## Monday, March 19, 2018

### Comments and names

I've read an opinion that if the names of the methods are self-explanatory, the comments describing these methods are not needed.

Well, sometimes it's so, but in general I strongly disagree with this idea. The names and comments have different purposes. The point of the name is to give the method a designation that is difficult to confuse with the other methods. The point of the comment is to explain the meaning of a method. Mixing them doesn't do well.

The attempts to make the names more self-explanatory back-fire in two ways: the code becomes longer, and the names become less distinct. And the worst part, it still doesn't explain enough, you can't understand such an uncommented method until you read its implementation. A good comment must explain the various considerations that went into the development of the method, its pre-conditions and post-conditions, and the exact meaning of parameters. There is simply no way to fit it into a name.

In a way, a method name is similar to an acronym in the texts. Or, well, not necessarily an acronym but some short meaningful name. The first time you use this acronym or word, you explain what it means. When you define a method, you define what it means in the comment, and give it a short name. Then, as you use the word in the rest of the text, you use the method name in the program. And if the reader forgets what it means, he can always look up the explanation. If you used the full multi-word definition throughout your text, it would grow completely unreadable.


## Sunday, December 17, 2017

### waiting for a condition: another way

I've recently encountered another way of doing synchronization without condition variables in the Abseil library: their mutex has the method Await(). It provides an alternative to the construction (copied from their description):


mu.Lock();
... // arbitrary code A
while (!f(arg)) {
    mu.cv.Wait(&mu);
}
... // arbitrary code B
mu.Unlock();

as

mu.Lock();
... // arbitrary code A
mu.Await(Condition(f, arg));
... // arbitrary code B
mu.Unlock();

Here the condition is given in the argument of Await() as a functor, an object that keeps the related context and executes a function in this context, in the new-fangled C++14 code that's usually given as a lambda.

This works by evaluating all the currently requested conditions on unlock, and if one of them becomes true, waking up the appropriate thread. Looks kind of wasteful on the first sight but there are good sides to it too.

One of the typical problems with the classic condition variables is "how fine-grained should they be made"? Should there be a separate condition variable for every possible condition or can the conditions be mixed sometimes on the same variable? The Await() solves this problem: it can easily allocate a separate variable per condition, and do it right on the stack in Await(). The obvious problem with multiple conditions sharing the same condition variable is that then on a signal multiple threads would wake up, some of them will find their condition and continue, some will not find it and go back to sleep, creating the churn. Await() clearly solves this problem.

But it can do one better too. Another problem is when the right thread wakes up but has to go back to sleep because before it woke up some other thread snuck in, got the mutex and ruined the condition. Await() can guarantee that it won't happen by simply keeping the mutex locked but passing its ownership to the waiting thread. Then when the waiting thread wakes up, it doesn't even need to check the condition again, because it's been already checked for it. This is one of those times where having your own low-level thread library pays off, you couldn't do such a trick with a Posix mutex.

So when they say that it about evens out on performance with the classic condition variables, I think there are good reasons to believe that.

## Saturday, September 30, 2017

### removing the duplication in Bayesian & AdaBoost

I've been thinking some more on how to remove the duplication between the Bayesian events/AdaBoost partial hypotheses. Here are some notes. They're not some final results, just some thinking aloud.

I think the problem of removing the commonality can be boiled down to the following: suppose we have an event A that gives the prediction with some effect *a* on the training cases, and an event B that gives the prediction *b* on the training cases. These predictions have some part *ab* (that's a 2-letter identifier, not a multiplication of *a* and *b*) in common. Then the effects can be written as:

A: a = a' + ab

B: b = b' + ab

Where

Where *a'* and *b'* are the differences between the original *a* and *ab*, and *b* and *ab*. Here the operator "+" means not addition but application of the events in the deduction one after another. So "a + b" means "apply a, then apply b". The reason why I chose the sign "+" is that this combined application should act like addition, i.e. have the commutative and associative properties:

x + y = y + x

(x + y) + z = x + (y + z)

No, I don't have a proof why they do. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutative requirement. The exact operation that satisfies these rules will need to be found but for now we can follow through the results of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part *ab* gets applied twice. To remove this duplication we've got to get rid of one copy of *ab*. We can do this by splitting the original two events into three events A', B' and AB (again, a 2-letter identifier, not a multiplication):

A': a'

B': b'

AB: ab

a = a' + ab; b = b' + ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' + ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' and c

b'c would be the overlap between b' and c

abc would be the overlap between ab and c

And the double-primes would be the "remainders":

a' = a'' + a'c

b' = b'' + b'c

ab = ab' + abc

c = c' + a'c + b'c + abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does the "overlap" mean, and how do we compute the value of *ab* from the values of *a* and *b*?

The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split *a* and *b* into *a'*, *b'*, and *ab*, we would be splitting the whole set of training cases into 3 parts: one part gets predicted well by both A and B, one only by A, and one only by B.

Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.

The last sentence also suggests the third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues in case of the partial confidence during the execution of the model that can be handled better with the combined events.

Returning to the definition of "+", I couldn't figure out yet how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms, that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of number of events being limited by the number of training cases is pretty much equivalent to the computation directly on the training cases by weight.

### AdaBoost conjunctions & XOR

I've been thinking recently about the better ways to extract the commonality from two Bayesian events (hope to find time to write about that soon), and that brought me back to the XOR problem: how to handle the situation when two events indicate a certain hypothesis when they are the same and another hypothesis when they are different.

BTW, I took a class on machine learning recently, and it turns out that even with the neural network this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by the human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.

So, thinking about it, I've remembered again the concept of conjunctions that I've read about in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in chapter 9.

The idea there is that the quality of AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses: take the two best choices and try a conjunction of them; if it gets better, try adding a third one.

For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:

```
        ^ y
   0    |    1
        |
 -------+------->
        |       x
   0    |    0
```

The two partial hypotheses (x > 0) and (y > 0), taken together but separately, will identify the upper-right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right corners will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right corner, like (x > 1), (y > 1), (x > 2), (y > 2) and so on, so that the upper-left and lower-right quarters would get better but the parts of the upper-right close to the boundary would get worse. If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and it would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.

The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.

They can be used to handle the XOR situation as well. Let's look at the situation with 1s in upper-right and lower-left corners:

```
        ^ y
   0    |    1
        |
 -------+------->
        |       x
   1    |    0
```

The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get voted correctly by both hypotheses, but the 1s would get one vote for and one vote against. But similar hypotheses can be added for the other two quarters: !((x > 0) & (y < 0)) and !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite votes canceling each other for the 0s; taken together, these 4 hypotheses can identify everything correctly.

An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example. They're actually the worst ones, having the exact 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction they become optimizable.

The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:

```
        ^ y
   0    |    1
        |
 -------+------->
        |       x
   0    |    0
```

If we use one XORed hypothesis, !((x > 0) XOR (y > 0)), and two simple ones, (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right corner will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.

The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.

Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.

## Monday, September 25, 2017

### Boosting book reference

I've accidentally found the Freund and Schapire's book on AdaBoost online:

https://mitpress.mit.edu/sites/default/files/titles/content/boosting_foundations_algorithms/toc.html

Great for searching.

