Sergey Babkin on CEP and stuff: how to reinvent the bicycle 2

I've been thinking more about the problem of teaching the engineering
approach to problem solving, and I've decided to try using another
interview problem as an example. It's an older problem, and I really like
it, but I don't use it any more because the other people liked and used it
too, and the solutions are already available on the Internet.

I'm not just going to give the solution, blindly remembering that is of no
use to anyone. I want to show how the better solutions can be born out of
the bad solutions. I'm going to start with the worst solution I can think
of, and then gradually show the better solutions. The puzzle for you, the
reader, is to use the issues in these solutions as hints towards
the better solutions, that would take you as far ahead as possible.

I wrote those solutions as I would do at an interview, without actually
compiling and running the code on a computer. So they might contain bugs,
from trivial bugs to the major ones. Whenever I've noticed a major bug
after writing a solution and following through its execution manually, I've
left the buggy version in, and added the fixed version. Sometimes a solution
was so overcomplicated that I didn't notice a bug in it until I moved on to
another, simpler solution. There still might be bugs, but hopefully not too
bad ones. It just shows that the overcomplicated approaches should be
avoided.

The problem is to write a matcher for the very simple regular expressions,
that include only the operators "." (any character) and "*" (any number of
repeats of the previous character). The "*" is greedy, consuming as many
matching characters as possible. There is no escape character like
backslash. The string must match completely, as if the regexp implicitly
had "^" at the font and "$" at the end.

For the Windows people who don't know regular expressions, the regexps are
a way of pattern matching, kind of like the file patterns (which are known
in the Unix world as globs) but different. Most of the characters in the
regexps have the literal meaning, just as in the file patterns. The
character "." is like "?" in the file patterns, matching exactly one any single
character. The character "*" is different than in the file patterns. It
means zero or more repetitions of the previous element, so the file pattern
"*" is analogous to the regexp ".*".

The function declaration will be:

bool match(const char *pattern, const char *text);

Let's start with the analysis.

The first thing to notice about this problem is that some regular expressions
in it are impossible to match. The "a*a" will never match anything because
the greedy "a*" will consume all the "a"s, and the second "a" will never encounter
a match. The same goes for "*." followed by anything, because ".*" will
consume everything to the end of the string.

The other thing to consider is, could the strings be in UTF-8? UTF-8 can
be handled transparently in many ways but this is not one of them. If
treated simple-mindedly, the "*" after a multi-byte Unicode character would
expect the repeat of its last byte. And a "." would match a single byte of
a multibyte character. Specifically for UTF-8, this problem
can be handled by keeping track of the continuation. Every byte of a
multi-byte character in UTF has the high bit 0x80 set, and the first one of
them has also the bit 0x40 set, so you can parse where the character
starts. I've looked up the exact bitmasks on the Internet, it's not
something that I'd expect anyone to know off-hand at the interview, nor the
details of the exact UTF-8 encoding, but the general principle that UTF-8
is easy to parse is a reasonable knowledge. However this kind of parsing is a
bad idea anyway, because it would mess up the handling of the strings in
the single-byte 8-bit encodings. A much better approach is to use the
library functions to either convert the whole strings to wchar_t (requires
the allocation of extra memory) or to parse the strings
character-by-character as they get read (better). This is important to
mention to show that you understand the multi-byte encoding issues, and can
bring you bonus points at an interview. But the problem itself is not about
that, the multi-byte support can be trivially added after the general
solution is found, so let's assume that we don't need to care about that
for now, and the strings are in ASCII.

The first solution goes in the most complicated way I can think of. You
might have attended a college course on parsing that talked about the
finite machine matcher for the regular expressions. The most unsophisticated
approach is to push this way blindly.

Each node of the finite machine graph could be a plain array of the
possible exits from that node, but to make things easier to write, let's
use a C++ map:

using namespace std;

struct Node {
    map<char, Node*> exits_;
};

The parsing of the pattern will start like this:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    Node *parsed = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        ...
    }

}

Since we're dynamically allocating the Nodes, we need to take
care of freeing them too. A simple way to do it is to use the reference
counting in the form of shared_ptr or unique_ptr. But we can't store them
right in the Node, because the Nodes may have circular references. The
unique_ptr just will refuse to do so (if you insist with std::move(),
you'll get NULL pointers in the unexpected places, and will probably
crash), and shared_ptr would create a link loop and leak the memory. It's
possible to arrange the entries in a plain list, and then free them at the
end manually. But the easier way to do it is to keep the reference counts
in a separate vector, and let the nodes connect in whatever ways with the
pointers. And these should be the real pointers, not the weak references.
The weak references are needed when you might need them to become strong at
some point. Here we have no such need, so save the overhead.

We can also use the holder to find the first and last allocated node, so we
don't need the separate parsed pointer.

What should the loop do? Since the Nodes contain the map of exit links and
not entry links, on each literal character in the pattern we'll need to
create a new node, and then update the _previous_ node with the exit link
to the new one. So we need to start by allocating one initial node before
the loop, to always have a previous node.

When we see a '.', we populate the map for every character code except '\0' to
point to the next node. If we had to deal with the wide characters, this
wouldn't work, because the number of characters would be too great. Then
we'd have to add a special "default" link. But for ASCII it's good enough.

When we see '*', we don't allocate a new node but make the previous node a
copy of its previous node, pointing to itself (so we should also keep a
pointer to the node before last).

And when we see a '\0', we will add to the last node an exit that maps the
'\0' to a NULL pointer, this being a signal to stop. We could also add a
separate boolean field final_ for this purpose, but a null pointer is just
as good, and safer in the way that we wouldn't need a separate code path
for checking the finality, and will always check all the pointers for NULL.

Here is the loop fleshed out:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->exits_[(char)c] = holder.back().get();
                }
            } else {
                previous->exits_[*p] = holder.back().get();
            }
        }
    }
    holder.back()->exits_['\0'] = nullptr;

}

This code has to deal with some situations that "shouldn't happen", the
invalid patterns that either start with a "*" or have multiple "*" in a
row. This is something that needs to be called out explicitly. Two ways to
deal with it is to either return some kind of an error indication (such as
throw an exception) or just process them in some not officially defined
way. This code takes the approach of the silent handling, simply because
it's easier to do in a small code snippet. From the caller's standpoint it
might be either good or bad: the good is that the caller won't have to
handle the errors, the bad is that the author of the pattern might be
surprised by its effect. Whatever choice you make, you need to be aware
of this trade-off. The silent handling here is that the "*" at the start is
treated as a literal, and multiple "*" are treated as one.

This code is also not correct in the handling of "*", in two ways! How do I
know this? I tried to run through it manually on the examples of what I see
like corner cases: "", "a", ".", "a*", "a*b", "a*a", ".*", ".*a" with the
matching and mismatching strings, and I actually tried the more complex
cases first.

The first defect is that it doesn't handle the greediness right. A pattern
like ".*a" will not be unmatchable, instead it would mean what in the full
regexps can be expressed as "[^a]*a". Which might be not bad by itself but
isn't what the problem statement says. Worse yet, due to the same mistake
the pattern "a*a" will be equivalent to a single "a", which is outright
wrong. The fix for this problem is to check whether the map value at this
key has been already set, and if so then don't overwrite it:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (exits_.find(c) == exits_.end()) {
            exits_[c] = to;
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                }
            } else {
                previous->set_exit(*p, holder.back().get());
            }
        }
    }
    holder.back()->set_exit('\0', nullptr);

}

The second problem is even worse, "*" ends up meaning "1 or more
repetitions" instead of "0 or more repetitions", like the operator "+" in
the full regexps. To fix that, on the next character after a "*" we need to
add the new link to both the previous and pre-previous node.

The state machine graph for the expression "a*b" used to look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X

Now it will look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X
 \         /
  +-------+
     b

The mistake I've made here is that I haven't started writing out the
explicit graph examples until after finding the bug. It could have been
prevented by thinking through the graphs from the start.

The code change is:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;
    Node *before_star = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            if (before_star == nullptr) {
                holder.back()->exits_ = previous->exits_;
                before_star = previous;
            }
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                    if (before_star) {
                        before_star->set_exit((char)c, holder.back().get());
                    }
                }
            } else {
                previous->set_exit(*p, holder.back().get());
                if (before_star) {
                    before_star->set_exit(*p, holder.back().get());
                }
            }
            before_star = nullptr;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (before_star) {
        before_star->set_exit('\0', nullptr);
    }

}

This finally took care of the pattern parsing. Mostly. It still has a bug
with handling the sequences of starred characters, such as matching the
pattern "a*b*c" to the string "c". It will not accept "c" because at each
point we update only one previously starred node, not the whole sequence of
them. This is a pretty tricky code, and I didn't even notice this error
until I started doing this problem the next, easier, way, and noticed it
there, and then realized that the first version had this issue too.

As it has turned out after I drew the graph and traced the code, it's even
worse than that. In the "a*b*c", it would accept any mix of "a"s and "b"s,
rather than strictly "a"s followed by "b"s, caused by the copying of all the
links from the node before star.

Well, let's fix it. The graph for "a*b*c" used to look like this:

 
      a      a,b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
 \     \     /       /
  +-----\---+       /
   b     +---------+
              c

It should look like this:

 
      a       b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
|\     \     /       /|
| +-----\---+       / |
|  b     +---------+  |
|             c       |
 \                   /
  +-----------------+
          c

Whenever there is a sequence of starred characters, there has got to be a
link from before each starred node to each following starred node, and to
the node past the stars. The only way I could figure this out was with the
graph diagram.

This is very difficult to do by just following the graph, because there is
no indication in the graph, which link is going to the following node and
which go to the other nodes including itself.

The better way to resolve it is to notice that we're appending the nodes to
the vector sequentially anyway. So we can use the vector order to go
through them in sequence. Thus the variable before_star turns from a
pointer into a vector index:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (c == '.') {
            for (int i = 1; i <= 255; i++) {
                if (exits_.find((char)i) == exits_.end()) {
                    exits_[i] = to;
                }
            }
        } else {
            if (exits_.find(c) == exits_.end()) {
                exits_[c] = to;
            }
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    int first_link = 0;
    char last_char;

    enum State {
        AFTER_STAR,
        AFTER_LITERAL,
    };
    State state = AFTER_LITERAL;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && holder.size() > 0) {
            holder.back()->set_exit(last_char, holder.back().get());
            state = AFTER_STAR;
        } else {
            last_char = *p;
            if (state == AFTER_LITERAL) {
                first_link = holder.size() - 1;
            }

            holder.push_back(make_shared<Node>());
            for (int i = first_link; i < holder_size() - 1; i++)
                holder[i]->set_exit(*p, holder.back().get());

            state = AFTER_LITERAL;
        }
    }
    for (int i = first_link; i < holder_size() - 1; i++)
        holder[i]->set_exit('\0', nullptr);

}

The setting of all exits on a "." became used in multiple places, so it
moved into set_exit(). The parsing of the star syntax became more
complicated, requiring to remember the last character, and to keep the
state for detection of where the sequence of starred characters ends.
The "classic" way of state handling is to put the state processing as a
switch statement in the loop but that would have made the code way too
painful and repetitions, so I've decided to skip that example of doing
things per an unsuitable pattern and move on right to the more
straightforward version above.

Now let's do the pattern matching:

bool match(const char *pattern, const char *text) {
    // ... skipped the pattern parsing ...

    Node *cur = holder.front().get();
    for (const char *t = text; cur != nullptr; t++) {
        auto it = cur->exits_.find(*t);
        if (it == cur->exits_.end())
            return false;
        cur = it->second;
    }
    return true;
}

It's done but is hadn't been a small feat, with all the memory management
thrown in, and all the opportunities to make the mistakes in the pattern
building. It can be powered though as a show of brute force, but I don't
think that I've done it all in under an hour, which it the usual interview
time slot.

I hope that at the very least by the time I've started talking about
building the holder vector, you've got a better idea: this vector goes in
parallel with the original pattern string, so why not just work directly
with the pattern string? Indeed, this is a much easier way. But it also has
the harder and easier version.

Again, let's start with the harder version. The loop will be working in
very much the same way as the matching loop in the parsed-pattern version
(as some textbooks would teach), but will read the pattern directly from
the string.

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points after the last literal
            // character but possibly before the following star

            if (*p == '*') {
                if (last_pattern == '.' || *t == last_pattern) {
                    if (*t == 0)
                        return true;
                    break;
                }
                ++p;
            }

            last_pattern = *p;
            if (*p == '.' || *t == *p) {
                if (*t == 0)
                    return true;
                ++p;
                break;
            }

            if (*p == 0)
                return false;
            ++p;
            if (*p != '*')
                return false;
        }
    }

    return false; // never reached
}

The "for" loop here is interesting, without an exit condition. This is
because the matching of '\0' follows the common pattern, and is done in the
same loop code, with the only addition being the check for the end of lines
when we detect a match of the literals. Otherwise, if we'd checked the text
and pattern for EOL in the "for" loop condition, we'd have to repeat the
inner "while" loop once more after the "for" loop, to account for the case
when the pattern ends with a sequence of starred characters.

The inner loop is necessary to handle the sequences of multiple starred
characters, such as "a*b*c" matching the "c". If we don't do the loop, "c"
would get compared to "b" and the match will be considered failed. Since
this code is much smaller and simpler than the parsed-pattern version, this
becomes much easier to notice and implement - just add a loop!

Speaking of the stars, this version takes a different approach to the
improperly placed stars in the pattern, just because it was easier to do
this way. The leading star gets ignored, and the second star in a row gets
treated as a literal. Which is OK, since this is the behavior for the
undefined patterns (unless you're writing code that must be compatible with
some existing library and need to repeat its quirks).

This code also contains a bug. It happily matches the '.' followed by
anything in the pattern to an end of line in the text. To fix that bug, the
conditions of the form

if (*p == '.' || *t == *p)

need to be replaced with

if (*t == *p || *t != 0 && *p == '.')

There is another way to take the same general approach but keep the main
loop more classic, by using a function for the internal while loop, and
then this function can be easily called twice. It isn't any better or
worse, just different (and I've included the bug fix into it):

bool match_one(char &last_pattern, const char *&p, char c) {
    while (true) { // for starred sequences in pattern
        // at this point p always points after the last literal
        // character but possibly before the following star

        if (*p == '*') {
            if (c == last_pattern
            || c != 0 && last_pattern == '.') {
                break;
            }
            ++p;
        }

        last_pattern = *p;
        if (c == *p || c != 0 && *p == '.')
            ++p;
            break;
        }

        if (*p == 0)
            return false;
        ++p;
        if (*p != '*')
            return false;
    }

    return true;
}

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; *t != 0; t++) {
        if (!match_one(last_pattern, p, *t))
            return false;
    }
    if (!match_one(last_pattern, p, '\0'))
        return false;

    return true;
}

And as yet another variation of the same, you could also look ahead for the
stars after the pattern and save this state in a separate variable:

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    bool last_starred = false;

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points before the next literal
            // character

            if (last_starred) {
                if (last_pattern == '.' || *t == last_pattern) {
                    break;
                }
            }

            last_pattern = *p++;
            last_starred = (last_pattern != 0 && *p == '*');
            if (last_starred) // skip over the star
                ++p;

            if (*t == last_pattern
            || *t != 0 && last_pattern == '.') {
                if (*t == 0)
                    return true;
                break;
            }

            if (!last_starred)
                return false;
        }
    }

    return false; // never reached
}

I think that this version is much simpler and in the real world if I
started writing the previous version, I would have never finished it, and
switched to this one.

But even this version is not great. Having to keep state feels like too much
work. It would be much easier to go the other way around, iterate through
the pattern and consume the matching characters from the text. Then the
state will become embodied in the code rather than variables, and will be
simpler:

bool match(const char *pattern, const char *text) {
    const char *t = text;
    for (const char *p = pattern; *p != 0; p++) {
        if (p[1] == '*') {
            if (*p == '.') {
                while (*t)
                    ++t;
            } else {
                while (*t == *p)
                    ++t;
            }
            ++p; // adjust to consume the star
        } else if (*p == '.') {
            if (*t++ == 0)
                return false;
        } else {
            if (*t++ != *p)
                return false;
        }
    }
    return *t == 0;
}

This version is much smaller and much easier to follow through. It does an
explicit selection by the type of each pattern element, so each one of them
has its own code fragment and not the logic that is spread through the code
and mixed with the logic of the other elements. And all this makes the
creation of bugs more difficult.

A slightly special thing about this version is that it looks ahead by two
characters, not just one, to detect that the current pattern character is
followed by a star. It's not something that's usually taught but it makes
the code a lot easier.

This version also has a theoretical foundation: it's a recursive-descent
LL(1) parser of the text. Except that the regular expressions define a
non-recursive language, so there is no recursion. It really is perfectly
per textbook, you've just got to pick the right textbook! It also parses
not a fixed grammar but one given in the regular expression pattern. So
it's an LL(2) parser of the pattern, with the nested LL(1) parsers of the
matching substrings in the text. The 2 in LL(2) means that we're looking
ahead by 2 characters. It can also be parsed by an LL(1) parser as has been
shown before but looking ahead by 2 characters makes it easier.

It is the version that kind of came to my mind almost right away when I
first thought about this problem. But I can't really say that it just pops
in my mind out of nowhere. I do size up the different approaches in my
mind, and try the ones that look simpler first. It doesn't mean that this
first estimation is always right. Sometimes I go pretty deep along one
approach before deciding to abandon it, and applying the lessons learned to
another approach. And sometimes this another approach end up even worse,
but the lessons learned there help to get through the logjam on the first
approach.

So if you start with the poor approaches, you can still arrive at the
better ones by listening to the hints that the code gives to you as you
write it. When you see an easier way to go, use it. You can also power
through the difficult approaches to the successful end but that tends to be
much more difficult than switching the approach to a simpler one.

Sergey Babkin on CEP and stuff

Wednesday, November 7, 2018

how to reinvent the bicycle 2

No comments:

Post a Comment

Links

About Me

Labels

Blog Archive