Sunday, August 29, 2021

lexical analysis with regular expressions

Today I've got this simple idea of doing the lexical analysis (i.e. tokenization) in Perl with regular expressions: write all the possible tokes as alternatives in the regular expression, and use Perl's ability to embed code into the expressions to produce the token id. For example:

$s = "++abc"; # string to tokenize
$t = 0; # token ids returned here
while ($s =~ s/^\s*([+][+](?{ $t=1; })|[+](?{ $t=2; })|[a-z]+(?{ $t=3; }))//) {
  print $t, "=", $1, "\n";
}

This looks for the tokens of "+", "++", and a simple lowercase identifier. One caveat I've found is that if one alternative is a prefix of another (such as "+" and "++"), the longer alternative must go first. But otherwise it looks pretty simple, and should be efficient. I'm probably not the first one to discover it, but I've been looking for the various lexer solutions in Perl and Python and haven't seen this one yet. Of course, it would look better in the extended syntax, and with symbolic token names.

Doing the same in Python is a bit more difficult, since it doesn't allow to execute code as part of the expression. So the parsing would be just to a string, and then matching by a string. Or then looking up the token id by a dictionary (which gets a little tricky if there could be more than one non-constant lexeme, but I guess those could be sorted in the second round if needed). Something like this:

import re
tokens = { "++": 1, "+": 2, } # regexp can be auto-generated from reverse-ordered dict keys
lex = re.compile(r'^\s*(' + r'[+][+]' + r'|[+]' + r'|[a-z]+' + r')') # the string to parse
s = "++abc"
m = lex.match(s) # 3 is the token id for the identifiers that are variable
t = tokens[m.group(1)] if m.group(1) in tokens else 3 # consume the token from the string
s = s[m.span()[1]:]

It doesn't do the loop but goes through all the important steps.

P.S. In Perl, the replacement could be avoided by using the global scanning instead, scan with option /g, and use \G as the anchor for the last end of string, new start of string:

while ($s =~ /\G\s*([+][+](?{ $t=1; })|[+](?{ $t=2; })|[a-z]+(?{ $t=3; }))/g)

No comments:

Post a Comment