A simple lexer in Python

I’m taking a course on building compilers at the Israeli Open University and just learned how to use flex. It occurred to me that building a simple lexical analyzer should be quite easy with Python’s re module. A typical lexical analyzer reads a stream of text input and splits it into a list of tokens. The simplest example of such a thing is the split function, which takes a sentence and returns the list of words in it.

s = "A simple lexer in Python"
s.split()
['A', 'simple', 'lexer', 'in', 'Python']

The problem becomes more complex when you need to separate the tokens you find into different kinds, words and numbers, for instance. We’ll use a well known lyric as our sample text:

s = """99 bottles of beer on the wall, 99 bottles of beer.
Take one down and pass it around, 98 bottles of beer on the wall."
""

The first thing we need to do is build a regular expression that recognizes words and another one that recognizes numbers. Although there are shorter ways to build those regular expressions, I like the less obscure form:

wordsRegex = "[A-Za-z]+"
numbersRegex = "[0-9]+"

We could now use findall on the string and get all the numbers and words out of it.

import re

re.findall(wordsRegex, s)
['bottles', 'of', 'beer', 'on', 'the', 'wall', 'bottles', 'of', 'beer', 'Take', 'one', 'down', 'and', 'pass', 'it', 'around', 'bottles', 'of', 'beer', 'on', 'the', 'wall']

re.findall(numbersRegex, s)
['99', '99', '98']

But wait, you say, that isn’t what we wanted at all! We need to get the tokens in the order of their appearance in the text and still know the type of each token. Something along the lines of

for tokenType, tokenText in lexer(s):
    print tokenType, tokenText

would be really nice.

In order to do that, we’ll need to combine both regular expressions into one and iterate over the result of findall, examining each token to decide on its type.

regex = "(%s)|(%s)" % (wordsRegex, numbersRegex)
'([A-Za-z]+)|([0-9]+)'
re.findall(regex, s)
[('', '99'), ('bottles', ''), ('of', ''), ('beer', ''),
 ('on', ''), ('the', ''), ('wall', ''), ('', '99'),
 ('bottles', ''), ('of', ''), ('beer', ''), ('Take', ''),
 ('one', ''), ('down', ''), ('and', ''), ('pass', ''),
 ('it', ''), ('around', ''), ('', '98'), ('bottles', ''),
 ('of', ''), ('beer', ''), ('on', ''), ('the', ''), ('wall', '')]

As you can see, the result of the call to findall is a list of tuples, each containing a single match. If you look closely at the way I’ve combined the two regular expressions, you’ll see that each part is surrounded by parentheses and that there’s a pipe (|) between the expressions. The compound regular expression matches either a number or a word, and each tuple in the return value of findall contains the match for each parenthesized part of the regexp. However, since we combined the parts using a pipe (|), only one of the parts matches each time; the other comes back as an empty string.

Using that knowledge we can now construct a simple loop that shows the token type for each of the tokens in the lyric:

for t in re.findall(regex, s):
    if t[0]:
        print "word", t[0]
    elif t[1]:
        print "number", t[1]

We now have most of the knowledge we need to build ourselves a lexer that will take a list of regular expressions and some text and return (or even better, generate) a list of tokens and their types. We’ll need to combine the regular expressions for each token into one big regex using pipes, scan the string, and gather the tokens and their types.

Our usage code looks like this:

definitions = [
    ("word", "[A-Za-z]+"),
    ("number", "[0-9]+"),
]

lex = Lexer(definitions)
for tokenType, tokenValue in lex.parse(s):
    print tokenType, tokenValue

And here is the code for the lexer itself:

import re

class Lexer(object):
    def __init__(self, definitions):
        self.definitions = definitions
        # Wrap each token pattern in a named group and join them with pipes.
        parts = []
        for name, part in definitions:
            parts.append("(?P<%s>%s)" % (name, part))
        self.regexpString = "|".join(parts)
        self.regexp = re.compile(self.regexpString, re.MULTILINE)

    def parse(self, text):
        # Yield (token type, lexeme) pairs in order of appearance.
        for match in self.regexp.finditer(text):
            for name, _ in self.definitions:
                m = match.group(name)
                if m is not None:
                    yield (name, m)
                    break

Some notes on the implementation are in order. I’ve used the little-known (?P<name>…) syntax for naming the parenthesized groups of regular expressions. Using that syntax, the expression (?P<word>[A-Za-z]+) matches a word and that match is accessible with match.group('word'), where match is a re.Match object.
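
If you haven’t seen named groups before, here’s a tiny throwaway example (the pattern and text here are just for illustration and aren’t part of the lexer):

import re

m = re.search("(?P<number>[0-9]+) (?P<word>[A-Za-z]+)", "99 bottles")
print m.group("number")   # prints 99
print m.group("word")     # prints bottles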

In order to speed things up a bit, I’ve compiled the regular expression when the Lexer object is created, used the finditer function instead of findall, and made parse a generator instead of a list-returning function.
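
A nice side effect of parse being a generator is that callers can stop early without scanning the rest of the input. For example (itertools.islice isn’t part of the lexer, it’s just used here to take the first few tokens):

from itertools import islice

lex = Lexer(definitions)
# Only enough of the text to produce the first five tokens is scanned.
for tokenType, tokenValue in islice(lex.parse(s), 5):
    print tokenType, tokenValue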

Using this lexer implementation it was quite simple to create a Python-to-HTML converter with syntax highlighting that works well enough to highlight the code of the highlighter itself!
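
To give a rough idea, a highlighter’s token definitions could look something like this. This is only a sketch, not the actual highlighter code linked below; note that the keyword pattern is listed before the name pattern so that keywords win:

python_definitions = [
    ("keyword",  r"\b(def|class|return|if|else|for|while|import|from)\b"),
    ("string",   "\"[^\"]*\"|'[^']*'"),
    ("number",   "[0-9]+"),
    ("name",     "[A-Za-z_][A-Za-z0-9_]*"),
    ("operator", r"[-+*/=<>%(),.:\[\]]+"),
]

highlighter = Lexer(python_definitions)

Anything not matched by a definition (whitespace, for example) is simply skipped by finditer, so a real highlighter needs definitions for everything it wants to preserve in the output.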

The code for the lexer and the syntax highlighter example is available here and on my snippets page. You can also see the result of running the syntax highlighter on itself here.

Enjoy lexing and let me know if you found this useful.

14 Comments on “A simple lexer in Python”


By Andrew Dalke. October 21st, 2007 at 01:48

You might also be interested in the undocumented-and-not-official “scanner” interface in the re module. See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/457664

The PLY parser uses this sort of approach, but it’s hard to find the details because of the extra work it does to do better error reporting. It also works around limitations in re, like not supporting over 100 tokens, and checking that each regular expression is valid before doing the “|” merge. I.e., consider ‘a(b’ and ‘c)d’ as the two input patterns. Separately they are not valid, but together they are, along with an extra ‘()’.
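
For example, checking each pattern on its own before the merge could look something like this (just a sketch, not PLY’s actual code):

import re

def check_patterns(definitions):
    # Compile each pattern separately so a broken one (like "a(b") is
    # reported for that pattern instead of being masked by the "|" merge.
    for name, pattern in definitions:
        try:
            re.compile(pattern)
        except re.error, e:
            raise ValueError("bad pattern for %s: %s" % (name, e))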

By gooli. October 21st, 2007 at 09:14

The scanner interface in the sre module is absolutely cool! It’s similar to my own implementation but more efficient, using the re.Match object’s lastindex property.

Thanks for the info.
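
For reference, parse could be reduced to something like this (a sketch using lastgroup, which gives the matched group’s name directly, while lastindex gives its number; it works as long as the individual token patterns don’t add capturing groups of their own):

    def parse(self, text):
        # Each alternative is a single named group, so the group that
        # matched is also the last matched group.
        for match in self.regexp.finditer(text):
            yield (match.lastgroup, match.group())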

By James Thiele. October 21st, 2007 at 19:55

Have you looked at the tokenize module in the standard Python library?

By kib2. October 21st, 2007 at 22:22

Thanks a lot for this great article, I’ve found it very clear and useful.

By gooli. October 22nd, 2007 at 09:19

@James
I did look at the tokenize module, but it can only tokenize Python code. That’s great if all you need to do is syntax highlight Python code. But it’s not a generic syntax highlighter. The implementation seems to be similar to what I’ve done, although much more complete of course.

By Kumar McMillan. October 24th, 2007 at 23:07

Unfortunately, lexical parsing doesn’t remain “simple” for very long! Last year at PyCon I saw an excellent presentation on aperiot, an easy to use lexical parsing module:
http://moncs.cs.mcgill.ca/people/eposse/projects/aperiot/atglance.dtml
I’ve never had a good reason to try it out though.

By gooli. October 24th, 2007 at 23:14

I don’t think lexing gets that much more complex, at least as long as you aren’t too concerned with performance. There might be some issues with precedence, and you could add support for different start conditions to better handle things like C multiline comments. Semantic parsing, on the other hand, is a much more complicated matter.

That aperiot module looks really cool. Thanks for the link.

By Norbert Klamann. December 3rd, 2007 at 14:08

Are you aware of the ’shlex’ module ?

By gooli. December 3rd, 2007 at 14:45

I am, but apparently it is only suitable for lexing languages that are similar to shell scripts.

By Fırat KÜÇÜK. January 22nd, 2008 at 23:04

good design. congratulations.

By Evan. January 5th, 2009 at 10:34

Wow, very neat. I’ll be using what you’ve shown here in future projects. Maybe a template parser.

You structured this article very well, by the way.

By William. January 28th, 2009 at 08:53

Nice idea, well presented. A good example also of using the “yield” statement.

By Marcin. January 8th, 2010 at 22:58

I like your trick of creating this one humongous ORed regexp that combines all of the other expressions. I don’t think it generalizes very well for very complicated regexps (i.e. regexps that contain ORs in them), but still it’s pretty neat.
