Common Lisp Recursive Descent Parser Interpreter and Generator

Recently I've been having a lot of fun learning and hacking with Common Lisp. Two of the more interesting (and not particularly elegant) pieces of code I've written include a recursive descent parser interpreter and a recursive descent parser generator. I'm happy with these two programs because both work (although the interpreter might have a few quirks) and because they were simple to make (despite the large amount of code).

Interpreter

The interpreter takes patterns (written in lisp) and at run time goes through the parts of the patterns and generates functions that can parse certain parts of text. Everything is considered a pattern function, be it a string or character or an explicit call to a pattern operator such as OR or FIND-NEXT.

The way the interpreter matches text is fairly simple: each function that is generated takes two parameters: the current list of matches, and the remaining text that hasn't been seen yet. These functions then return three parameters: the updated list of matches, the remaining text that hasn't been seen yet, and whether or not the function succeeded in matching something. The following is a simplified diagram of how the interpreter goes through the a pattern and matches text against the current part of the pattern it is looking at.

Recursive Descent Interpreter Session

As is clear from the image, there can be no back-tracking of any sorts because every matching function gets either the same or a smaller version of the text than the previous one. This fulfills the descent component of the parser: it always goes down by consuming text and cannot go back up. The interpreter is recursive in two ways: the way it interprets the patterns and creates matching functions is recursive and because this is done on-the-fly the pattern matching is inherently recursive because of this. It is also recursive because one can define multiple different pattern functions and call them from within other pattern functions.

Parser Generator

The parser generator works in a similar way but is much less elegant than the interpreter in terms of how it works. At compile time, macros do a depth-first expansion of the pattern operators into blocks of code and end up making a large chunk of code representing a pattern-matching function. Each block of code returns three values: any matches that the block found, whether or not the block matched anything, and whether or not the match consumed any text. It is up to any parent operators to deal with these values, and it's usually the case that the matches bubble up to the top-level of a pattern function and are handled there.

The three values reported by each block are important for several reasons. Pattern operators such as OR can sometimes always return a positive match but not consume any text. Consider this pattern: (maybe "hello"). This is translated into to the OR operator: (or "hello" ""). As can be seen, if the operator fails to match "hello" then it will always succeed at matching "" (an empty string). When used in the REPEAT or FIND-NEXT operators this can be dangerous, so those operators require that the patterns match as well as consume.

At first glance the matched boolean seems redundant given that each block also returns a list of matched values. The matched value is required because of the NO-CAPTURE operator, which allows the pattern to match something but not record the match. This operator is very useful because it's often required that we match some part of a pattern but where that specific part gives us no useful information. For example, when matching XML-like tags, the < and > symbols delimit tags but do not represent useful match information.

Example: Splitting Up an HTML Document

The following pattern functions properly split up some text by HTML tags. Note: the pattern functions do not validate the HTML.

When we apply the above patterns to some HTML (from the older version of this blog) using the parser generator we get the following code and parse tree:

prefix text
<div class="post">
    <a name="comment-{$comment.comment_ID}" id="comment-{$comment.comment_ID}"></a>
    {$comment.content}
    <ul class="categories">
        by 
        <cond:if var="comment.comment_author_url" neq="">

            <a href="{$comment.comment_author_url}" title="{$comment.comment_author}">
                {$comment.comment_author}
            </a>
        <cond:else />
            {$comment.comment_author}
        </cond:if>

        on {$comment.date}
    </ul>
</div>
postfix text

The Code

You can find the code for the parser interpreter and parser generators at the following locations:

One thing to note is that I am new to common lisp and I recognize that some of the code is not exceptionally attractive. If any Common Lisp hackers read this post then please feel free to comment on how I might improve the coding style! A final note is to say that the interpreter is coded in a more functional way whereas the parser generator is much more procedural.


Comments

There are no comments, but you look like you have something on your mind.


Comment