I/O Reader http://ioreader.com/feed Peter Goodman's blog about computer programming. Sat, 17 Jul 2010 14:46:19 GMT en <![CDATA[PEGs and Left Recursion]]> http://ioreader.com/2010/07/17/pegs-and-left-recursion http://ioreader.com/2010/07/17/pegs-and-left-recursion#comments Sat, 17 Jul 2010 14:46:19 GMT Peter Goodman 2k Parsing Expression Grammars (PEGs) have been an interest of mine for some time now. Last summer (2009) I made a parser generator using the C programming language that parsed using a grammar language with the same matching power as PEGs but with some more expressive (but basic) tree-building constructs. Unfortunately, parsers generated with my parser generator sometimes ran into trouble when grammars used left recursion; specifically: indirect left recursion.

Recently a post came up on the PEG mailing list that made me remember the work I had done last summer and it re-ignited my interest in using PEGs with left-recursive grammars. While reading that post, a fairly simple formulation of how to handle both direct and indirect left recursion occurred to me. The formulation itself is similar to that of Warth et al. in that it detects left recursion and re-executes any rule that is the root of left recursion. The difference is that the cache / memo table does not need to be treated differently for left recursive invocations of the grammar's productions.

Definitions

For the sake of simplicity, I will deal with a subset of the PEG specification that is less powerful than full PEGs because it lacks the followed-by and not-followed-by operators (& and !, respectively). With this in mind, define a grammar for a language as follows:

  • A set of variables. A variable is a name given to a set of structures within a language. For example: imagine that we want to describe the language M of mathematical expressions over the integers and that we want to assign the variable Summation to all expressions formed by either adding or subtracting two expressions in M. If A and B are two valid expressions in M then an expression that is structured as A + B or A - B is identified by the Summation variable.
  • A set of terminals. Words in a language are sequences of terminals. If our language is the set of English words then the terminals of the word hello are the letters h, e, l, l, and o. If our language is C then the terminals include return, goto, int, etc.
  • A set of symbols. The set of symbols is the union of the set of variables and the set of terminals. We will assume that the set of terminals and the set of variables are disjoint.
  • A set of productions. Each production is a sequence of zero or more symbols. A production represents a pattern or structure to match. The variables within the production represent sub-structures to match. For convenience, we define a function symbols[G] mapping a production Pi to its sequence of zero or more symbols.
  • A function productions[G] (for some grammar G) mapping variables to a sequence of zero or more productions. A variable that is mapped to an empty sequence of productions will always match. We will use |productions[G][V]| to represent the number of productions for some grammar G and some variable V.
  • A parser is a procedure that determines whether or not a string of terminals can be generated / accepted by a grammar. Parsers usually do more than this: they often build parse trees or interpret the terminals directly. For the purpose of this article, the function of a parser will be to return the number of terminals of a string that can be generated by a grammar.

Define a string as a sequence of zero or more terminals.
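To make the definitions above concrete, here is a minimal Python sketch of one possible encoding. The dictionary layout, the type aliases, and the helper names variables, terminals, and symbols are assumptions made for illustration; the Summation grammar is a hypothetical instance of the example language M.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: a symbol is a string. A symbol is a variable if it is a
# key of the grammar; every other symbol is a terminal.
Production = Tuple[str, ...]           # a sequence of zero or more symbols
Grammar = Dict[str, List[Production]]  # productions[G]: variable -> productions

# Hypothetical grammar: Summation -> Number + Number / Number - Number
G: Grammar = {
    "Summation": [("Number", "+", "Number"), ("Number", "-", "Number")],
    "Number": [("1",), ("2",)],
}


def variables(g: Grammar) -> Set[str]:
    return set(g)


def terminals(g: Grammar) -> Set[str]:
    return {s for prods in g.values() for p in prods for s in p if s not in g}


def symbols(g: Grammar) -> Set[str]:
    # The union of two sets that are disjoint by construction.
    return variables(g) | terminals(g)


print(sorted(terminals(G)), len(G["Summation"]))
```

Keeping each variable's productions in a list, rather than a set, preserves the order in which they are to be tried.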

The First Parser

We can begin by defining a simple parser for a language described by an (ordered) grammar by the following relatively straightforward pseudocode procedure:

parse(grammar: G, variable: var, string: str, int: offset=0):
    int: start_offset = offset
    bool: last_production_failed = false
    for production: P in productions[G][var]:
        last_production_failed = false
        for symbol: S in symbols[G][P]:
            if S is a terminal:
                if terminal(S) is terminal(str[offset]):
                    offset = offset + 1
                else:
                    go to label try_next_production
            else:
                variable: prod_var = variable(S)
                try:
                    offset = parse(G, prod_var, str, offset)
                catch Error:
                    go to label try_next_production
        go to label done
        label try_next_production:
            last_production_failed = true
            offset = start_offset
    if last_production_failed:
        raise Error
    label done:
        return offset

The behaviour of this parser is fairly straightforward. Given a grammar, a starting variable, and a string, it will work its way through the string and either return an integer representing how much of the string was parsed or raise an Error. If the sequence of symbols for a given production is empty then the inner loop will not execute and the procedure will return the offset passed into it, effectively matching nothing. If the sequence is non-empty then matching terminal symbols will cause the parser to move forward, and matching non-terminal symbols will result in a recursive call to the parse procedure that updates the string offset to the terminal immediately following the last terminal matched by the call to parse. If matching a terminal or a non-terminal fails then the offset is reset to the offset that was passed into the current invocation of parse and stored in start_offset.
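The same procedure can be sketched as runnable Python. This is an illustration under assumptions rather than the generator's actual code: the grammar is a dictionary from variable names to lists of symbol tuples, the labels are replaced by exceptions and continue, a bounds check is added before reading the string, and the grammar S/A below is a hypothetical example.

```python
class ParseError(Exception):
    """Raised when no production of a variable matches."""


def parse(grammar, var, text, offset=0):
    """Return the offset just past the terminals matched by `var`."""
    start_offset = offset
    productions = grammar[var]
    if not productions:
        return start_offset  # no productions: matches nothing and succeeds
    for production in productions:
        offset = start_offset  # backtrack before each alternative
        try:
            for symbol in production:
                if symbol in grammar:  # variable: recurse
                    offset = parse(grammar, symbol, text, offset)
                elif offset < len(text) and text[offset] == symbol:
                    offset += 1  # terminal matched
                else:
                    raise ParseError(symbol)
            return offset  # every symbol of this production matched
        except ParseError:
            continue  # ordered choice: try the next production
    raise ParseError(var)


# Hypothetical grammar: S -> A x / A b ; A -> a A / a
GRAMMAR = {"S": [("A", "x"), ("A", "b")], "A": [("a", "A"), ("a",)]}
print(parse(GRAMMAR, "S", "aab"))
```

On the input aab the first production of S matches A and then fails on x, so the second production has to match A all over again.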

Unfortunately, this parser suffers from one obvious flaw: it risks repeating a lot of work whenever several productions tried at the same point in the string share a common prefix. Consider the following BNF-like grammar, but with ordered choice:

START0 → A x
START1 → A b

A0 → a A a
A1

When this grammar is executed using the above parse procedure on the string aaaab, START0 successfully matches A at offset 0 but then fails to match x. START1 is then tried and matches A (again) at offset 0, repeating all of the work that was already done for START0.

The Second Parser

We can solve the problem of repeating work by remembering work that we have already done. This is how PEG parsers parse strings in time that is linear in the number of terminals in the string and the number of variables in the grammar. We can use a two-dimensional table to represent our memory. The first dimension will be accessed by a variable name. The second dimension will be accessed by the offset of a terminal in a string. This table will be referenced by the name cache in the following code:

(variable × int) → (int + Error): cache

parse(grammar: G, variable: var, string: str, int: offset=0):
    int: start_offset = offset
    bool: last_production_failed = false
    for production: P in productions[G][var]:
        last_production_failed = false
        for symbol: S in symbols[G][P]:
            if S is a terminal:
                if terminal(S) is terminal(str[offset]):
                    offset = offset + 1
                else:
                    go to label try_next_production
            else:
                variable: prod_var = variable(S)
                if (prod_var, offset) in cache:
                    if cache[prod_var][offset] is Error:
                        go to label try_next_production
                    else:
                        offset = cache[prod_var][offset]
                else:
                    try:
                        offset = parse(G, prod_var, str, offset)
                    catch Error:
                        go to label try_next_production
        go to label done
        label try_next_production:
            last_production_failed = true
            offset = start_offset
    if last_production_failed:
        cache[var][start_offset] = Error
        raise Error
    label done:
        cache[var][start_offset] = offset
        return offset

The modifications to the original parsing procedure are few but powerful. Before executing a recursive call to the parse procedure, a check is done to see if the parse procedure has already been called at the desired offset for the specific variable. If the result of the previous call is an Error then the next production is tried. If the result is an integer then we assign that result to offset and move on.
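A runnable Python sketch of the memoized procedure follows. It makes the same assumptions as before (dictionary grammar, exceptions instead of labels) and adds two of its own: the cache is checked on entry to parse rather than before each recursive call, which has the same effect, and an optional trace list records every application that was actually computed. FAIL, trace, and the grammar are hypothetical names for this illustration.

```python
class ParseError(Exception):
    """Raised when no production of a variable matches."""


FAIL = object()  # sentinel stored in the cache for a failed application


def parse(grammar, var, text, offset=0, cache=None, trace=None):
    cache = {} if cache is None else cache
    key = (var, offset)
    if key in cache:  # this (variable, offset) pair was already tried
        if cache[key] is FAIL:
            raise ParseError(var)
        return cache[key]
    if trace is not None:
        trace.append(key)  # record each application that is really computed
    # A variable with no productions matches nothing and succeeds.
    result = offset if not grammar[var] else FAIL
    for production in grammar[var]:
        pos = offset
        try:
            for symbol in production:
                if symbol in grammar:
                    pos = parse(grammar, symbol, text, pos, cache, trace)
                elif pos < len(text) and text[pos] == symbol:
                    pos += 1
                else:
                    raise ParseError(symbol)
        except ParseError:
            continue  # ordered choice: try the next production
        result = pos
        break
    cache[key] = result  # remember success and failure alike
    if result is FAIL:
        raise ParseError(var)
    return result


# Hypothetical grammar: S -> A x / A b ; A -> a A / a
GRAMMAR = {"S": [("A", "x"), ("A", "b")], "A": [("a", "A"), ("a",)]}
trace = []
print(parse(GRAMMAR, "S", "aab", trace=trace), trace)
```

Each (variable, offset) pair appears in trace at most once, so the second production of S reuses the cached result for A at offset 0 instead of recomputing it.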

Still, there remains a problem: the parse procedure will not terminate if the grammar contains left recursion. For example, the following grammar should be able to parse the string aaa; however, because of the property of ordered choice, the parser never stops calling parse with START0 at offset zero.

START0 → START a
START1

The Final Parser

How can we prevent this behaviour when the grammar contains left recursion? One surprisingly simple way to handle left recursion is to fail! Suppose that we could detect a (direct or indirect) left recursive production application on some variable V. If the previous application of V was the production Vi then for the current application of V, we can pretend to fail Vi and apply Vi+1 if it exists. If Vi+1 does not exist then we simply fail to apply V and record the failure in cache. The intuition is that we expect that a left recursive application should eventually match something, i.e. it should have at least one base case. If a base case matches then we can imagine collapsing the parser stack (easier with an explicit stack) until we hit the root of the current left recursive invocation; we can then substitute the base case in as the result of applying the variable left-recursively and continue on past it. The trick is to continue growing this left recursive root by substituting in previous invocations until no forward progress is made. Warth et al. describe this process in their paper as growing the left recursive seed.
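The idea of failing the left recursive application and then growing the result can be illustrated with a short Python sketch. It is a simplified seed-growing recognizer in the style of Warth et al., restricted to direct left recursion; it is not the stack-based formulation described in this post, and it does not handle indirect left recursion. The names make_parser, apply_rule, and evaluate, and both grammars, are assumptions for illustration. Failure is represented by None.

```python
def make_parser(grammar, text):
    memo = {}  # (variable, offset) -> end offset, or None for failure

    def evaluate(var, pos):
        """Try the productions of `var` in order; None if all of them fail."""
        for production in grammar[var]:
            offset = pos
            for symbol in production:
                if symbol in grammar:
                    offset = apply_rule(symbol, offset)
                elif offset < len(text) and text[offset] == symbol:
                    offset += 1
                else:
                    offset = None
                if offset is None:
                    break  # this production failed; try the next one
            else:
                return offset  # every symbol matched
        return None

    def apply_rule(var, pos):
        key = (var, pos)
        if key in memo:
            return memo[key]
        memo[key] = None  # seed: a left recursive re-entry fails at first
        best = None
        while True:
            result = evaluate(var, pos)
            if result is None or (best is not None and result <= best):
                break  # no forward progress: stop growing
            best = result
            memo[key] = best  # the next pass sees the longer match
        return best

    return apply_rule


# Hypothetical grammars; the base cases are chosen for this example only.
START = {"START": [("START", "a"), ("a",)]}
EXPR = {"E": [("E", "-", "N"), ("N",)], "N": [("1",), ("2",), ("3",)]}
print(make_parser(START, "aaa")("START", 0))
print(make_parser(EXPR, "1-2-3")("E", 0))
```

Every rule is evaluated at least twice in this sketch, once to obtain a result and once to discover that it can no longer grow. That keeps the code short at the cost of some redundant work.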

How can we detect left recursion? It is surprisingly simple, in fact. If the last application of a particular variable is at the same terminal offset in the string as the current application of the variable then we are applying the variable left recursively. If we maintain a stack of production applications and their terminal offsets for each variable then we can figure out whether we are applying a production left recursively by peeking at the top of the stack.

What is more important, however, is that the stack allows us to identify the root of a left recursive invocation. Suppose that for each variable we maintain a stack of type production × bool × bool × int representing the production being applied, whether or not the production is left recursive, whether or not the production is the root of left recursion, and the terminal offset to which the production was applied, respectively. We can detect left recursion and the left recursive root as follows:

is_left_recursive(Stack[production × bool × bool × int]: stack, int: offset):
    match top(stack):
        case _, True, _, offset:
            return True
        case _, False, _, offset:
            top(stack).is_left_recursive = True
            top(stack).is_left_recursive_root = True
            return True
    return False

The above procedure can be used to determine whether or not a production that is about to be applied is left recursive given the stack of previous applications of this production's variable and the offset at which the production is being applied. The procedure might also update the element on the top of the stack to be left-recursive and be the root of left recursion.
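For readers who prefer something executable, the check can be sketched in Python. The Frame dataclass and its field names mirror the four-tuple described above but are assumptions for illustration, as is treating an empty stack as "not left recursive".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    production: int               # index of the production being applied
    is_left_recursive: bool
    is_left_recursive_root: bool
    offset: int                   # terminal offset of the application


def is_left_recursive(stack: List[Frame], offset: int) -> bool:
    """Check the most recent application of this variable for the same offset."""
    if not stack:  # assumption: no earlier application, so no recursion
        return False
    top = stack[-1]
    if top.offset != offset:
        return False
    if not top.is_left_recursive:
        # First time recursion is seen here: this frame becomes the root.
        top.is_left_recursive = True
        top.is_left_recursive_root = True
    return True


stack = [Frame(production=0, is_left_recursive=False,
               is_left_recursive_root=False, offset=0)]
print(is_left_recursive(stack, 0), stack[-1].is_left_recursive_root)
```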

Making a parse procedure work correctly according to the idea of recognizing and growing left recursion can be subtle, and I think is easiest to do by managing the parsing stack explicitly. Instead of including pseudo code, I defer to my toy C++ implementation of the above ideas.

I think that using explicit stacks of production application information for each variable in the grammar is the easiest way to recognize left recursive invocations and the roots of those invocations, as evidenced by is_left_recursive.

]]>
<![CDATA[Summer Update]]> http://ioreader.com/2010/05/10/summer-update http://ioreader.com/2010/05/10/summer-update#comments Mon, 10 May 2010 13:39:43 GMT Peter Goodman 2j It is summer time and that means time to push out a few blog articles. This summer I will again be working with Prof. Sheng Yu but this time on the Grail+ project. In my spare time, I will be working on a simple compiler written in C++.

This past academic year went exceptionally well for me and I hope to get around to reporting on some of the more interesting projects that I worked on. I think the most interesting one was an AI to play the game Gomoku.

That is it for now; I hope to get around to writing some posts soon.

]]>
<![CDATA[The Small Core of Parsing Expression Grammars (PEGs)]]> http://ioreader.com/2009/09/03/the-small-core-of-parsing-expression-gra http://ioreader.com/2009/09/03/the-small-core-of-parsing-expression-gra#comments Thu, 03 Sep 2009 16:40:53 GMT Peter Goodman 2i Informal Definition of PEGs

Per the Wikipedia definition, a PEG is similar to a Context-free Grammar (CFG) insofar as both describe the structure of a language recursively by means of rewrite rules. PEGs, however, are also quite different from CFGs.

PEGs introduce the idea of ordered choice when rewriting a variable that appears in a rule. CFGs on the other hand have no such concept of ordered choice when rewriting a variable and so all rules must be evaluated at once, or at least tested. For this reason, non-deterministic pushdown automata (PDA) are ideal machines for recognizing the languages accepted by a given CFG. More formally, let GP be some PEG and GC be some CFG and have both share the same alphabet, Σ, and the same set of variables, V.

Let RP be the relation of variables to rewrite rules for GP and RC be the relation of variables to rewrite rules for GC. Let P be the set of all possible rewrite rules, (V ∪ Σ)*. Then RC: V → ℘(P) and RP: V → P*. Clearly, RC relates each variable to a set of sequences (unordered), whereas RP relates each variable to a sequence of sequences (ordered).
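The difference between the two relations can be shown with a small Python sketch. The names R_C, R_P, swapped, and match are assumptions for illustration: a set of rules has no order, while a tuple of rules does, and under ordered choice that order changes what gets matched.

```python
# R_C: each variable maps to a *set* of rules, so order carries no meaning.
R_C = {"A": {("a", "A"), ("a",)}}
# R_P: each variable maps to a *sequence* of rules, tried first to last.
R_P = {"A": (("a", "A"), ("a",))}
swapped = {"A": (("a",), ("a", "A"))}  # same rules, opposite order


def match(rules, var, text, pos=0):
    """Ordered-choice prefix match: offset after the match, or None."""
    for rule in rules[var]:
        cur = pos
        for sym in rule:
            if sym in rules:
                cur = match(rules, sym, text, cur)
            elif cur < len(text) and text[cur] == sym:
                cur += 1
            else:
                cur = None
            if cur is None:
                break  # this rule failed; try the next one
        else:
            return cur  # the first rule that matches wins
    return None


print(R_C["A"] == {("a",), ("a", "A")})  # sets ignore order
print(match(R_P, "A", "aaa"), match(swapped, "A", "aaa"))
```

With the recursive rule first, the whole of aaa is consumed; with the rules swapped, the first rule succeeds after a single a and the second is never tried.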

Ordered choice also hints at one of the key differences between the interpretation of PEGs and CFGs: PEGs describe how to parse a string and thus check if said string belongs to the language accepted by the PEG, whereas CFGs describe how to generate every string in the language accepted by the PDA equivalent to the CFG.

Finally, PEGs are very expressive and have two particularly interesting unary predicates that CFGs do not: followed-by (&) and not-followed-by (!). You can find a description of all of the operators defined for PEGs in the above linked Wikipedia article.

Top-Down Parsing Language (TDPL)

The TDPL is a strict subset of a PEG and was invented before PEGs. Everything that can be expressed by a TDPL grammar can thus be expressed by a PEG; however, the converse is not true: PEGs feature the aforementioned predicate operators and those cannot be expressed in the TDPL.

Prolog's Cut Operator

Suppose we introduce Prolog's cut operator into the TDPL. We can formally describe Prolog's cut as a parsing operator as follows: Let ν ∈ V be some variable of P and RP(ν) = (S1, S2, …, Sj, …, Sn) be the rewrite rules for ν. Suppose we are currently parsing according to the rule Sj. The continuation of the parser for ν, should Sj fail at the current position in the string, is (Sj+1, …, Sn). If we reach a cut operator in Sj then we replace the current parser continuation from Sj with the empty sequence, (), and move to the next symbol in Sj. Another interpretation of the cut operator is that it disables the parser's ability to backtrack and retry parsing with the rule Sj+1 if Sj fails at any point after a cut has been reached.

We will extend the TDPL with the cut operator and call this language !TDPL (for the sake of distinguishing it from the TDPL in this article). We want to prove the equivalence in matching power of the !TDPL and PEGs. Let T be some !TDPL grammar, P be some PEG, and s be some string. We will prove by construction the equivalence in matching power by defining each of the cut operator and the (not-)followed-by operators in terms of the other. The intention is to show that the !TDPL is the "small core" of a PEG.

If: s ∈ L(T) ⇒ s ∈ L(P)

For this half of the proof, we need only define the followed-by and not-followed-by predicates in terms of the cut operator. Suppose the following rules exist in P:

  • B → !b

  • A → &a

We can define equivalent rules in T as follows. Notice that the followed-by predicate is equivalent to the negation of the not-followed-by predicate, or rather: not-followed-by ◦ not-followed-by. Thus, it is necessary only to prove that the cut is just as expressive as the not-followed-by operator in order to prove !TDPL and PEG equivalence. Interestingly, this implies that the TDPL with the addition of a not-followed-by predicate is as small of a core as the !TDPL. In the rules below, ƒ stands for an expression that always fails and ε for the empty string.

  • B → b ! ƒ / ε

  • A → C ! ƒ / ε
  • C → a ! ƒ / ε
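A small Python model can be used to check these three rules. It is a sketch under assumptions: parsers are functions from (text, position) to an end position or None, cut_rule models only a rule of the shape X → B ! C / D with a single cut, and ƒ is read as an expression that always fails. All names are hypothetical.

```python
def lit(s):
    """Match the literal string s at the current position."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def fail(text, pos):
    return None  # assumed reading of ƒ: always fails


def eps(text, pos):
    return pos  # ε: matches the empty string


def cut_rule(b, c, d):
    """Model of X -> B ! C / D: once B matches, D is no longer an option."""
    def run(text, pos):
        after_b = b(text, pos)
        if after_b is not None:
            return c(text, after_b)  # committed by the cut
        return d(text, pos)
    return run


not_b = cut_rule(lit("b"), fail, eps)    # B -> b ! ƒ / ε, i.e. !b
c_rule = cut_rule(lit("a"), fail, eps)   # C -> a ! ƒ / ε, i.e. !a
and_a = cut_rule(c_rule, fail, eps)      # A -> C ! ƒ / ε, i.e. !!a = &a

print(not_b("bx", 0), not_b("ax", 0))
print(and_a("ax", 0), and_a("bx", 0))
```

Both predicates succeed without consuming input: not_b returns the position it was given when b does not follow, and and_a does the same when a does follow.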

Only-If: s ∈ L(T) ⇐ s ∈ L(P)

Suppose the following rule exists in T:

  • A → B ! C / D

We can define an equivalent rule in P as follows:

  • A → !B D / B C
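The claimed equivalence can be checked by brute force on a small example. The Python sketch below models both rules as functions from (text, position) to an end position or None and compares them on every string over {a, b, c} of length at most three. The combinator names and the particular choices of B, C, and D are assumptions for illustration, and a check on one instance is not a proof.

```python
from itertools import product


def lit(s):
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def seq(*parsers):
    def run(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos
    return run


def choice(*parsers):
    def run(text, pos):
        for parser in parsers:  # ordered choice: first success wins
            end = parser(text, pos)
            if end is not None:
                return end
        return None
    return run


def not_followed_by(parser):
    return lambda text, pos: pos if parser(text, pos) is None else None


def cut_rule(b, c, d):
    """Model of A -> B ! C / D: once B matches, D is no longer an option."""
    def run(text, pos):
        after_b = b(text, pos)
        if after_b is not None:
            return c(text, after_b)
        return d(text, pos)
    return run


B, C, D = lit("a"), lit("b"), choice(lit("ac"), lit("c"))
with_cut = cut_rule(B, C, D)                           # A -> B ! C / D
as_peg = choice(seq(not_followed_by(B), D), seq(B, C))  # A -> !B D / B C

inputs = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]
agree = all(with_cut(s, 0) == as_peg(s, 0) for s in inputs)
print(agree)
```

The input ac is the interesting case: D could match it, but B matches first, so both formulations commit to C and fail.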

Generalized TDPL (GTDPL)

Interestingly, all of this has been shown before but not with the cut operator. PEGs have an alternate simple core as proven by Brian Ford. It is the GTDPL (information can be found on the GTDPL in the TDPL Wikipedia entry), another simple extension of the TDPL.

The GTDPL defines an if-then-else mechanism using the following syntax:

  • A → B[C, D]

This is equivalent to the !TDPL rule A → B ! C / D from the previous section.

Conclusion

All of this has, in essence, been an exercise in tedium. It was not necessary to re-express the GTDPL in another way; however, it was fun. While making my PEG parser generator earlier this summer (it was an interest project; the program is not meant for serious use) I ended up "discovering" that I could have all of the expressive power of PEGs using the simple-to-implement cut operator as opposed to implementing all of the PEG operators. This, coupled with some tree-building operators, allowed me to generate the same parse trees that the equivalent PEG would. Had I done more research at the time, I would have realized that I had not really discovered a more succinct core to the PEG language but instead re-expressed a well-established one using a construct from a language that was familiar to me: Prolog. Nevertheless, the process was fun and rewarding!

]]>
<![CDATA[StateJava Pre-Processor]]> http://ioreader.com/2009/08/24/statejava-preprocessor http://ioreader.com/2009/08/24/statejava-preprocessor#comments Mon, 24 Aug 2009 09:49:41 GMT Peter Goodman 2h I have been doing some work every now and then over the course of the summer for one of my professors, Prof. Sheng Yu, at The University of Western Ontario. The work is based off of one of his papers, Adding States into Object Types, where he develops the τ-calculus as an extension to the ς-calculus of Abadi and Cardelli.

My work was to develop a pre-processor for the Java™ programming language that turns classes into state machines. A description of how the pre-processor works can be found here: http://ioreader.com/code/python/state-java-pre-processor.pdf.

package example;

public class Auto {
    states { :Forward, :Neutral, :Reverse, :Off }
    
    public Auto() :Off { 
        System.out.println("AUTO: Off (constructor)"); 
    }
    
    /* the turnOff method shows a motivation for having the 
       ability to remember a previous state. */
      
    public void turnOn() :Off -> :Neutral { 
        System.out.println("AUTO: On (Neutral)."); 
    }
    
    /* the following methods are a motivation for having 
       the ability to perform set operations, such as 
       difference, on states. */
    
    public void turnOff() :Forward, :Neutral, :Reverse -> :Off { 
        System.out.println("AUTO: Off."); 
    }
    
    /* this will be specialized by Car::reverse() for the 
       transition :Forward -> :Reverse */
    public void reverse() :Forward, :Neutral -> :Reverse { 
        System.out.println("AUTO: Reverse."); 
    }
    
    public void neutral() :Forward, :Reverse -> :Neutral { 
        System.out.println("AUTO: Neutral."); 
    }
    
    /* the following two methods will be merged into one in the
       compiled version. */
    
    public void forward() :Neutral -> :Forward { 
        System.out.println("AUTO: Forward (from Neutral)."); 
    }
    
    public void forward() :Reverse -> :Forward { 
        System.out.println("AUTO: Forward (from Reverse)."); 
    }
}
package example;
import example.Auto;

public class Car extends Auto {
    
    public Car() :Off { 
        System.out.println("CAR: Off (constructor)"); 
    }
      
    /* example of state specialization when overriding
       parent class methods. */
    
    public void reverse() :Forward -> :Reverse { 
        System.out.println("CAR: Reverse (from Forward)."); 
    }
}
package example;
import example.Car;

public class Test {
    static public void main(String[] args) {
        Car car = new Car();
        car.turnOn();
        car.forward();
        car.reverse(); /* Car::reverse() specialization */
        car.neutral();
        car.forward();
        car.reverse(); /* Car::reverse() again: the state is :Forward */
        car.turnOff();
    }
}

When the above code is run through the pre-processor, actual Java™ code is produced. The processed code can then be compiled and run as usual. The following is the output from running the Test class above:

AUTO: Off (constructor)
CAR: Off (constructor)
AUTO: On (Neutral).
AUTO: Forward (from Neutral).
CAR: Reverse (from Forward).
AUTO: Neutral.
AUTO: Forward (from Neutral).
CAR: Reverse (from Forward).
AUTO: Off.

The code for the pre-processor can be found here: http://ioreader.com/code/python/state-java/

]]>
<![CDATA[Self-Descriptive Grammars]]> http://ioreader.com/2009/06/18/self-descriptive-grammars http://ioreader.com/2009/06/18/self-descriptive-grammars#comments Thu, 18 Jun 2009 19:57:54 GMT Peter Goodman 2g
  • Languages have structure.
  • We can make a new language, X, to describe the structure of languages.
  • Language X itself has a structure and as a result the structure of language X can be described.
  • We made language X to describe the structure of languages.
  • Therefore, language X can describe its own structure.

]]>