I/O Reader

RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, Applications, Javascript, Functional Programming,

Pattern Matching in Grail+

My undergraduate thesis work on Grail+ is almost complete. As part of this work, I developed the formal language template library (FLTL). The library contains generic representations for context-free grammars (CFGs), non-deterministic pushdown automata (PDAs), and epsilon non-deterministic finite automata (NFAs). The focus of the FLTL is to provide the basic operations needed to symbolically manipulate the aforementioned formalisms—not raw performance. This post will focus on a particularly powerful feature of the FLTL: pattern matching.

Pattern Matching in CFGs

The following are some examples of CFG production patterns that I use in the Earley parser of Grail+. They will be explained below:

// variables used in the patterns
unsigned dot;
variable_type A;
variable_type B;
terminal_type a;
production_type prod;

pattern_type scanner_known((~A) --->* cfg.__(dot) + a + cfg.__);
pattern_type scanner_unknown((~A) --->* cfg.__(dot) + ~a + cfg.__);
pattern_type predictor((~A) --->* cfg.__(dot) + ~B + cfg.__);
pattern_type completer((~A) --->* cfg.__(dot));

generator_type predictor_related(cfg.search(~prod, B --->* cfg.__));
pattern_type completer_related(cfg._ --->* cfg.__(dot) + A + cfg.__);

    

There are two key types here: pattern_type and generator_type. An individual pattern, represented by pattern_type, is able to match, destructure, and bind variables to parts of a production. A generator, represented by generator_type, is a more general abstraction (similar to iterators) that can operate using patterns, among other things.

A number of different syntactical elements are being used: cfg._, cfg.__, cfg.__(dot), --->*, a, ~A, and +.

cfg._
This represents an arbitrary symbol. A symbol is either a variable/non-terminal or a terminal of the grammar on which we are operating. This pattern component will successfully match against any symbol, but will not bind the value of the symbol to anything.
cfg.__
This represents an arbitrary symbol string of any length. This pattern component will always match, even if one is attempting to match an empty symbol string, and will never bind the matched string to anything.
cfg.__(dot)
This represents an arbitrary symbol string of length exactly dot, where dot is an unsigned integer. Like cfg.__, the substring matched is never bound to any variable. Note: the value of dot is not passed by value. dot is passed by reference and a pointer to dot is stored. As such, the value of dot is resolved each time the pattern is invoked. That is, as dot changes, so does the pattern. Clearly, if dot lives on the stack and a pattern is returned to a lower stack frame then one will encounter undefined behaviour. This will be a recurring theme—the FLTL does not guard against such undefined behaviour as that would take away from the expressiveness of the pattern-matching facilities.
--->*
This abuse of C++'s post-decrement and member-pointed-to-by-object-pointed-to operators signifies the → relation of CFGs that relates grammar variables to strings of terminals and non-terminals to form the productions of a grammar. The -- operator creates and allocates the actual pattern data and the ->* operator adds in the first pattern part to the pattern.
a
The variable a, as part of a pattern, represents a bound variable. As the pattern is being matched against a string, if the next symbol in the string is the same as the symbol whose value a has then the pattern part succeeds. That is, if a is the terminal t, then the pattern part a only succeeds if it is being matched against t in the production. Similar to cfg.__(dot), a pointer to a is stored rather than the value of a so that the pattern sees all changes to a. Bound pattern parts don't only work for symbols and stack variables: they also work for symbol strings and references to symbols stored elsewhere.
~A
The variable A has type variable_type and so this expression represents an unbound variable. As the pattern is being matched against a string, if the next symbol to be matched is a variable and the next pattern part is an unbound variable then we succeed at matching the pattern part and bind the variable symbol to the unbound variable. That is, if ~A appears in a pattern and is the pattern part that must match "S" in the string "αSβ" and "S" is a variable then the variable A will be assigned the value "S". Similar to cfg.__(dot), a pointer to A is stored in order to perform this binding. If A had type terminal_type then it can only be bound to grammar terminals. If A had type symbol_type then it can be bound to any grammar symbol.
+
The implied meaning of the overloaded infix + operator represents pattern part concatenation. The actual meaning is juxtaposition, as no real concatenation is being done behind the scenes. The + operator can only be used on the right-hand side of a --->*. Outside of patterns, the + operator has the explicit meaning of concatenation. Thus, for the pattern "A --->* a + b + c + d", the subtle change of "A --->* a + (b + c) + d" is a completely different pattern that behaves differently and is potentially unsafe. It is unfortunate that patterns depend so heavily on operator precedence; however, such is life when one is abusing operator overloading.

Using Patterns

Patterns can be used "literally" or assigned to named locations and used multiple times. The below example is from my Earley parser where I use the above named locations for patterns to decide which action the parser should take. Note that the first thing I do is change the value stored at the location named by dot. Every pattern sees this change in value.

// move the dot
dot = item->dot;

// the item has the form A → … • B …
if(predictor.match(curr_item->production)) {

    // if B is nullable then add A → … B • … to
    // the item set
    if(is_nullable[B.number()]) {
        // …
    }

    // …

// the item has the form A → … •
} else if(completer.match(curr_item->production)) {

    // …

// try to "solve" this terminal
} else if(not_at_end) {

    // we don't know the terminal of this lexeme, lets
    // see if we can substitute a variable terminal
    if(solve_for_variable_terminal) {

        // try to match this production as
        // A → … • a … for some variable terminal
        // a.
        if(!scanner_unknown.match(curr_item->production)
        || !cfg.is_variable_terminal(a)) {
            continue;
        }
        
        // …

    // we know the terminal of this lexeme
    } else {

        // try to match this production as
        // A → … • a … where "a" is the terminal
        // of the current lexeme.
        if(!scanner_known.match(curr_item->production)) {
            continue;
        }
    }
    
    // …
}
    

Implementation Details

Each pattern has at least two things: a pointer to a fixed-length array (length 16) of void * pointers, and a function pointer that points to a function taking in a pointer to the aforementioned array and a reference to a production and returning a Boolean value. This limits patterns to have no more than 16 parts; however, I have not had to express any patterns nearly as complicated as that.

How could this scheme possibly be anywhere close to type safe? From a high-level perspective, the key is that we always pass around this pair of pointers together and never mix and match pointers from different patterns. From a low-level perspective, the real key is the function: each pattern is associated with a unique function pointer up to equivalence of pattern-part types.

Each of the above pattern expressions is associated with a type tag. For example, the pattern part A has the tag variable_tag, while ~A has the tag unbound_variable_tag, and cfg.__ has the tag any_symbol_string_tag, while cfg.__(dot) has the tag any_symbol_string_of_length_tag. Taken together with the overloaded ->* and + operators, a type tree of these tags is built up at compile time. Type trees of this kind are used extensively in various Boost libraries. A nice explanation of them can be found in this blog post by Matt Might.

The function pointer contained in a pattern is a pointer to a static member function of an instantiated C++ template, where one of the type parameters to the template is the type tag tree. In this static member function, other static methods are called on smaller and smaller parts of the pattern, where each smaller part pulls off a leftmost child of the type tree that can be found and contains the code needed to match the suffix of some symbol string against the remainder of the pattern. Each of these static methods are annotated with the inline function specifier to hint to the compiler that the pattern-matching code expanded through template instantiation should be flattened into a single function.

As you might have guessed, each slot in the array of pointers corresponds to one of the instantiated static member functions. Each member function ensures, by knowing the tag associated with the pattern part, that the pointer contained in one of the slots of the array is used in a type-safe way.

Pattern Matching in PDAs

Patterns aren't limited to CFGs: both PDAs and NFAs support patterns; however, the syntax used is much simpler. Below is an example of PDA generator patterns, as used in the construction of a CFG from a PDA:

// if there is a transition from p to r that brings the PDA from
// an empty stack to a stack with t on it and reads a...
typename PDA::generator_type p_to_r_push(pda.search(
    ~p,
    ~a,
    pda.epsilon(),
    ~t,
    ~r
));

/// ... and if there is a transition from s to q that pops t off
/// of the stack when reading b
typename PDA::generator_type s_to_q_pop(pda.search(
    ~s,
    ~b,
    t,
    pda.epsilon(),
    ~q
));
    

The ~ operator has the same meaning as it does in CFG patterns. The five parameters to the search method are source state, input symbol, pop symbol, push symbol, and sink state, respectively.

Final Thoughts

Patterns in FLTL can be extremely expressive. Extra effort has been put to hiding the internals (namely: template hell) of patterns away behind a nice interface. Patterns allow one to work with the FLTL data structures declaratively, a major step up from the previous implementations of Grail+. However, when using CFG patterns, one must be aware of operator precedence and the lifetime of variables bound by patterns.


Comments

There are no comments, but you look like you have something on your mind.


Comment