RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, C++, Functional Programming, Applications,

It is with great sadness that I report the passing of my friend, colleague, and mentor: Dr. Sheng Yu. I knew Sheng in the past three and a half years of his life. Sheng was twice my professor, twice my employer, and my undergraduate thesis supervisor.

Often I would pop in to Sheng's office on the third floor of Middlesex College at The University of Western Ontario. On his desks were towers of books and papers; it baffled me that they never fell. In his office, we would talk--sometimes for hours--about his past students and what they were up to, about parsing techniques, finite automata, regular languages and their operations, and object-oriented programming.

I prefaced each of our e-mail correspondences with the far too formal "Prof. Sheng Yu". Goodbye Prof. Sheng Yu; you will be missed.

Update: Grail+ is now on GitHub at https://github.com/pgoodman/Grail-Plus.

Well, I've finally submitted my undergraduate thesis project's final report. My project was to develop the newest version of Grail+. Grail+ is a set of command line tools for manipulating non-deterministic finite automata (ε-NFAs), non-deterministic pushdown automata (ε-NPDAs), and context-free grammars (CFGs). Grail+ is built on top of the Formal Language Template Library (FLTL), a library that I developed for representing and symbolically manipulating CFGs, ε-NFAs, and ε-NPDAs. Over the past several months I've worked hard and built Grail+ and the FLTL from the ground up. Together, they represent around 12,000 lines of C++.

My report can be found here. The report is 49 pages long. For anyone reading this blog, the most interesting part of the report is the implementation discussion. Unfortunately, I had to leave a lot out of the report as it is already quite long. As such, the API described in the report is incomplete and some of the interesting discussions were cut short.

I have licensed Grail+ under the MIT License. I am interested in collaborating with others to continue the development of the project. The source code of Grail+ can be found here.

My undergraduate thesis work on Grail+ is almost complete. As part of this work, I developed the formal language template library (FLTL). The library contains generic representations for context-free grammars (CFGs), non-deterministic pushdown automata (PDAs), and epsilon non-deterministic finite automata (NFAs). The focus of the FLTL is to provide the basic operations needed to symbolically manipulate the aforementioned formalisms—not raw performance. This post will focus on a particularly powerful feature of the FLTL: pattern matching.

Pattern Matching in CFGs

The following are some examples of CFG production patterns that I use in the Earley parser of Grail+. They will be explained below:

// variables used in the patterns
unsigned dot;
variable_type A;
variable_type B;
terminal_type a;
production_type prod;

pattern_type scanner_known((~A) --->* cfg.__(dot) + a + cfg.__);
pattern_type scanner_unknown((~A) --->* cfg.__(dot) + ~a + cfg.__);
pattern_type predictor((~A) --->* cfg.__(dot) + ~B + cfg.__);
pattern_type completer((~A) --->* cfg.__(dot));

generator_type predictor_related(cfg.search(~prod, B --->* cfg.__));
pattern_type completer_related(cfg._ --->* cfg.__(dot) + A + cfg.__);

    

There are two key types here: pattern_type and generator_type. An individual pattern, represented by pattern_type, is able to match, destructure, and bind variables to parts of a production. A generator, represented by generator_type, is a more general abstraction (similar to iterators) that can operate using patterns, among other things.

A number of different syntactical elements are being used: cfg._, cfg.__, cfg.__(dot), --->*, a, ~A, and +.

cfg._
This represents an arbitrary symbol. A symbol is either a variable/non-terminal or a terminal of the grammar on which we are operating. This pattern component will successfully match against any symbol, but will not bind the value of the symbol to anything.
cfg.__
This represents an arbitrary symbol string of any length. This pattern component will always match, even if one is attempting to match an empty symbol string, and will never bind the matched string to anything.
cfg.__(dot)
This represents an arbitrary symbol string of length exactly dot, where dot is an unsigned integer. Like cfg.__, the substring matched is never bound to any variable. Note: the value of dot is not passed by value. dot is passed by reference and a pointer to dot is stored. As such, the value of dot is resolved each time the pattern is invoked. That is, as dot changes, so does the pattern. Clearly, if dot lives on the stack and a pattern is returned to a lower stack frame then one will encounter undefined behaviour. This will be a recurring theme—the FLTL does not guard against such undefined behaviour as that would take away from the expressiveness of the pattern-matching facilities.
--->*
This abuse of C++'s post-decrement and member-pointed-to-by-object-pointed-to operators signifies the → relation of CFGs that relates grammar variables to strings of terminals and non-terminals to form the productions of a grammar. The -- operator creates and allocates the actual pattern data and the ->* operator adds in the first pattern part to the pattern.
a
The variable a, as part of a pattern, represents a bound variable. As the pattern is being matched against a string, if the next symbol in the string is the same as the symbol whose value a has then the pattern part succeeds. That is, if a is the terminal t, then the pattern part a only succeeds if it is being matched against t in the production. Similar to cfg.__(dot), a pointer to a is stored rather than the value of a so that the pattern sees all changes to a. Bound pattern parts don't only work for symbols and stack variables: they also work for symbol strings and references to symbols stored elsewhere.
~A
The variable A has type variable_type and so this expression represents an unbound variable. As the pattern is being matched against a string, if the next symbol to be matched is a variable and the next pattern part is an unbound variable then we succeed at matching the pattern part and bind the variable symbol to the unbound variable. That is, if ~A appears in a pattern and is the pattern part that must match "S" in the string "αSβ" and "S" is a variable then the variable A will be assigned the value "S". Similar to cfg.__(dot), a pointer to A is stored in order to perform this binding. If A had type terminal_type then it can only be bound to grammar terminals. If A had type symbol_type then it can be bound to any grammar symbol.
+
The implied meaning of the overloaded infix + operator represents pattern part concatenation. The actual meaning is juxtaposition, as no real concatenation is being done behind the scenes. The + operator can only be used on the right-hand side of a --->*. Outside of patterns, the + operator has the explicit meaning of concatenation. Thus, for the pattern "A --->* a + b + c + d", the subtle change of "A --->* a + (b + c) + d" is a completely different pattern that behaves differently and is potentially unsafe. It is unfortunate that patterns depend so heavily on operator precedence; however, such is life when one is abusing operator overloading.

Using Patterns

Patterns can be used "literally" or assigned to named locations and used multiple times. The below example is from my Earley parser where I use the above named locations for patterns to decide which action the parser should take. Note that the first thing I do is change the value stored at the location named by dot. Every pattern sees this change in value.

// move the dot
dot = item->dot;

// the item has the form A → … • B …
if(predictor.match(curr_item->production)) {

    // if B is nullable then add A → … B • … to
    // the item set
    if(is_nullable[B.number()]) {
        // …
    }

    // …

// the item has the form A → … •
} else if(completer.match(curr_item->production)) {

    // …

// try to "solve" this terminal
} else if(not_at_end) {

    // we don't know the terminal of this lexeme, lets
    // see if we can substitute a variable terminal
    if(solve_for_variable_terminal) {

        // try to match this production as
        // A → … • a … for some variable terminal
        // a.
        if(!scanner_unknown.match(curr_item->production)
        || !cfg.is_variable_terminal(a)) {
            continue;
        }
        
        // …

    // we know the terminal of this lexeme
    } else {

        // try to match this production as
        // A → … • a … where "a" is the terminal
        // of the current lexeme.
        if(!scanner_known.match(curr_item->production)) {
            continue;
        }
    }
    
    // …
}
    

Implementation Details

Each pattern has at least two things: a pointer to a fixed-length array (length 16) of void * pointers, and a function pointer that points to a function taking in a pointer to the aforementioned array and a reference to a production and returning a Boolean value. This limits patterns to have no more than 16 parts; however, I have not had to express any patterns nearly as complicated as that.

How could this scheme possibly be anywhere close to type safe? From a high-level perspective, the key is that we always pass around this pair of pointers together and never mix and match pointers from different patterns. From a low-level perspective, the real key is the function: each pattern is associated with a unique function pointer up to equivalence of pattern-part types.

Each of the above pattern expressions is associated with a type tag. For example, the pattern part A has the tag variable_tag, while ~A has the tag unbound_variable_tag, and cfg.__ has the tag any_symbol_string_tag, while cfg.__(dot) has the tag any_symbol_string_of_length_tag. Taken together with the overloaded ->* and + operators, a type tree of these tags is built up at compile time. Type trees of this kind are used extensively in various Boost libraries. A nice explanation of them can be found in this blog post by Matt Might.

The function pointer contained in a pattern is a pointer to a static member function of an instantiated C++ template, where one of the type parameters to the template is the type tag tree. In this static member function, other static methods are called on smaller and smaller parts of the pattern, where each smaller part pulls off a leftmost child of the type tree that can be found and contains the code needed to match the suffix of some symbol string against the remainder of the pattern. Each of these static methods are annotated with the inline function specifier to hint to the compiler that the pattern-matching code expanded through template instantiation should be flattened into a single function.

As you might have guessed, each slot in the array of pointers corresponds to one of the instantiated static member functions. Each member function ensures, by knowing the tag associated with the pattern part, that the pointer contained in one of the slots of the array is used in a type-safe way.

Pattern Matching in PDAs

Patterns aren't limited to CFGs: both PDAs and NFAs support patterns; however, the syntax used is much simpler. Below is an example of PDA generator patterns, as used in the construction of a CFG from a PDA:

// if there is a transition from p to r that brings the PDA from
// an empty stack to a stack with t on it and reads a...
typename PDA::generator_type p_to_r_push(pda.search(
    ~p,
    ~a,
    pda.epsilon(),
    ~t,
    ~r
));

/// ... and if there is a transition from s to q that pops t off
/// of the stack when reading b
typename PDA::generator_type s_to_q_pop(pda.search(
    ~s,
    ~b,
    t,
    pda.epsilon(),
    ~q
));
    

The ~ operator has the same meaning as it does in CFG patterns. The five parameters to the search method are source state, input symbol, pop symbol, push symbol, and sink state, respectively.

Final Thoughts

Patterns in FLTL can be extremely expressive. Extra effort has been put to hiding the internals (namely: template hell) of patterns away behind a nice interface. Patterns allow one to work with the FLTL data structures declaratively, a major step up from the previous implementations of Grail+. However, when using CFG patterns, one must be aware of operator precedence and the lifetime of variables bound by patterns.

Recently I have been doing a lot of C++ programming. Some of the projects I was and currently am working on include:

Through these projects, I've exposed myself to more C++ and I am finding that I (gasp!) like using it in some of my programming activities. That is not to say that C++ is my tool of choice for all projects, but for the above three it certainly was.

Most recently, I have been working on the FLTL and part of the requirements of the FLTL is that it work on a number of platforms. There are usually two main problems that come up when I try to make this work:

  • Compiler errors.
  • Unexpected runtime errors.
In this post I will not address the former as it is less interesting. Instead, I will talk about a very simple testing system I made to help me discover the latter. Usually I do not write many tests that I can have automatically run. This is likely a failing on my part; sometimes I don't write tests because I'm lazy and other times because that would mean I'm not working directly on the code that I really want to be working on. In any event, it was necessary to write tests for this project as I discovered some some bugs that manifest on the various platforms on which I am testing my code.

To start things off, the following code illustrates how I use the system.

fltl/test/cfg/CFG.hpp
#ifndef FLTL_CFG_HPP_
#define FLTL_CFG_HPP_

#include "fltl/test/Test.hpp"
#include "fltl/lib/CFG.hpp"

namespace fltl { namespace test { namespace cfg {

    FLTL_TEST_CATEGORY(test_equivalence_relations,
        "Test for equivalence of variables, terminals, symbols, and symbol strings."
    );

    FLTL_TEST_CATEGORY(test_productions,
        "Test that productions are correctly added to the grammar and that duplicates are ignored."
    );
}}}

#endif /* FLTL_CFG_HPP_ */
fltl/test/cfg/CFG.cpp
#include "fltl/test/cfg/CFG.hpp"

namespace fltl { namespace test { namespace cfg {

    void test_equivalence_relations(void) throw() {

        using fltl::lib::CFG;

        CFG<char> cfg;

        CFG<char>::var_t S(cfg.add_variable());
        CFG<char>::var_t T(cfg.add_variable());

        CFG<char>::term_t a(cfg.get_terminal('a'));

        /* ... */

        CFG<char>::sym_t S_sym(S);
        CFG<char>::sym_t a_sym(a);

        FLTL_TEST_EQUAL(S,S);
        FLTL_TEST_EQUAL(S, S_sym);
        FLTL_TEST_EQUAL(S_sym, S);
        FLTL_TEST_NOT_EQUAL(S, T);
        FLTL_TEST_NOT_EQUAL(S, a);
        FLTL_TEST_NOT_EQUAL(a, S);

        /* ... */
    }
}}}
main.cpp
#include "fltl/test/Test.hpp"
#include "fltl/test/cfg/CFG.hpp"

int main(void) {

    fltl::test::run_tests();

    return 0;
}

Hosted by imgur.com

As is shown above, test cases are contained within functions. A function can contain as many tests cases as one likes. The idea is that the function represents a category of test cases. Then, to add a category of test cases to the system, one simply declares (in a header file) that a function is a test category (using the FLTL_TEST_CATEGORY macro).

The way this works is fairly straightforward:

  • Test cases are basically assertions. If the condition of the test case is true then the test succeeds, otherwise the test fails. Successes and failures are recorded and printed out.
  • Each test category declaration declares the function and also a static variable whose type is parameterized over the function. When static variable constructors are called at runtime, the test case category types chain themselves into a linked list.
  • When we want to run the tests, we invoke fltl::test::run_tests() and that loops through the linked list and invokes each test case category function.

The code that includes the testing macros and types can be found here:

prev     next