RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, C++, Functional Programming, Applications,

Traditional Parsing Methods

One parsing technique that I sometimes use is Top Down Operator Precedence Parsing (TDOP). TDOP parsers have been discussed in many other places as well. Unfortunately, I have not seen TDOP described in terms of left-corner parsing (except for a passing comment in this thesis).

The purpose of this post is to set the stage for a later discussion about TDOP parsing. This post will introduce top-down and bottom-up parsing, then combine the two methods to introduce left-corner parsing. Also, the top-down parsing language (TDPL) will be briefly mentioned as its semantics relate to TDOP.

Traditional Parsing Methods

Before getting into TDOP, it's important to have at least some background in non-TDOP parsing methods. This is because TDOP can be understood as a combination of several different parsing methods.

Parsing is a language acceptance problem. That is, a parser is a function that accepts or rejects a string. If a parser accepts a string then we say that string is in some language. The opposite is said of rejection. A string in this case means a sequence of zero or more symbols. In the English language, symbols are Latin/alphabetic characters. In the C programming language, symbols are reserved words, variables, literals, and punctuation (e.g. void, "foo", >, etc.).

Typically, a parser accepts the language generated by a context-free grammar (CFG). CFGs are a formalism for describing some languages. The following is an example CFG that generates simple arithmetic expressions:

E → "(" E ")"
E → A

A → M "+" A
A → M "-" A
A → "-" A
A → M

M → N "×" M
M → N "÷" M
M → N

N → "0"
N → "1"
  ⋮
N → "10"

Note: ignore the unusual placement of the parentheses and the right-associativity of the operators described by the grammar.

The name to the left of the is called a variable or a non-terminal. Something in quotes is called a token, or terminal. Both terminals and non-terminals are considered symbols. Terminals can be thought of as the letters of one's language.

The itself is a relation which says that non-terminal on the left-hand side can generate the language on the right-hand side. This combination is called a production.

Note: the rest of this article will focus on parsing strings from left-to-right. The following examples detailing various parsing methods assume that our parsers alway guess correctly. Finally, we assume that our grammars are ε-free. That is, the right-hand side of a production is never empty (with one exception).

Top-Down Parsing

As its name implies, top-down parsing proceeds top-down. In the case of the above expression grammar, the "top" starts off as E. The action of going "down" involves one of two things:

  1. Replacing a non-terminal with something that it is related to (the right-hand side of ).
  2. Consuming a terminal.

Right-hand sides of productions contain both terminals and non-terminals. In replacing a non-terminal with ones of its right-hand sides, we set up expectations about the structure of later parts of the string. For example, suppose we want to parse "(2 × 3)". Parsing will proceed as follows:

Step Action Expectations Remainder of string
1 start E (2 × 3)
2 replace E → "(" E ")" "(" E ")" (2 × 3)
3 consume "(" E ")" 2 × 3)
4 replace E → A A ")" 2 × 3)
5 replace A → M M ")" 2 × 3)
6 replace M → N "×" M N "×" M ")" 2 × 3)
7 replace N → "2" "2" "×" M ")" 2 × 3)
8 consume "2" "×" M ")" × 3)
9 consume "×" M ")" 3)
10 replace M → N N ")" 3)
11 replace N → "3" "3" ")" 3)
12 consume "3" ")" )
13 consume ")"
accept

If—as a side-effect of parsing a string—one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:

Top-Down Parsing Language

Brief mention needs to be given to the top-down parsing language (TDPL). The TDPL formalizes the behavior of many top-down parsers. A key difference between a TDPL grammar and a CFG is that productions are totally ordered in a TDPL grammar.

For example, if the productions of the above CFG were totally ordered according to their text order, then a parser cannot try the second production (E → A) without first failing to parse according to the first production (E → "(" E ")").

Bottom-Up Parsing

We can characterize top-down parsers as making "global" decisions. Their expectations about the future structure of the as-of-yet unseen parts of the string are evidence of this. On the other hand, bottom-up parsers operate "locally". That is, they make decisions based only on the structure of the part of the string that they have already seen.

The consequence of local decision making is that bottom-up parsers discover sub-structures of the parsed string before they discover super/structures. In theory, a bottom-up parser has no expectations about the remainder of the string to be parsed. In practice, common bottom-up parsers implicitly make use of top-down information.

Bottom-up parsers typically perform two main actions: shift and reduce.

  1. Shifting is similar to consuming to the extent that our cursor into the string being parsed moves forward by one symbol. This is equivalent to removing the first symbol of the input string.

    Unlike top-down parsers, bottom-up parsers do not maintain a sequence of expectations. Instead, they operate on a partially parsed substring of the input string.

    Shifting involves taking the first symbol from remainder of the input string and appending it to the end of the partially parsed string.
  2. Reducing operates on a suffix of the partially parsed string. A reduction involves taking a suffix of the partially parsed string, matching it against the right-hand side of a production, and then replacing it with the left-hand side of a production (non-terminal).

For example, suppose we want to parse "(2 × 3)". Parsing will proceed as follows:

Step Action Partial parse Remainder of string
1 start (2 × 3)
2 shift "(" "(" 2 × 3)
3 shift "2" "(" "2" × 3)
4 reduce N → "2" "(" N × 3)
5 shift "×" "(" N "×" 3)
6 shift "3" "(" N "×" "3" )
7 reduce N → "3" "(" N "×" N )
8 reduce M → N "(" N "×" M )
9 reduce M → N "×" M "(" M )
10 reduce A → M "(" A )
11 reduce E → A "(" E )
12 shift ")" "(" E ")"
13 reduce E → "(" E ")" E
accept

If—as a side-effect of parsing a string—one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:

Left-Corner Parsing

Left-corner parsing (LC) is a parsing technique that makes decisions based on top-down and bottom-up information.

In the case of the bottom-up parser above, it appears that we were lucky that the sequence of shifts and reductions ended up reducing the entire string to an E. Strictly speaking, the goal of the above bottom-up parser was exactly that: reduce a string to E. If our expression were very long, then it wouldn't be clear until near the end of a bottom-up parse that our parser might have a chance of reaching its goal of E.

An LC parser attempts to satisfy multiple goals, including the end goal of reducing the string to E. An LC parser predicts substructures present in the remainder of the string, and attempts to parse those sub-structures bottom-up. But the prediction step sets up expectations about the structure of unseen parts of the string, which is a top-down approach.

In fact, LC parsers alternate between bottom-up and top-down parsing. Alternation is possible because an LC parser maintains a list of goals (analogous to our top-down expectations), a list of predictions, and a partial parse of the input string (as in a bottom-up parser). An LC parser operates on its input string and these three lists in the following way:

  1. Repeat:
    1. If the head of the goal list is a terminal, then consume the terminal and shift the first symbol of the remainder of the input string onto the end of the partial parse. If the goal terminal does not match the first symbol of the string then reject.

      If the head of the goal list is a non-terminal, then attempt to reduce a suffix of the partial parse to the to the goal non-terminal. If such a reduction is possible, then remove the non-terminal from the head of goal list and update the partial parse accordingly.

      This step is repeated until the goal list remains unchanged.
    2. If β is the last symbol of the partial parse, then find a production of the form "α → β γ" where γ is a string of zero-or-more symbols. β is said to be a left corner of α. Left corners can be both terminals and non-terminals. If we weren't restricting ourselves to ε-free CFGs, then left corners do not necessarily appear immediately following the "→"!

      Place γ and α on the head of the goal list, so that the first symbol (if any) of γ is our next goal.

      If the goals list is changed then return to the step 1.1.
    3. If neither of the previous two steps changed the goals list, then shift a symbol from the remainder of the input string onto the end of the partial parse.

      If no such symbol can be shifted, then reject the string. Otherwise, return to step 1.2.
  2. Stop when the goal list is empty.

For example, suppose we want to parse "(2 × 3)". Parsing will proceed as follows:

Step Action Goals Partial parse Remainder of string
1 start (2 × 3)
2 (1.1) no change to goals list
(2.2) no change to goals list
shift "(" 2 × 3)
3 (1.1) no change to goals list
corner E"(" E ")" E ")" E "(" 2 × 3)
4 (1.1) no change to goals list
(1.2) no change to goals list
shift "2" E ")" E "(" "2" × 3)
5 (1.1) no change to goals list
corner N"2" N E ")" E "(" "2" × 3)
6 reduce N → "2" E ")" E "(" N × 3)
7 (1.1) no change to goals list
corner MN "×" M "×" M M E ")" E "(" N × 3)
8 consume "×" M M E ")" E "(" N "×" 3)
9 (1.1) no change to goals list
(1.2) no change to goals list
shift "3" M M E ")" E "(" N "×" "3" )
Step Action Goals Partial parse Remainder of string
10 (1.1) no change to goals list
corner N"3" N M M E ")" E "(" N "×" "3" )
11 reduce N → "3" M M E ")" E "(" N "×" N )
12 reduce M → N M E ")" E "(" N "×" M )
13 reduce M → N "×" M E ")" E "(" M )
14 (1.1) no change to goals list
corner AM A E ")" E "(" M )
15 reduce A → M E ")" E "(" A )
16 reduce E → A ")" E "(" E )
17 consume ")" E "(" E ")"
18 reduce E → "(" E ")" E
accept

If—as a side-effect of parsing a string—one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:

Compared to the other two methods, this seems like a lot of work for nothing! Also, there is some amount of magic happening: recall that we are operating under the assumption that every action taken will be the correct one. In practice, one constructs a table and "cheats" when deciding which actions to take.

Summary

Top-down and bottom-up parsing were covered to set the stage for left-corner parsing and the TDPL, which provide context for the behavior of TDOP parsers. My next post will go into TDOP and how it relates to left-corner parsing and the TDPL.


Comments

There are no comments, but you look like you have something on your mind.


Comment