Stackless Operator Precedence

I got operator precedence for my parser working properly yesterday and am well into the interpreter (it can interpret expressions!). I thought that the way I solved the problem of generating a tree from some infix expressions was both clever and hackish. The normal solution involves creating a stack, pushing a frame in, parsing any operands/operators we find and putting them into a parse tree on that stack frame. If a '(' is found then a new frame is pushed onto the stack and all operations within the sub-expression are put on the parse tree of the newer stack frame. When a ')' is found, the topmost frame is popped off the stack and its parse tree is merged into the new top frame's parse tree. Interestingly, I managed to achieve the same effect without a stack at all.

The key to this is the tokenizer. My tokenizer is iterative, so I only parse the input text for a token when I ask for a token. For a simple calculator, a tokenizer would not be necessary; however, it was for the language. I have an enumeration of all the possible tokens for my language. I conveniently set it up so that the first token is an unknown token (commonly used as the right link in a cons cell), and then the first twenty tokens are all infix operators. I have a function to register operators along with their left and right precedences, and so they are all included in a convenient operators array. So far it seems as though the tokens are irrelevant. I will explain.

My parse tree is a binary tree laid out using cons cells. Eg: 1 + 2 * 3 becomes:

     (1)  ()
     (2)   ()

Each empty () is a node with a token of SW_TOKEN_UNKNOWN (token: 0), both + and * are nodes on the tree with token symbols whose values are less than twenty. So, this just shows how I have the parse trees made. Each node knows its parent node, and each node has a left and right child (one or both of which might be null), and whenever a node is added as a child to another, a new leaf is created and the parser is aware of this leaf[1]. Given this information, we can start onto a stackless model.

Imagine that the parser has just accepted the infix operator *. A useful assumption to make is that the parser works properly and that any prior infix operators--in this case adding 1 + 2--have been parsed into the tree correctly. We find the * operator, and given the handy operators table that I mentioned above, we know its left and right binding powers. Since we know what leaf we're on, we can move upward in the tree looking for nodes whose token symbols are <= 20 (unknown nodes and infix operator nodes). If we find a node that has a token symbol greater than twenty, we stop. If we find a node whose right binding power is greater[2] than the left binding power of the infix node that we have parser (the *) then we will record it, overwriting any previously recorded nodes of this sort. If we find a node whose right binding power is less than or equal to the left binding power of the parsed infix operator then we break out of the loop and hope that we've recorded an infix operator in a previous iteration of the loop.

Okay, so that seems like an odd process. Here is another way of looking at this. We have found an infix operator. If no prior infix operators have been parsed then our job is easy: it becomes the root to whatever was on the left of it and whatever is to the right of it. If we have found prior operators then our job is tougher. We need to find exactly which previous operator to bind this with. Consider this: 2^3^4+5. We need to turn this into (2^(3^4))+5. Since the operators effectively become the roots to the expressions on either side of them, then we can say that the + needs to bind with the first ^. An exponent operator has a higher binding power than an addition operator, and it also binds to the expressions to the right of it more strongly then it binds to the expressions to the left of it--hence why we get 2^(3^4) and not (2^3)^4.

When the + is found, we move up in the tree. We find the second ^, and realize that it has a higher binding power than the + operator. This is good because the + might bind to it. We record the second ^ and keep going up the tree. We find the first ^, it has a higher binding power, so we record it. We try to move up the tree--however there is no parent node so the loop is stopped. The node to bind with was the last one recorded--the first ^. Because the ^ has a higher binding power than the +, it means every exponent operation needs to happen before any addition operation. So, we make the tree of 2^(3^4) the child of our + operator. The case is reversed when an operator is found with the same or a lower binding power. The + operator and its left/right expressions become the children of that operator.

Now, on to the key to stackless sub-expressions. This whole scheme works because infix operators' token symbols are represented by a known range of numbers (1-20, including 0 as empty nodes). The above algorithm works by looking for branches with token symbols in that range. If it stumbles upon a branch with any symbol outside of that range then it will simply stop. So, to deal with sub-expressions, all that is needed is to attach another node to the last known leaf of the tree! This way no stack is needed and no merging back onto the tree is needed because the sub-expression is already part of the tree! It works because the sub-expression node, with a token symbol out of the infix operator's range, represents a road block to any infix operations going on within it (acting as the root of the sub-expression).

1. At first, the choice to use cons cells was random. It has, however, paid off significantly. Since every node is a cons cell, every branch has the same format, and all sub-nodes of the branch also follow this format. When a node is added, a cons cell is made. That means that a blank node added as the right child to the parent node. The node we want to add is set as the left child of the cons cell and the right side of the cons cell is set to null.

2. This may seem backward depending on how you have implemented operator precedence in the past. My precedence--or rather: priority-- goes from weak (1) to strong (10).


There are no comments, but you look like you have something on your mind.