Unger Parsing

posted by Darryl on 6 Mar 2015

In this third post in my series on parsing inputs that might have errors (previous post here), I'm going to talk about Unger parsing. Unger parsing is a simple parsing technique that can handle ambiguous grammars quite nicely, and turns out to also be rather well suited to handling errors as well. I'll introduce Unger parsing, since I suspect most people are unfamiliar with it, then discuss it's application to error handling, and then discuss its drawbacks.

What is Unger Parsing?

Suppose we have a simple context free grammar for mathematical expressions such as the following:

exp → var
exp → exp op exp
var → x
var → y
var → z
op → +
op → -

As you might expect, this generates sentences such as x + y, and z - z + z, and so forth. The idea behind Unger parsing is to take the input string — let's use x + y — and try each possible rule for the given goal symbol by splitting the string into however many parts as there are symbols in the right hand side of the rule, and then recursing. So for example, if we try to parse x + y as an exp, we might try the first rule, exp → var which says we need to parse x + y as a single substring labeled var. Then we would try a rule for var, such as the rule var → x, and again we parse x + y as a single string, containing the symbol x, which fails.

We might have instead chosen to parse with the second rule for exp, which says to parse x + y as three strings labeled exp, op, and exp, in that order. We can split x + y into three strings in various ways due to the empty string, but one of them, which splits it as x, +, and y will succeed. Variants of this algorithm disallow splitting into the empty string, but we want to include it.

This process is possibly better visualized using some tables, so here's the first, failed parse tabularly. We start with our goal parse, spanning the whole input, like so:

Goal:      exp
          =====
Input:    x + y

After picking the erroneous rule exp → var, we get the new goal var, which we'll add below the old goal, like so

           exp
          -----
Goal:      var
          =====
Input:    x + y

We then expand again, with var → x, which fails because the goal — a string of terminals — is not equal to the part of the input it covers:

           exp
          -----
           var
          -----
Goal:       x
          =====
Input:    x + y

On the other hand, the successful parse unfolds like so: we start as we did before, with the goal as exp, but now we expand with the rule exp → exp op exp. Because the right hand side of this has three symbols, we now have three new goals, and so we have to split the input into three substrings. We can show all the possible ways to split the input below the double line:

                 exp
          -----+-----+-----
Goal:      exp | op  | exp
          =================
Input:     x+y |     |
           x+  | y   |
           x+  |     |  y
           x   | +y  |
           x   | +   |  y
           x   |     | +y
               | x+y |
               | x+  | y
               | x   | +y
               |     | x+y

Now for each of our sub-goals, we repeat this process. Let's pick randomly to expand op first, and let's choose the rule op → +:

                 exp
          -----+-----+-----
           exp | op  | exp
          -----+-----+-----
Goal:      exp | +   | exp
          =================
Input:     x+y |     |
           x+  | y   |
           x+  |     |  y
           x   | +y  |
           x   | +   |  y
           x   |     | +y
               | x+y |
               | x+  | y
               | x   | +y
               |     | x+y

Since this expands to a terminal, we can compare it against the middle column of the splits of the input. Only one of the splits has a middle column that's equal to +, so all the others are discarded:

                 exp
          -----+-----+-----
           exp | op  | exp
          -----+-----+-----
Goal:      exp | +   | exp
          =================
Input:     x   | +   |  y

If we continue expanding we could of course try the operator rule for expressions again, but that will of course fail. One choice, however, leads to the correct parse. Collapsing independent choices into a single row in the goal history, we end up with:

                 exp
          -----+-----+-----
           exp | op  | exp
          -----+-----+-----
           var | +   | var
          -----+     +-----
Goal:      x   |     | y
          =================
Input:     x   | +   |  y

Error Handling

This kind of parsing lends itself rather nicely to error handling precisely because it breaks the input into sub-strings which it assigns to the sub-goals. Suppose, for instance, that the input was actually x * y, where we had used an op that isn't defined in our grammar. We would get a variety of parses for the various splits of the input as before, but one of them looks like thiis:

                 exp
          -----+-----+-----
           exp | op  | exp
          -----+     +-----
           var |     | var
          -----+     +-----
Goal:      x   |     | y
          =================
Input:     x   | *   |  y

Trying to expand op fails to yield any successes because there's no rule for that, but wait! We want to record errors in the parse tree! So we could allow this to match as an error, thereby producing a parse tree with an error node! Perhaps annotated in the table with ERR like so:

                 exp
          -----+-----+-----
           exp | op  | exp
          -----+-----+-----
           var | ERR | var
          -----+     +-----
Goal:      x   |     | y
          =================
Input:     x   | *   |  y

Now we can happily produce some kind of parse tree that logs the error:

(exp (exp (var x))
     (op (ERR *))
     (exp (var y)))

Unger parsing therefore gives us a very nice way to handle parse errors.

So what's the problem?

The problem with Unger parsing is, it's horrendously inefficient. All that non-determinism in where to split the input string is a real nightmare. There are of course various ways to cope with this — I believe in fact one method can even produce a quadratic or cubic algorithm — but that won't help once error handling is introduced.

The example errorful parse that I showed above is a little misleading, as there are lots of other errorful parses as well, all of which are pretty bad. Here are three:

           exp
          -----
Goal:      var 
          -----
           ERR
          =====
Input:     x*y

Parse:    (exp (var (ERR x*y)))
                 exp
          -----+-----+-----
           exp | op  | exp
          -----+-----+-----
           var | ERR | var
          -----+     +-----
Goal:      ERR |     | y
          =================
Input:     x*  |     |  y

Parse:    (exp (exp (var (ERR x*)))
               (op (ERR))
               (exp (var y)))
                             exp
          -----+-----+-----------------------------
           exp | op  |            exp
          -----+-----+-----+-----+-----------------
           var | ERR | exp | op  |       exp
          -----+     +-----+-----+-----+-----+-----
           x   |     | var | ERR | exp | op  | exp
               |     +-----+     +-----+-----+-----
               |     | y   |     | var | ERR | var
               |     |     |     +-----+     +-----
               |     |     |     | ERR |     | ERR
          =========================================
Input:     x   | *   |  y  |     |     |     |

Parse:    (exp (exp (var x))
               (op (ERR *))
               (exp (exp (var y))
                    (op (ERR))
                    (exp (exp (var (ERR)))
                         (op (ERR))
                         (exp (var (ERR))))))

In fact, there are infinitely many ways to parse into an errorful tree. You might sum this up in a maxim: if there are many ways to be right, there are way more ways to be wrong.

You might be able to patch this up by bounding the number of empty substrings you try, and also by enforcing some kind of rule ordering so that you try simpler rules before more complex rules and discard complex rules when the simpler rules succeed (thereby blocking the entire third table there, with all that useless expansion on the right). But these all feel like hacks. The "right" answer, the one we'd really like to end up with, is the one I gave before, that simply notes that the symbol * doesn't parse as an op. We shouldn't even consider anything else, they should just be completely irrelevant to the process.

My next post will be on some more ideas, probably ones which are also inadequate. I haven't decided yet. This is all very much an experimenters notebook right now.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).