Parsing Malformed Input

posted by Darryl on 12 Feb 2015

In this post I want to talk about parsing inputs that might be malformed in some way. That is to say, this is about handling parse errors. Why might we want to do this? Well, if you're parsing a programming language, it's helpful to be able to inform the programmer why their code is malformed. If you're parsing a natural language, then there's a good chance that the speaker or writer won't provide perfectly grammatical sentences as input.

The Standard NLP Approach

In NLP, the typical way of coping with ungrammaticality is to use a grammar formalism that eliminates the concept of grammaticality entirely, replacing it with a measure of likelihood, typically by using a probabilistic context-free grammar (PCFG). All inputs are parsable, no matter what, but they have varying probabilities. The ones we humans consider more grammatical tend to get the more probable parses.
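To make the idea concrete, here's a minimal Haskell sketch of how a PCFG scores a tree: each rule carries a probability, and a tree's probability is the product of the probabilities of the rules used to build it. The rules and numbers here are invented for illustration, not taken from any real grammar:

-- A minimal PCFG sketch: every rule carries a probability, and a
-- tree's score is the product of the probabilities of the rules that
-- built it. No input is ever ruled out; some are just very unlikely.
data Tree = Leaf String String   -- category and word
          | Node String [Tree]   -- category and daughters

-- Hypothetical rule probabilities, e.g. P(S -> NP VP) = 0.9
ruleProb :: String -> [String] -> Double
ruleProb "S"  ["NP","VP"] = 0.9
ruleProb "NP" ["DT","NN"] = 0.5
ruleProb _    _           = 0.0001  -- everything else is merely unlikely

category :: Tree -> String
category (Leaf c _) = c
category (Node c _) = c

treeProb :: Tree -> Double
treeProb (Leaf _ _)      = 1.0  -- lexical probabilities omitted
treeProb (Node cat kids) =
  ruleProb cat (map category kids) * product (map treeProb kids)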

To some extent, this isn't so bad. After all, we do get a sense, as speakers of a language, that sentences can have shades of grammaticality. Even when we talk about sentences being ungrammatical, some ungrammatical sentences are worse than others. For instance, "Brutus stab Caesar" is way better than "the the the". Both are malformed, but you probably don't even consider the second one to be a candidate for being a sentence — it's just gibberish!

The NLP approach would parse both of these sentences happily, and just assign them lower probability than a nearby grammatical sentence. And "the the the" would probably get a much lower score than "Brutus stab Caesar". But this approach has a substantial flaw: probabilistic parsing produces multiple candidates, so we have to pick one, typically the most probable, and that choice is often not a very good or reasonable parse.

Consider, for example, the following ungrammatical sentence: "the man put the book the table". It's missing a preposition. Obviously it should be "the man put the book on the table", or something similar to that. But using the online version of the Stanford Parser, this input parses as

(NP
  (NP (DT the) (NN man) (NN put))
  (NP (DT the) (NN book))
  (NP (DT the) (NN table)))

That is to say, the Stanford Parser (at least the one from 2012) thinks the most likely parse for this is a complex noun phrase containing a sequence of three daughter noun phrases, and that's all. Bizarrely, if you change "the man" to "John", the parse is closer to being ideal:

(S
  (NP (NNP John))
  (VP (VBD put)
    (NP (DT the) (NN book))
    (NP (DT the) (NN table))))

And if you instead change "the book" to "it", it becomes... a little better and a little worse:

(S
  (NP (DT the) (NN man))
  (VP (VBD put)
    (S
      (NP (PRP it))
      (NP (DT the) (NN table)))))

We now get a sentence with the main verb "put", but for some strange reason, its direct object is a sentence consisting of just two noun phrases. This is all rather bizarre, because the error is solely a missing preposition, and yet only the middle parse comes even close to highlighting that.

The Compiler Approach

Compilers for programming languages also have to cope with errors, but unlike natural languages, programming languages are almost never ambiguous, and they have lots of delimiters that mark the boundaries of things. Parentheses, braces, keywords, commas, etc. all make parsing much simpler for compilers and interpreters. The typical response to an error is simply to stop parsing and spit out an error message. For instance, here's an error from the Haskell compiler GHC when given the input foo = if x else y else z:

ParseError.hs:1:12:
    parse error on input ‘else’

The error tells us exactly where the first errorful position is, and what caused it. That's very useful, much more useful than simply being told that the program has some low probability, as in the NLP approach. But simply stopping at the first sign of a problem isn't ideal. It would be much nicer if the parser could continue parsing and tell us "you wrote else but you probably meant then". Or consider this parse error from the slightly different program if x y else z:

ParseError.hs:1:14:
    parse error on input ‘else’

Almost the same error, just at a different position. Wouldn't it be nicer if instead the parser could say "I think you're missing then in between x and y"?

Different Approaches for Different Tasks

The NLP and compiler approaches to error handling exist because the two fields typically have very different goals in mind. A compiler simply cannot compile malformed source code into a program, so compilers take an all-or-nothing approach to error handling. On the other hand, most NLP applications can achieve their goals rather well with extremely fuzzy, probabilistic parsing. If you're doing sentiment analysis, for instance, you don't really care about correct parses, just about certain structural properties. In fact, you don't even care which particular structural properties you get; you only care that they're consistent. Even if the parse trees are wrong, if they're all wrong in the same way, you can (hopefully!) classify sentences by sentiment.

But Language Engine is something of a hybrid task. It deals with natural language, and so it has to cope with ungrammaticality more like the NLP approach does: it has to fail gracefully. But because Language Engine interprets the meanings of sentences and converts them to actions, it has to be fairly strict about what counts as grammatical, because otherwise you get sentences that have no meaning, like trying to write f(then). This calls for a hybrid approach to parse errors.

Example Errors

Before trying to tackle this problem, it'd be good to lay out the ideal behavior we would like to get from such a parser. I'll give examples both for English and for a simple mathematical expression language.

Abstractly, the problem we might want to solve is as follows: given an input x that doesn't parse, can we find some alternative input y that does parse, such that editDistance(x,y) is minimized? In fact, this might directly suggest an algorithm:

start with I containing just the original input
if every x in I fails to parse
then build the set I' containing next(x) for each x in I,
     where next(x) is all of the one-edit changes to x,
     and try parsing again with I'

For example, x+ fails to parse, so we instead make a single edit, producing x and x+y and try to parse these. Both succeed, so both are possible "corrections". If they had both failed, we could continue to the 2-step edits, and so on.
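Here's a minimal Haskell sketch of that naive algorithm, as a breadth-first search over edit neighborhoods. The parse function and the token alphabet are parameters; a real implementation would also want to deduplicate the frontier and cap the edit distance:

import Data.Maybe (mapMaybe)

-- Breadth-first search over edit neighborhoods: try to parse
-- everything in the current frontier I; if all fail, replace I with
-- all one-edit variants and try again.
recover :: ([tok] -> Maybe tree) -> [tok] -> [[tok]] -> [tree]
recover parse alphabet frontier =
  case mapMaybe parse frontier of
    []     -> recover parse alphabet (concatMap next frontier)
    parses -> parses
  where
    -- next x: all of the one-edit changes to x
    next toks = deletions ++ insertions ++ substitutions
      where
        n             = length toks
        deletions     = [ take i toks ++ drop (i+1) toks | i <- [0 .. n-1] ]
        insertions    = [ take i toks ++ t : drop i toks
                        | i <- [0 .. n], t <- alphabet ]
        substitutions = [ take i toks ++ t : drop (i+1) toks
                        | i <- [0 .. n-1], t <- alphabet ]

-- e.g. recover parseExpr exprTokens [tokenize "x+"], where parseExpr,
-- exprTokens, and tokenize are whatever your grammar provides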

This algorithm might in fact be very effective, but it's going to recompute a lot of things, because as described, it reparses from scratch. With an efficient parsing algorithm, this wouldn't matter much, and such a naive implementation might be acceptable. However, I'd like to explore possible techniques which can produce parse trees that mark nodes as errorful, so that we can not only find recoveries from errors, but explicitly demarcate errors. So instead of simply finding the closest correct parses and returning those, as the naive algorithm does, I'd like to get this:

input: x+
output:
  parse 1: (error:expr
             (expr x)
             (op +))
  parse 2: (expr
             (expr x)
             (op +)
             (error:expr))

This (error:T X Y Z) stuff means there was an error trying to parse the sequence X Y Z as a T.

The first parse finds that x is an expr and + is an op, as it should, but the whole thing fails to be an expr because of an error. The second parse successfully parses the whole thing as an expr, but fails to parse the string after + as an expr. That string was empty, of course, so the sequence after error:expr is also empty.
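To make this representation concrete, here's one way such trees might look in Haskell. The constructor names are my own invention, not a settled design:

-- Parse trees where any node can be marked as errorful.
-- (error:T X Y Z) corresponds to ErrorNode "T" [x, y, z].
data Tree = Leaf String String       -- category and token, e.g. (op +)
          | Node String [Tree]       -- a successful parse as the category
          | ErrorNode String [Tree]  -- a failed attempt at the category
  deriving Show

-- parse 1: (error:expr (expr x) (op +))
parse1 :: Tree
parse1 = ErrorNode "expr" [Leaf "expr" "x", Leaf "op" "+"]

-- parse 2: (expr (expr x) (op +) (error:expr))
parse2 :: Tree
parse2 = Node "expr" [Leaf "expr" "x", Leaf "op" "+", ErrorNode "expr" []]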

If we could parse into these sorts of trees, a whole slew of options becomes available for recovering from these errors. For instance, for each error, we can take the context around the error, together with the contents of the error node, and feed that into a heuristic, extracted from a corpus using machine learning techniques, to determine the most likely fix, and offer that to the user. We could also do some of the edit distance stuff discussed previously, but now on the much smaller localized errors.
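The context-gathering step might look like this sketch, reusing the Tree type above:

-- Walk a Tree, pairing every error node with its siblings: the local
-- context a repair heuristic would inspect. Note that the sibling
-- list includes the error node itself.
errorsInContext :: Tree -> [(Tree, [Tree])]
errorsInContext = go []
  where
    go siblings t = here ++ concatMap (go kids) kids
      where
        here = case t of
                 ErrorNode _ _ -> [(t, siblings)]
                 _             -> []
        kids = case t of
                 Leaf _ _       -> []
                 Node _ ks      -> ks
                 ErrorNode _ ks -> ks

-- errorsInContext parse2 ==>
--   [(ErrorNode "expr" [], [Leaf "expr" "x", Leaf "op" "+", ErrorNode "expr" []])]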

Another cool possibility for natural language is that we can even use this to try to learn new things by example. Whenever our parser encounters something it doesn't understand, such as a new verb, we can trigger an error recovery heuristic that tries to figure out what kind of word it is. For instance, consider the sentence "Brutus florbled Caesar a book". Even though you've never encountered the word "florbled" before, the sentence provides enough context that we can guess it's a ditransitive verb, "florble", because of the "-ed" ending and the context "Brutus _ Caesar a book" that it occurs in. We could even guess at some of the meaning: ditransitive verbs overwhelmingly tend to be verbs of transfer, and so you probably have a hunch that "florble" means "give".
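As a toy illustration, here's what such a guessing heuristic might look like in Haskell. The categories and frames are made up for the example, not part of any real grammar:

import Data.List (isSuffixOf)

-- A toy guess at the category of an unknown word from its ending and
-- the frame of phrases around it.
data Guess = Ditransitive | Transitive | NoGuess
  deriving Show

guessWord :: String -> [String] -> [String] -> Guess
guessWord word before after
  | "ed" `isSuffixOf` word && before == ["NP"] && after == ["NP","NP"]
      = Ditransitive   -- frame: NP _ NP NP, e.g. "Brutus _ Caesar a book"
  | "ed" `isSuffixOf` word && before == ["NP"] && after == ["NP"]
      = Transitive     -- frame: NP _ NP, e.g. "Brutus _ Caesar"
  | otherwise = NoGuess

-- guessWord "florbled" ["NP"] ["NP","NP"]  ==>  Ditransitive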

It might even be possible to interact with the user to check and refine the error recovery, so that Language Engine can actually learn new grammar and meaning just through being used. Having such a system running in the cloud, interacting with thousands of people simultaneously, would allow it to learn extremely rapidly.

I'll discuss some ideas for how to do this in another blog post. I don't have any solid ones yet, only some spotty attempts and some interesting possibilities. But that's the goal anyway.

If you have comments or questions, get in touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).