Abstract Chart Parsing

posted by Darryl on 17 July 2015

About a year ago I posted a tutorial on Github about the chart parser I'm using for Language Engine's linguistic parsing. In the intervening year, I've rethought parts of it, and come up with a new, more general, abstract approach to chart parsing that I'm going to call Abstract Chart Parsing, because it abstracts over the details of the chart and the rules in very useful ways. An alternative name, owing to its design, might be State Machine Chart Parsing.

To start, I want to discuss one major drawback to the approach in that tutorial. If we look at the definition of the Rule type, it goes as follows:

data Rule a
  = Rule
    { arity :: Int
    , matcher :: [ChartItem a] -> Maybe (ChartItem a)

This rule structure codes up rules of the form X → Y Z ... where the arity is the number of symbols on the right hand side, and the matcher function is a test that either fails, or if it succeeds, returns the new node (i.e. the left hand side).

This kind of approach has the benefit of going beyond just bare symbols, so that the information in the nodes can be rich, and might even involve unification, etc. but it has a major drawback owing to the use of an arity, which is that there's no way for this kind of rule to short circuit its application. That is to say, the implementation will always pull off arity many sequential edges in the chart before giving them to the rule.

If the rule fails because the very first edge fails, well, you won't find out until all the edges have been read into the rule's matcher function. If the chart has a lot of edges, this could mean a combinatorial explosion, only to fail on all of them. It'd be good to be able to feed edges to the rule one at a time and let it decide if and how to continue.

In addition to this implementation error, I had been operating with a conceptual limitation. It might not have come across in the tutorial or the code, tho I expect most people who read it will also arrive at the same construal. That limitation was: I was viewing the chart type as really and truly representing the chart. Now, in some sense this is exactly what it does — the chart in a chart parser can indeed be seen as a directed acyclic graph and the type Chart is exactly the type you'd normally use to represent DAGs in Haskell.

However, this is actually not the most general view of what Chart is doing as a type. The reason is that normally, charts in chart parsing are linearly structured objects — there's a left-right orientation to them because parsing has linear left-right structure. But it turns out that there are applications where you actually want a set, not a list, such as this paper by Martin Kay called Chart Generation. In this paper, Kay uses a chart parser with a set chart to parse from semantic representations into syntax trees that can be flattened into source text.

So what does this have to do with the Chart type I used? Well, it turns out that there's no actual reason to view it as representing linear structure at all. All it represents is linear (non-deterministic) access to data, that is, accessing elements in sequence. This doesn't mean the elements themselves are ordered, just that your access to them is. And this changes the perspective entirely, because an efficient implementation of Kay's idea essentially relies on controlling access to elements of the set-structured chart in a particular way.

So the new view of chart parsing is as a pair of an emitter state machine, which emits symbols according to some internal logic that you don't see (corresponding to the chart), and a consumer state machine which consumes symbols and possibly yields a symbol (corresponding to the rules of the grammar). Parsing involves plugging emitters into consumers and letting them run until they all halt, either through failure to emit, failure to consume, or finish consuming.

The DAG structure of the Chart type corresponds precisely to the state transitions and symbol emission of a non-deterministic automaton, and we can take advantage of Haskell's laziness to allow us to build up the structure of such a state machine incrementally, and only as needed.

So let's look at the Haskell code. It's not too long — 23 significant lines of code, if you ignore empty lines for readability, which is a nice little digestible unit. First up, emitters:

newtype Emitter a = Emitter [(a, Emitter a)]

An Emitter a is just some list of as paired with emitters that come after them. So an emitter is a node in a state machine, viewed as the sum total of its outward edges and all their future behavior.

For convenience, it's useful to be able to take a list of emitters and fuse them together into one emitter that behaves like the disjunction of them. That is to say, we take some set of nodes and make a new node that non-deterministically acts like all of the nodes.

combineEmitters :: [Emitter a] -> Emitter a
combineEmitters []
  = Emitter []
combineEmitters (Emitter xs : es)
  = let Emitter ys = combineEmitters es
    in Emitter (xs ++ ys)

We also will want to have a way of asking an emitter for all of its outward edges that halt immediately after transitioning. In a linear chart, like for natural language and programming language syntax, these edges span the whole of the input and constitute successful parses.

spans :: Emitter a -> [a]
spans (Emitter xs) = [ x | (x, Emitter []) <- xs ]

Now we move on to consumers. A Consumer a is a state machine that reads in characters one at a time, and can stop by either failing (when a symbol, or sequence of symbols, is not accepted by the machine) or succeeding, whereupon it emits a single symbol.

data Consumer a = Fail | Succeed a | Consume (a -> Consumer a)

The failure case allows rules-as-consumers to fail as soon as enough input has been read, thus addressing the earlier concern about combinatorial explosions for no good reason. The success case is the returned symbol that will eventually get put back onto the chart.

Next, we can plug an emitter and a consumer together and running them until either the consumer stops, or the emitter stops. This process is non-deterministic, because the emitter is non-deterministic, and so yields any number of symbols, together with the remainder of the emitter yet to be consumed.

step :: Emitter a -> Consumer a -> [(a, Emitter a)]
step e            (Succeed x) = return (x,e)
step e            Fail        = []
step (Emitter xs) (Consume f) = do (x,e) <- xs
                                   step e (f x)

At this point, we have a way of taking a chart-as-emitter, and running a rule-as-consume on it, to see if the rule applies, and if so, to get back the symbol to label a new chart edge with. Our next step is to plug a whole collection of consumers into a single chart, which is to say, to take a whole grammar and apply it once to see if anything in the chart as it is at the moment can be used to build new nodes in a parse tree.

next :: [Consumer a] -> Emitter a -> [Emitter a]
next cs e = [ Emitter [(x,e')] | (x,e') <- concat (map (step e) cs) ]

If we stack this process on itself, so that the new edges of one moment become the inputs of the next, until no more new edges are added, we can saturate any given chart, so that all possible new edges are added.

saturate :: [Consumer a] -> Emitter a -> Emitter a
saturate cs e = combineEmitters (e : map (saturate cs) (next cs e))

At this point, everything that could be added to the chart has been. We now have a new chart that has as full of edges as it will ever be.

Next, we need to be able to extend the chart. Any given chart represents only some initial portion of the input to be parsed (possibly a complete input, i.e. the maximal initial portion, but also anything shorter), so we need to be able to read in more of the input incrementally. We do that for a single symbol by making a new chart that emits only that symbol, and then transitions to the given chart. Of course, we want the new chart to have all the possible new edges we can add, so we saturate.

addSymbol :: [Consumer a] -> Emitter a -> a -> Emitter a
addSymbol cs e x = saturate cs (Emitter [(x,e)])

We can add multiple symbols one at a time just by taking a list of symbols and folding over it.

addSymbols :: [Consumer a] -> Emitter a -> [a] -> Emitter a
addSymbols cs = foldl' (addSymbol cs)

And finally, recognition happens just by adding all our input symbols to the empty chart, and asking for the spanning edges of the resulting chart.

recognize :: [Consumer a] -> [a] -> [a]
recognize cs xs = spans (addSymbols cs (Emitter []) xs)

A number of convenient tools can be made for dealing with simple context free grammars and with parse trees. As it stands, it looks like this is just a recognizer/acceptor — either it succeeds with some symbols that span the chart or it fails with none, it doesn't build a parse tree — but because it's polymorphic over the type of the symbols, we can actually label the edges of the chart with parse trees, and use our consumers/rules to build them appropriately. So we have dual purpose code.

Actually it's multi-purpose, because the new tree can be arbitrarily transformed from the inputs, and really the symbols can be anything at all, including computed values (so that this could be used to evaluate an expression as it parses).

It should be noted that rules-as-consumers are backwards, so that the first symbol it reads is the rightmost symbol. So when you're using these, you have to reverse them in your head. Or start reading production rules like Hebrew, from right to left.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).