Grammars as Data Types

posted by Darryl on 4 July 2015

There's an interesting connection between language and computer science in the form of a correspondence between grammars and inductively defined data types. I hope the correspondence isn't too unforeseen, but it's still worth making as explicit as possible, to see some variations.

This connection makes it easy to define grammars (for use in parsing) in parallel with the semantic interpretation rules. The semantic rules translate the parse trees into a representation of the meaning of the parse tree. For some purposes, the meaning representations will be another simple kind of data, like a symbolic proposition, and for others they'll be values or computations.

For natural language, it's often useful to have symbolic logic formulas as the meanings (so that the meaning of the sentence "everyone danced" is the logical formula "∀x. Danced(x)"), while for programming languages we often use values (so that, for example, the meaning of the expression "1 + 2" is the value "3"). The meaning representations we choose will always be determined by what we want to do with them, however.

I'm going to write this in Haskell, using a bunch of phantom type stuff, but a dependently typed language would be even better suited.

A simple grammar

Let's consider the following grammar for mathematical expressions, which is rather common in these sorts of discussions. The notation I'm using is vaguely BNF, just to have a notational distinction for later.

E ::= N
E ::= E + E
E ::= E - E
N ::= 0
N ::= 1
N ::= 2
...

This grammar classifies some strings in various ways, but more importantly, it can be seen as defining a class of parse trees. This grammar represents a rewrite system, however, where a start symbol or node is successfully rewritten by expanding it into new symbols or attaching subnodes. That's what the ::= means — rewrite the left as the right. But we can turn this around and say, from some things like this, build something like this, like so:

N → E
E + E → E
E - E → E
0 → N
1 → N
2 → N
...

When we're dealing with just symbols, we like to distinguish between terminal and non-terminal symbols, and the distinction is somewhat one of abstraction. The symbols E and N have the property of being classes of different things, they're not just symbols, but they're categories of different sorts of things. This contrasts with the symbols 0, 1, 2, +, -, etc. which are particulars things, not classes of things.

The grammar notation in some sense conflates these two, by making them all symbols. So let's force these apart. First, we'll make sure that our grammar is such that any rule's sub-node part (the RHS of the ::= notation, the LHS of the notation) consists of either exact one terminal symbol, or a sequence of non-terminal symbols. The grammar given above doesn't have this property because of the + and - in rules 2 and 3, but we can introduce a new symbol, and a new rule, to make it have this property, like so:

N → E
E Q E → E
0 → N
1 → N
2 → N
...
+ → Q
- → Q

This just guarantees that the next move we're going to make is possible. That move, now, is to replace these terminal symbol rules with a distinct notation. Instead of writing , we'll simply write ::, to indicate not that the symbol is gotten by rewriting some other symbol, but that the symbol is classified by the other symbol:

N → E
E Q E → E
0 :: N
1 :: N
2 :: N
...
+ :: Q
- :: Q

Next, we'll give an explicit symbol for the "sequence of symbols" portion. Since sequences are a kind of pair, we'll use the tuple notation of Haskell, so that a sequence A B C becomes (A,B,C). After all, the sequence A B C acts to classify three items, in order, as an A, a B, and a C, and that's nothing but a 3-tuple.

N → E
(E,Q,E) → E
0 :: N
1 :: N
2 :: N
...
+ :: Q
- :: Q

Finally, we'll give the arrow rules symbols as well like the terminal symbols. Since we're looking to think of these grammars as classifying parse trees, these rule symbols, and the terminal symbols, constitute the ways of building nodes, and the non-terminal symbols are the labels that the nodes have.

NumRule :: N → E
OpRule :: (E,Q,E) → E
0 :: N
1 :: N
2 :: N
...
+ :: Q
- :: Q

And what do you know, these look like type signatures for constructors in Haskell! They're not entirely valid constructors, but only because the constructo names are no good. We can rename like so:

NumRule :: N → E
OpRule :: (E,Q,E) → E
Zero :: N
One :: N
Two :: N
...
Plus :: Q
Minus :: Q

And now we've got actual type signatures for constructors of three types, E, N, and Q. We can define a data type in Haskell to get these constructors, like so:

-- without GADTs
data E = NumRule N | OpRule (E,Q,E)
data N = Zero | One | Two | ...
data Q = Plus | Minus

-- with GADTs, to make the signatures more obvious
data E where
  NumRule :: N → E
  OpRule :: (E,Q,E) → E

data N where
  Zero :: N
  One :: N
  Two :: N
  ...

data Q where
  Plus :: Q
  Minus :: Q

The expression "1 + 2" becomes the data OpRule (NumRule One, Plus, NumRule Two).

The correspondence isn't just for this grammar, either. Obviously the above transformation, from grammar to type signatures, will work for any context-free grammar, and will produce the data type for all and only the appropriate parse trees, where terminal symbols and rules correspond to constructors, and non-terminal symbols correspond to types.

A bit harder

One common thing that linguists do is affix subsymbols — features, as they're called — to their grammatical categories, so that we don't just have two kinds of noun — singular nouns and plural nouns — as distinct symbols but instead a single symbol with a number feature that distinguishes them. For example, we might have some grammar rules like this:

cat → N[sg]
cats → N[pl]

Where N[sg] is a structured symbol with a feature sg. This allows us to state things like subject-predicate agreement really easily in the rules, like so:

NP[a] VP[a] → S

Where the a is a variable standing for the number feature. The use of the variable on both the subject noun phrase symbol and the verb phrase symbol means that this rule can only be used to produce a noun phrase and a verb phrase that have the same number feature, whatever it may be.

This of course also lets us propagate features up as well, since a noun phrase should be plural whenever its noun is plural, and so on, and some verbs are specific to particular agreements:

N[a] → NP[a]
meow → V[pl]
V[a] → VP[a]

We can translate this into a Haskell GADT really easily by making the feature a phantom type:

-- the phantom types for the features
data Sg where
data Pl where

-- the types for the grammar proper
data N a where
  Cat :: N Sg
  Cats :: N Pl

data NP a where
  JustANoun :: N a -> NP a

data V a where
  Meow :: V Pl

data VP a where
  JustAVerb :: V a → VP a

data S where
  SubjPredAgr :: (NP a, VP a) → S

And if you now try to make a sentence, i.e. an element of type S, you'll find you can only produce this one:

SubjPredAgr (JustANoun Cats, JustAVerb Meow)

That is to say, the sentence "cats meow", but not "cat meow", even tho there is a perfectly good noun Cat. The phantom types code up the features quite nicely and adequately.

More phantom types!

We can take this phantom type nonsense pretty far, in fact. It's unpleasant right now that our rules are split across multiple data types. Instead, it'd be nice to have a single grammar type that represents the rules. We can do this by making the grammatical categories into phantom types as well! We'll move the constructors into a new indexed type like so:

-- the phantom types
data Sg where
data Pl where
data N a where
data NP a where
data V a where
data VP a where
data S where

-- the grammar, represented by a parse tree type
data Tree a where
  Cat :: Tree (N Sg)
  Cats :: Tree (N Pl)
  JustANoun :: Tree (N a) → Tree (NP a)
  Meow :: Tree (V Pl)
  JustAVerb :: Tree (V a) → Tree (VP a)
  SubjPredAgr :: (Tree (NP a), Tree (VP a)) → Tree S

Even tho we've pushed it all into a single type, now, the previous element is still well typed at this new type. All that's changed is the type level.

Deeper into types

One popular alternative to context-free grammars is the equivalent framework of categorial grammars, where instead of lots of rules that define constructors, we instead have some very simple rules, but with more complex types. Typically, this means we have function types of some sort, with a function application rule. Let's look again at our Haskell definitions for the math expression trees, updated to use the parse tree variant:

data E where
data N where
data Q where

data Tree a where
  NumRule :: Tree N → Tree E
  OpRule :: (Tree E, Tree Q, Tree E) → Tree E
  Zero :: Tree N
  One :: Tree N
  Two :: Tree N
  ...
  Plus :: Tree Q
  Minus :: Tree Q

You'll notice that we have this type of infix operators, Q, which is used in the OpRule signature to represent "operator application". But operators are a sort of function, just one that happens to appear in between two arguments. Categorial grammars give us two function types — arg on the left and arg on the right functions — and with this, we can say instead that operators are functions that take an argument to the left and an argument to the right.

To do this, we'll get rid of the Q type, and add two new types for each kind of function type:

-- the old phantom types
data E where
data N where

-- the new phantom for functions with the arg to the left
data a :\: b where

-- the new phantom for functions with the arg to the right
data a :/: b where

We'll now remove OpRule, and replace it with rules for forward and backward function application. We'll also give a new type to the operators Plus and Minus

data Tree a where
  ArgFun :: (Tree a, Tree (a :\: b)) → Tree b
  FunArg :: (Tree (b :/: a), Tree a) → Tree b
  NumRule :: Tree N → Tree E
  Zero :: Tree N
  One :: Tree N
  Two :: Tree N
  ...
  Plus :: Tree (N :\: (E :/: N))
  Minus :: Tree (N :\: (E :/: N))

In this new system, the expression "1 + 2" is now represented by the data
FunArg (ArgFun (NumRule One) Plus) (NumRule Two). The operator Plus first combines with the number to its left using ArgFun, and then the result combines with the number to its right using FunArg.

Better versions still

One observation you might have about the above code is that the things which constitute grammatical categories and grammatical features are themselves just types and they're scattered around, and the grammar rules work for any types, not just ones that represent grammatical categories and features. If we use the Haskell data kind and kind signature extensions, however, we can go further, by having a type of grammatical categories, a type of grammatical features, and by explicitly indexing the trees by data. Here's the little fragment of English:

-- honest to goodness grammatical number features
data Num where
  Sg :: Num
  Pl :: Num

-- honest to goodness grammatical categories
data Cat where
  N :: Num → Cat
  NP :: Num → Cat
  V :: Num → Cat
  VP :: Num → Cat
  S :: Cat

-- the grammar, represented by a parse tree type
data Tree :: Cat → * where
  Cat :: Tree (N Sg)
  Cats :: Tree (N Pl)
  JustANoun :: Tree (N a) → Tree (NP a)
  Meow :: Tree (V Pl)
  JustAVerb :: Tree (V a) → Tree (VP a)
  SubjPredAgr :: (Tree (NP a), Tree (VP a)) → Tree S

And here's the better version of the math expression trees:

data Cat where
  E :: Cat
  N :: Cat
  (:\:) :: Cat → Cat → Cat
  (:/:) :: Cat → Cat → Cat

data Tree :: Cat → * where
  ArgFun :: (Tree a, Tree (a :\: b)) → Tree b
  FunArg :: (Tree (b :/: a), Tree a) → Tree b
  NumRule :: Tree N → Tree E
  Zero :: Tree N
  One :: Tree N
  Two :: Tree N
  ...
  Plus :: Tree (N :\: (E :/: N))
  Minus :: Tree (N :\: (E :/: N))

Conveniently, the grammar doesn't change at all. The only change is to the categories and features.

Another modification we can make is to extract all of the rules that correspond to terminal nodes/words into their own data type like so:

data Num where
  Sg :: Num
  Pl :: Num

data Cat where
  N :: Num → Cat
  NP :: Num → Cat
  V :: Num → Cat
  VP :: Num → Cat
  S :: Cat

-- just the grammar, now
data Tree :: Cat → * where
  AWord :: Word a → Tree a
  JustANoun :: Tree (N a) → Tree (NP a)
  JustAVerb :: Tree (V a) → Tree (VP a)
  SubjPredAgr :: (Tree (NP a), Tree (VP a)) → Tree S

-- a type for the lexicon
data Word :: Cat → * where
  Cat :: Word (N Sg)
  Cats :: Word (N Pl)
  Meow :: Word (V Pl)
data Cat where
  E :: Cat
  N :: Cat
  (:\:) :: Cat → Cat → Cat
  (:/:) :: Cat → Cat → Cat

data Tree :: Cat → * where
  Word :: Word a → Tree a
  ArgFun :: (Tree a, Tree (a :\: b)) → Tree b
  FunArg :: (Tree (b :/: a), Tree a) → Tree b
  NumRule :: Tree N → Tree E

data Word :: Cat → * where
  Zero :: Word N
  One :: Word N
  Two :: Word N
  ...
  Plus :: Word (N :\: (E :/: N))
  Minus :: Word (N :\: (E :/: N))

Separating out the lexicon from the grammar has its advantages, especially for natural language, because we're more likely to change the lexicon by adding new words than we are to change the grammar, improving modularity.

If we want to improve modularity even further, we can use data families, instead of just data types, to allow all of these types to be extended in separate modules pertaining to different domains of interest (such as a module for the sub-language of place names, a module for the sub-language of numerical expressions, etc).

Semantics?

This indexing approach also works really well for adding a semantics to our system in a way that ensures everything makes sense. Let's look at the math example of this, but the English version works just as well (tho with fancier stuff going on).

What we'll do is define a type family that maps grammatical categories to Haskell types, like so:

type family Interp a

type instance Interp E = Int
type instance Interp N = Int
type instance (a :\: b) = Interp a → Interp b
type instance (b :/: a) = Interp a → Interp b

And now what we'll do is define an evaluation function that maps parse trees to their corresponding Haskell values, ensuring that the types make sense, and a similar function for words:

evaluate :: Tree a → Interp a
evaluate (AWord w) = lex w
evaluate (ArgFun x f) = evaluate f (evaluate x)
evaluate (FunArg f x) = evaluate f (evaluate x)
evaluate (NumRule t) = evaluate t

lex :: Word a → Interp a
lex Zero = 0
lex One = 1
lex Two = 2
...
lex Plus = (+)
lex Minus = (-)

If you're familiar with deeply embedded domain specific languages in Haskell, you'll find all of this familiar, especially the mathematical expression one. The applications to natural language are the same, but with a different DSL that you're embedding.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).