Precedence with Haskell

posted by Darryl on 17 July 2015

I want to summarize an approach to precedence (for pretty printing and parsing) that I've recently come up with in part of a series of blog posts I'm writing. It's useful enough, I think, to separate out into it's own post, because precedence is such a constant thorn in people's sides.

The approach I'm going to outline here is, as far as I know, novel. It has normal precedence numbers as a special case, but as we'll see, it turns out to be more general. We'll take as our starting point the following grammar in fake-BNF, and an associated data type, for mathematical expressions:

e ::= v | e + e | e - e | e * e
v ::= 0 | 1 | ...
data Exp = Lit Integer | Plus Exp Exp | Minus Exp Exp | Times Exp Exp

There is a common way of expressing when parentheses are required, which is to simply give examples of when they can be omitted. For example, we might say that X + Y + Z means X + (Y + Z) for any choice of X, Y, and Z. What's going on here, as I see it, is that we're implicitly talking about syntactic locations — in this case, the position of the right argument to + — and saying that some expressions can be deparenthesized in certain locations.

Let's state explicitly what the parenthesizations are for our little language. We'll do it in the form of a table, just to keep it all tidy. The header row will have the general shape of an expression of interest, with an underscore for the location we're considering. The rows will correspond to each kind of expression, and we'll put a ✓ in a cell if the row's expression can appear in the column's location without parens.

_ + e e + _ _ - e e - _ _ * e e * _
v
e + e
e - e
e * e

This expresses the idea that + is right associative (X + Y + Z = X + (Y + Z)), that - is left associative (X - Y - Z = (X - Y) - Z) and that it + and - interact so that X - Y + Z = (X - Y) + Z and also X + Y - Z = (X + Y) - Z. It also expresses that * is left associative (X * Y * Z = (X * Y) * Z) and that it binds tighter than + and - as usual.

The associativity of + is chosen just to make this example interesting later on. If by + we mean addition, then the choice is irrelevant, but in general the choice matters, so I choose right associativity to make for an interesting example.

If we abstract away from operators a little bit, we might instead phrase the idea more generally as, for example, X + Y means X + (Y) whenever Y is in the set of expressions that can omit parens as the right arg of +. This is almost tautological, until we imagine that these facts are coded as data.

What we'll do now, to construct a pretty printer along these lines, is as follows: firstly, we'll define a data type of syntactic locations, including one location called Root, which represents when the location inside no other locations (i.e. at the root of an abstract syntax tree). Secondly, we'll define a function for pretty printing that is parameterized by the position of the term being pretty printed. As a first pass, I'm going to steal the general structure from Andrej Bauer's plzoo pretty printer, but we're going to aim at building something a little cleaner.

The code

So here's our data type of locations:

data Loc = Root
         | PlusL  | PlusR
         | MinusL | MinusR
         | TimesL | TimesR
  deriving (Eq)

Nothing magical about that, hopefully. But now we're going to replicate Bauer's pretty printing function, or at least the part he called to_str. We will not locations from the case expression tho, by analogy to Bauer's. Instead we will return lists of locations that represent where the given expression may be de-parenthesized. To make it clean, we'll have two functions, prettyLoc, which corresponds to Bauer's to_str, and prettyAux which is just a useful abstraction for keeping the code clean.

prettyAux :: Exp -> ([Loc],String)
prettyAux (Lit n)
  = ( [Root,PlusL,PlusR,MinusL,MinusR,TimesL,TimesR]
    , show n
    )
prettyAux (Plus l r)
  = ( [Root,PlusR,MinusL]
    , prettyLoc PlusL l ++ " + " ++ prettyLoc PlusR r
    )
prettyAux (Minus l r)
  = ( [Root,PlusL,MinusL]
    , prettyLoc MinusL l ++ " - " ++ prettyLoc MinusR r
    )
prettyAux (Times l r)
  = ( [Root,PlusL,PlusR,MinusL,MinusR,TimesL]
    , prettyLoc TimesL l ++ " * " ++ prettyLoc TimesR r
    )

prettyLoc :: Loc -> Exp -> String
prettyLoc loc e = let (locs, str) = prettyAux e
                   in if loc `elem` locs
                      then str
                      else "(" ++ str ++ ")"

The prettyAux function simply calls prettyLoc on the subexpressions with their locations. For instance, the Plus l r case calls prettyLoc PlusL l because the subexpression l is in the PlusL location, and so forth. It also returns, for each constructor, the list of locations where its parens can be omitted. This list is exactly the columns in the table from earlier that have ✓s in them. So you'll notice that the Minus l r case returns [Root,PlusL,MinusR] and the table earlier had ✓'s for the e - e row in the columns _ + e and e - _. The Root location is included for all of these of course. We can abstract this out later using Maybe, but for now I put it in because it reads better.

Our prettyLoc function just does one simple thing: it calls prettyAux and then asks if the input location (i.e. the location in the whole expression that prettyLoc was called from) is among the locations that are given as places the expression can omit parens. If it is, then we simply return the string from prettyAux, but if it's not, then we add parens.

To pretty print in general, we simply define

prettyPrint :: Exp -> String
prettyPrint = prettyLoc Root

As a type class

At this point, it makes sense to abstract out as much as we can into a type class:

class Eq l => Pretty a l | a -> l where
  prettyAux :: a -> ([l], String)

prettyLoc :: Pretty a l => Maybe l -> a -> String
prettyLoc Nothing e
  = snd (prettyAux e)
prettyLoc (Just loc) e
  = let (locs, str) = prettyAux e
    in if loc `elem` locs
       then str
       else "(" ++ str ++ ")"

prettyPrint :: Pretty a l => a -> String
prettyPrint = prettyLoc Nothing

Here we'll pretend that Loc above didn't have Root — notice we're using Maybe for prettyLoc, so that the root node is represented by Nothing for all possibly location types.

The instance for Exp is

instance Pretty Exp Loc where
  prettyAux (Lit n)
    = ( [PlusL,PlusR,MinusL,MinusR,TimesL,TimesR]
      , show n
      )
  prettyAux (Plus l r)
    = ( [PlusR,MinusL]
      , prettyLoc (Just PlusL) l ++ " + " ++ prettyLoc (Just PlusR) r
      )
  prettyAux (Minus l r)
    = ( [PlusL,MinusL]
      , prettyLoc (Just MinusL) l ++ " - " ++ prettyLoc (Just MinusR) r
      )
  prettyAux (Times l r)
    = ( [PlusL,PlusR,MinusL,MinusR,TimesL]
      , prettyLoc (Just TimesL) l ++ " * " ++ prettyLoc (Just TimesR) r
      )

Possible further generalizations

I'm inclined to say that we can go further still. Obviously we can abstract out the list of locations into its own type class. But the more interested part is the recursive use of prettyLoc. There's some kind of algebra-like thing lurking here. If we had a type ExpF like so:

data ExpF r = Lit Int | Plus r r | Minus r r | Times r r

Then the recursive cases are just the obvious fmap, only not quite because the function that's getting mapped is parameterized by the position of the recursive use. Perhaps there's some generalization of functor algebras that can be made here, as this isn't a new thing — mapping with an index or key is commonplace.

Even without being able to do this, however, it seems plausible that this could be turned into a derivable (collection of) type classes.

Precedence numbers

I mentioned in the intro that this was a generalization of precedence numbering, but it's not immediately obvious how. First, let's consider why some special cases of this can be coded as numbers. You'll notice that the lists of locations that prettyAux returns have some overlap. These lists represent sets of location names, so of course there is often going to be some overlap, as there are 2|Loc| subsets we could've chosen from.

Occasionally, we will pick location lists that are totally ordered by the subset relation, so that for any two expressions X and Y, the locations of X are a subset of the locations of Y, or vice versa, or the same, but never are they ordered relative to one another. In exactly these situations we can assign numbers to each unique subset, and recover the precedence numbering that Bauer gives in his plzoo example.

However, not all parenthesization rules are total orderings. In fact, the example one given above is not, and there's no way to assign it a precedence numbering. This is because in some contexts, e - e can be deparenthesized while e + e cannot (namely, in the location _ + e), while in some other contexts, e + e can be deparenthesized while e - e cannot (namely, in the location e + _). Sometimes + should have higher precedence and sometimes - should, and it depends on the context of use.

So using explicit location names, rather than precedence numbers, is a strictly more expressive variant. It's not clear to me whether precedence numbers + associativity declarations (like Haskell's fixity) is equivalent to this technique, but I suspect not, because that only applies to infix operators (tho these two techniques might be equivalent in that domain).

Non-ambiguity condition

Because this whole thing is intended to explain when it's possible to omit parentheses, the choices above are all subject to the condition that they don't produce ambiguous results. We can test this by looking at all of the cells in the above table that have ✓s and inserting the expressions into the locations. If any two produce the same string, the rules are ambiguous and won't be usable.

If we do this with the above table, we get the strings

v + e
e + v
v - e
e - v
v * e
e * v
e + e + e
e + e - e
e - e - e
e * e + e
e + e * e
e * e - e
e - e * e
e * e * e

which you'll notice are all distinct. Therefore the above omission rules are usable, and there are no ambiguities.

Parsing

I also mentioned earlier that this technique is useful for parsing. The separation of the syntax into expressions and locations can be directly translated into parsers as well. We'll have two major classes of parsers for the above language — expression-specific and location specific. That is to say, each kind of expression will get its own parser (a literal parser, a plus parser, a minus parser, a times parser), and each location will also get its own parser.

To handle optional parenthesization, the location parsers will list all of the expressions that can appear without parens, along with a parens parser to catch the rest.

Here's the full parser for the above language:

symbol_ :: String -> GenParser Char st ()
symbol_ s = do _ <- string s <* spaces
               return ()

tryOrElse :: [GenParser a st b] -> GenParser a st b -> GenParser a st b
tryOrElse e []     = e
tryOrElse e (p:ps) = try p <|> tryOrElse e ps

op :: (Exp -> Exp -> Exp)
   -> String
   -> GenParser Char st Exp
   -> GenParser Char st Exp
   -> GenParser Char st Exp
op f s lp rp = do l <- lp
                  symbol_ s
                  r <- rp
                  return (f l r)

-- each kind of expression
lit   = Lit . read <$> many1 digit
plus  = op Plus "+" plusL plusR
minus = op Minus "-" minusL minusR
times = op Times "*" timesL timesR
parenthesized = between (symbol_ "(") (symbol_ ")") root

-- and each kind of location
plusL  = tryOrElse lit [parenthesized,minus]
plusR  = tryOrElse lit [parenthesized,plus]
minusL = tryOrElse lit [parenthesized,plus,minus,times]
minusR = tryOrElse lit [parenthesized,times]
timesL = tryOrElse lit [parenthesized,times]
timesR = tryOrElse lit [parenthesized]
root   = tryOrElse lit [parenthesized,plus,minus,times]

Actually, we don't want to use this parser as is, because it's left-recursive and will go into a loop, but never mind that for now.

You can see from this that the parser is cleanly segmented into parts that correspond exactly to the various kinds of expressions and locations. The arguments to the numerous tryOrElses are exactly the appropriate expression types for those locations, with parenthesized thrown in: anything can appear there parenthesized, plus also some other things. Note that the alternatives are always given in order of specificity — if we put lit earlier than times, for instance, we might never reach times because lit can succeed and Parsec doesn't backtrack through alternatives. So we just order them correctly to avoid this.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).