Pattern Matching

posted by Darryl on 21 September 2015

Pattern matching is absolutely central to modern functional programming languages, and in the type-theoretically informed ones, it forms the basis of all usage of user-defined data. Inspecting some piece of data, extracting part of it, and so on are all done by pattern matching. Since it's so central, and so all-permeating, it's a good idea to make it as efficient as possible.

A non-optimized pattern matching implementation that just pairs a pattern with a clause, taking the case expression at face value, will be inefficient because it will often do a lot of redundant work. Consider, for instance, this Haskell expression:

case foo of
  (True,True)  -> bar
  (True,False) -> baz
  (False,y)    -> quux

If we implement this by matching the foo value against each pattern in turn, we end up looking at the outermost constructor of foo three separate times, each time to see if it's the (,) constructor for pairs. But once we've inspected it and know that it's (,) the first time, we never need to look at it again; it's not going to change. Similarly, if we inspect the first element of foo and find that it's True, the non-optimized matcher will look at each clause independently, even though once we find it to be True, we can rule out the third clause entirely.

One option for optimizing these is to remove all the nesting of pattern matches, so that we only ever inspect the outermost constructor. Some languages, including Haskell, if I'm not mistaken, take this approach. It's a code-to-code transformation rather than a deep implementation strategy. So for example, with this expression, we can transform it to the equivalent

case foo of
  (x,y) -> case x of
    True -> case y of
      True -> bar
      False -> baz
    False -> quux

And now if we use the unoptimized matching algorithm, the resulting code is efficient. Some nuance arises when we try to actually define the transformation, because of expressions like this:

case foo of
  (True,True)  -> bar
  (False,True) -> baz
  (x,y)        -> quux
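
To see where the nuance comes from, here's a sketch of what the transformation has to produce for an expression like this (matchFoo is a hypothetical name, and bar, baz, and quux are stand-in String values, just for illustration): the catch-all clause quux has to be duplicated into every branch that the pattern (x,y) can still reach.

```haskell
-- Stand-ins for the original clause bodies:
bar, baz, quux :: String
bar  = "bar"
baz  = "baz"
quux = "quux"

-- The nested form of the case expression above. Note that quux
-- appears twice: the transformation duplicates the catch-all clause.
matchFoo :: (Bool, Bool) -> String
matchFoo foo =
  case foo of
    (x, y) -> case x of
      True -> case y of
        True  -> bar
        False -> quux
      False -> case y of
        True  -> baz
        False -> quux
```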

However, there's another way to implement efficient pattern matching that comes at it from the perspective of treating efficiency as a core component, rather than a code optimization decision. I don't know if this is a better approach, but it's at least an interesting one, so I figured I'd explain it here.

A hint at a solution?

When I was first thinking about this, I got the idea to play around with defining some kind of "fused" pattern that could represent multiple patterns simultaneously. For instance, if we have the natural number patterns Suc Zero (for 1) and Suc (Suc Zero) (for 2), we could somehow splice them together so that the outer Suc constructor has both Zero and Suc Zero as "alternative" daughters. Something vaguely like Suc (Zero `or` Suc Zero), which would match 1 and also 2 (so it means "either 1 or 2"). Or we might have something like Suc (Zero `or` Suc (Suc (Zero `or` Suc (Suc Zero)))), which matches 1, 3, and 5, and so on.

We can implement this, and a matcher, pretty easily for natural numbers. Here's an example, without treating data as general trees as we would in an implementation of an actual language:

data Nat
  = Zero
  | Suc Nat

data NatPattern
  = ZeroP
  | SucP NatPattern
  | AnyNatP
  | NatChoiceP NatPattern NatPattern

natMatch :: NatPattern -> Nat -> Bool
natMatch ZeroP Zero
  = True
natMatch (SucP p) (Suc n)
  = natMatch p n
natMatch AnyNatP _
  = True
natMatch (NatChoiceP p p') n
  = natMatch p n || natMatch p' n
natMatch _ _
  = False

This approach seems quite nice, because we can define a function that takes a bunch of patterns with no uses of NatChoiceP and combines them into a single minimal pattern, with NatChoiceP splicing them together appropriately, so that the following equivalences hold:

NatChoiceP ZeroP ZeroP ≅ ZeroP
NatChoiceP (SucP p) (SucP p') ≅ SucP (NatChoiceP p p')
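
Such a combining function is easy to sketch (fuseNat is a hypothetical name, and the definitions from above are repeated so the snippet stands alone): push the choice inward according to those equivalences, falling back to a top-level NatChoiceP when the outer constructors differ.

```haskell
data Nat = Zero | Suc Nat

data NatPattern
  = ZeroP
  | SucP NatPattern
  | AnyNatP
  | NatChoiceP NatPattern NatPattern

natMatch :: NatPattern -> Nat -> Bool
natMatch ZeroP Zero          = True
natMatch (SucP p) (Suc n)    = natMatch p n
natMatch AnyNatP _           = True
natMatch (NatChoiceP p p') n = natMatch p n || natMatch p' n
natMatch _ _                 = False

-- Fuse two choice-free patterns, pushing the choice as deep as possible.
fuseNat :: NatPattern -> NatPattern -> NatPattern
fuseNat ZeroP    ZeroP     = ZeroP
fuseNat (SucP p) (SucP p') = SucP (fuseNat p p')
fuseNat p        p'        = NatChoiceP p p'
```

For instance, fuseNat (SucP ZeroP) (SucP (SucP ZeroP)) yields SucP (NatChoiceP ZeroP (SucP ZeroP)), which matches exactly 1 and 2.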

However, this idea isn't without its fatal flaws. So far, we've only seen what happens with a simple type like Nat, which only has constructors with 0 or 1 argument, but what do we do for a type that has a constructor with multiple arguments? Consider this type:

data BinTree
  = Leaf
  | Branch BinTree BinTree

data BinTreePattern
  = LeafP
  | BranchP BinTreePattern BinTreePattern
  | AnyBinTreeP
  | BinTreeChoiceP BinTreePattern BinTreePattern

binTreeMatch :: BinTreePattern -> BinTree -> Bool
binTreeMatch LeafP Leaf = True
binTreeMatch (BranchP l r) (Branch l' r')
  = binTreeMatch l l' && binTreeMatch r r'
binTreeMatch AnyBinTreeP _
  = True
binTreeMatch (BinTreeChoiceP p p') t
  = binTreeMatch p t || binTreeMatch p' t
binTreeMatch _ _
  = False

When the choice is between two Leafs, or between a Leaf and a Branch, the corresponding pattern is easy enough: it's just BinTreeChoiceP. But what about when it's a choice between two Branches?

Perhaps what we want is to have the following equivalences:

BinTreeChoiceP LeafP LeafP ≅ LeafP
BinTreeChoiceP (BranchP l r) (BranchP l' r')
  ≅ BranchP (BinTreeChoiceP l l') (BinTreeChoiceP r r')

But this is definitely wrong, because the choice of l for the left node always goes with the choice of r for the right node in BinTreeChoiceP (BranchP l r) (BranchP l' r'), and similarly for l' and r'. However, in the "equivalent" BranchP (BinTreeChoiceP l l') (BinTreeChoiceP r r'), the choice of the left and right branches is independent. To use a concrete example:

tree = Branch Leaf (Branch Leaf Leaf)

pattern1 = BinTreeChoiceP
             (BranchP LeafP LeafP)
             (BranchP (BranchP LeafP LeafP) (BranchP LeafP LeafP))

pattern2 = BranchP
             (BinTreeChoiceP LeafP (BranchP LeafP LeafP))
             (BinTreeChoiceP LeafP (BranchP LeafP LeafP))

If we match tree against each pattern, we get binTreeMatch pattern1 tree == False, while binTreeMatch pattern2 tree == True. pattern2 will also match the mirror image of tree, for a total of 4 matching trees, while pattern1 only matches the two trees in its choice.

This really is no surprise. There are many ways to view this that make sense of it, but I think the best way is to realize that multiple arguments produce conjunctive patterns (p has to match and p' has to match), whereas choice produces disjunctive patterns (p has to match or p' has to match), and conjunction and disjunction don't commute like this: (a && b) || (c && d) is not the same as (a || c) && (b || d). In regular expressions we get the same thing: ab|cd is not the same as (a|c)(b|d). There's a deeper reason for all of this, of course (having to do with the non-commutation of products and coproducts), but the intuition should be clear now.
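
A concrete boolean witness of this non-equivalence (the truth values here are just an arbitrary choice that makes the two sides differ):

```haskell
-- With a = True, b = False, c = False, d = True:
lhs, rhs :: Bool
lhs = (True && False) || (False && True)   -- conjunctions first: False
rhs = (True || False) && (False || True)   -- disjunctions first: True
```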

In addition to this problem, there is an even bigger, more fatal problem: we're supposed to be doing pattern matching, not just testing whether some piece of data equals any member of a specified collection. Pattern matching requires that we form some kind of lookup, mapping elements of the specified collection to the associated clause. So these "choice" patterns are analogous to sets (Set Nat, Set BinTree), while what we actually need is something more like a map (Map Nat Clause, Map BinTree Clause). As it is, once we've found a matching choice, we have no way of saying what the associated clause is.

The better solution

If we take a look at the types NatPattern and BinTreePattern, ignoring choice for now, we can notice that any such pattern defines an "upper" portion of data-as-tree. That is to say, patterns are exactly like values, except patterns have an "any" label that indicates anything (of the right type) can go there.

A piece of data matches a pattern when it has an upper portion that's the same as the pattern. Matching involves traversing the pattern and the data at the same time, making the same traversal moves, as determined by the pattern; when the labels don't match and the moves can't be made, the data doesn't match the pattern.

This provides the key insight we need. We can view matching as simultaneous traversal. Given any pattern, and any fixed traversal order, we can convert the pattern to a list of label names, and then traverse the data in the same order, checking each pattern name in turn. So, for instance, the pattern SucP (SucP ZeroP) would instead be given as, say, [SucPL,SucPL,ZeroPL], using pre-order traversal, and similarly, BranchP LeafP (BranchP LeafP LeafP) would instead be [BranchPL,LeafPL,BranchPL,LeafPL,LeafPL].

While a list of constructor names isn't quite the solution yet, we can see how matching would work in this alternative formulation of patterns:

data BinTreePatternCons
  = LeafC
  | BranchC
  | AnyBinTreeC

type BinTreePatternC = [BinTreePatternCons]

binTreeMatchC :: BinTreePatternC -> BinTree -> Bool
binTreeMatchC p t = go p [t]
  where
    go [] [] = True
    go (LeafC:p) (Leaf:ts)
      = go p ts
    go (BranchC:p) (Branch l r:ts)
      = go p (l:r:ts)
    go (AnyBinTreeC:p) (_:ts)
      = go p ts
    go _ _ = False

This uses a typical tail-recursive traversal of a tree on a stack, which is quite easy to implement and which produces a pre-order traversal.
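
Here's the flattened matcher exercised on the example pattern from above (a self-contained sketch repeating the definitions; flatPattern is just a name I've picked for the flattened form of BranchP LeafP (BranchP LeafP LeafP)):

```haskell
data BinTree = Leaf | Branch BinTree BinTree

data BinTreePatternCons = LeafC | BranchC | AnyBinTreeC

type BinTreePatternC = [BinTreePatternCons]

-- Match a flattened pre-order pattern against a tree, using a stack
-- of subtrees still to be inspected.
binTreeMatchC :: BinTreePatternC -> BinTree -> Bool
binTreeMatchC pat tree = go pat [tree]
  where
    go [] []                        = True
    go (LeafC:ps) (Leaf:ts)         = go ps ts
    go (BranchC:ps) (Branch l r:ts) = go ps (l:r:ts)
    go (AnyBinTreeC:ps) (_:ts)      = go ps ts
    go _ _                          = False

-- BranchP LeafP (BranchP LeafP LeafP), flattened in pre-order:
flatPattern :: BinTreePatternC
flatPattern = [BranchC, LeafC, BranchC, LeafC, LeafC]
```

For example, binTreeMatchC flatPattern (Branch Leaf (Branch Leaf Leaf)) is True, while binTreeMatchC [BranchC, LeafC, LeafC] (Branch Leaf (Branch Leaf Leaf)) is False.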

As I said though, this isn't the final way we're going to do this, because we still haven't taken choices into account, nor the fact that clauses have to be paired with them. But we at least now have a sense of the process. To handle choices, we need to be able to say that at any given point, there are several possible constructors to read next, not just one, and for each option we need to give a traversal. So a traversal becomes a kind of map, from constructor names to traversals. But it can't be just a map, because at some point the traversal has to end, so we'll have two kinds of traversals:

import qualified Data.Map as M

data Traversal a
  = Done
  | Choice (M.Map a (Traversal a))

binTreeMatchT :: Traversal BinTreePatternCons -> BinTree -> Bool
binTreeMatchT p t = go p [t]
  where
    go Done [] = True
    go (Choice m) (Leaf:ts)
      = goCon m LeafC ts ts
    go (Choice m) (Branch l r:ts)
      = goCon m BranchC (l:r:ts) ts
    go _ _ = False
    goCon m c ts ts'
      = case M.lookup c m of
          Just p | go p ts
            -> True
          _ -> case M.lookup AnyBinTreeC m of
            Just p  -> go p ts'
            Nothing -> False

The auxiliary function go is relatively straightforward: if the pattern traversal wants to end, and there are no more trees to decompose, we've succeeded. On the other hand, if we have a choice, and we can proceed by trying to decompose a tree with a specified constructor, we call the goCon auxiliary function to handle that decomposition. Otherwise, we fail.

The auxiliary function goCon is somewhat subtle, however. Because we have variable patterns like AnyBinTreeC, we might find that the pattern doesn't have an option for the specified constructor but does have a catch-all option. So we need to provide goCon with the stack of trees for both eventualities. For instance, if a tree is Branch l r, we need to proceed with a stack that has l and r on it if the pattern inspects the branch node, but if it can only match with AnyBinTreeC, we don't put l and r on the stack.

Furthermore, in the case that the constructor does show up in the pattern, that doesn't guarantee that the associated traversal will succeed. If it doesn't, the pattern match as a whole can still succeed because AnyBinTreeC can be tried as an alternative. For example, consider this case expression from earlier:

case foo of
  (True,True)  -> bar
  (False,True) -> baz
  (x,y)        -> quux

In the first and second cases, the traversals for their patterns can both fail even after a partial match is made. Suppose, for instance, that foo = (True,False). If we go through the traversal of (True,True), we'll succeed in matching the pair constructor and then the first True, and then fail at the second True. There is no path through the traversal that goes pair-true-false or even pair-true-any. However, there is a path that goes pair-any-any, namely, the third pattern (x,y). The traversal for this case expression's patterns would look like this, in tree form:

(,)
|-- True
|   |-- True
|-- False
|   |-- True
|-- x
    |-- y

This is why, in the definition of goCon, even if we successfully find a traversal choice that matches the input data's constructor, we have the guard in Just p | go p ts that requires the whole rest of that traversal to succeed before we consider that choice successful.

If we now try out this matching function with a traversal, it works exactly the same as the list version. This is to be expected: when each choice node has exactly one option, the traversal is isomorphic to a list. But with 0 or 2+ alternatives, we get a genuinely branching tree of choices.

So we've solved the problem of choice in the patterns from earlier, all that remains is to figure out how to pair up the appropriate clauses with their patterns. This is very simple, however. We can augment the Done constructor to carry a Clause, and then augment the binTreeMatchT function to return Maybe Clause instead of a Bool. The change is almost too trivial to bother showing:

data Traversal a
  = Done Clause
  | Choice (M.Map a (Traversal a))
  deriving (Show)

binTreeMatchT :: Traversal BinTreePatternCons -> BinTree -> Maybe Clause
binTreeMatchT p t = go p [t]
  where
    go (Done c) [] = Just c
    go (Choice m) (Leaf:ts)
      = goCon m LeafC ts ts
    go (Choice m) (Branch l r:ts)
      = goCon m BranchC (l:r:ts) ts
    go _ _ = Nothing
    goCon m c ts ts'
      = case M.lookup c m of
          Just p | Just c' <- go p ts
            -> Just c'
          _ -> case M.lookup AnyBinTreeC m of
            Just p  -> go p ts'
            Nothing -> Nothing

Lastly, it would be useful to have a way to combine patterns-as-traversals, so that we can take a list of traversals (one for each pattern+clause) and turn them into the appropriate total traversal by combining them one at a time. This is pretty easy:

combine :: Ord a => Traversal a -> Traversal a -> Traversal a
combine (Done c)   (Done _)    = Done c  -- the earlier clause takes precedence
combine (Choice m) (Choice m') = Choice (M.unionWith combine m m')
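
Putting the pieces together, here is a self-contained sketch. The helper toTraversal is a hypothetical name (not part of the post) that turns one flattened pattern plus its clause into a single-path traversal, and Clause is just a stand-in String; folding combine over the per-clause traversals gives the total traversal to match against.

```haskell
import qualified Data.Map as M

type Clause = String  -- stand-in for real clause bodies

data BinTree = Leaf | Branch BinTree BinTree

data BinTreePatternCons = LeafC | BranchC | AnyBinTreeC
  deriving (Eq, Ord)

data Traversal a
  = Done Clause
  | Choice (M.Map a (Traversal a))

-- Hypothetical helper: one pre-order pattern + its clause becomes a
-- traversal with exactly one path.
toTraversal :: Ord a => [a] -> Clause -> Traversal a
toTraversal []     c = Done c
toTraversal (x:xs) c = Choice (M.singleton x (toTraversal xs c))

combine :: Ord a => Traversal a -> Traversal a -> Traversal a
combine (Done c)   (Done _)    = Done c  -- earlier clause wins
combine (Choice m) (Choice m') = Choice (M.unionWith combine m m')

binTreeMatchT :: Traversal BinTreePatternCons -> BinTree -> Maybe Clause
binTreeMatchT pat tree = go pat [tree]
  where
    go (Done c)   []              = Just c
    go (Choice m) (Leaf:ts)       = goCon m LeafC ts ts
    go (Choice m) (Branch l r:ts) = goCon m BranchC (l:r:ts) ts
    go _ _                        = Nothing
    goCon m c ts ts' =
      case M.lookup c m of
        Just p | Just r <- go p ts -> Just r
        _ -> case M.lookup AnyBinTreeC m of
               Just p  -> go p ts'
               Nothing -> Nothing
```

For example, foldr1 combine [toTraversal [BranchC, LeafC, LeafC] "bar", toTraversal [AnyBinTreeC] "quux"] matches Branch Leaf Leaf to Just "bar" and every other tree to Just "quux" via the catch-all.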


Now, obviously a real functional programming language won't be using this BinTree type for data; it'll use something more like RoseTree String, where RoseTree is defined as

data RoseTree a = RoseTree a [RoseTree a]

But the matching function for this is basically the same, if not simpler. Implementation is left as an exercise to the reader.

Outside of the domain of implementing a functional programming language, this way of viewing tree structured data has some interesting manifestations. We can step back from the pattern matching and look instead at just data in general encoded this way, and it turns out that in Haskell, we can define a data family that can encode various types. I call it Lin (because this represents a way to linearize or serialize data), and it can come with an encoding and decoding operation:

data family Lin a r

class Linearizable a where
  linearize :: a -> r -> Lin a r
  delinearize :: Lin a r -> (a,r)

The a parameter represents the data type being encoded, and the r parameter represents the "rest" of the data, whatever shows up after the a stuff. On simple data, the definition of Lin is easy, we just wrap up an r in constructors that correspond to the simple data type. For example, on booleans:

data instance Lin Bool r where
  LinBoolTrue :: r -> Lin Bool r
  LinBoolFalse :: r -> Lin Bool r

instance Linearizable Bool where
  linearize True r = LinBoolTrue r
  linearize False r = LinBoolFalse r
  delinearize (LinBoolTrue r) = (True,r)
  delinearize (LinBoolFalse r) = (False,r)

For a simple recursive type like Nat the encoding is also simple, except for the recursive constructors, we just have a recursive argument. The r argument is only present for non-recursive constructors.

data Nat = Zero | Suc Nat

data instance Lin Nat r where
  LinNatZero :: r -> Lin Nat r
  LinNatSuc :: Lin Nat r -> Lin Nat r

instance Linearizable Nat where
  linearize Zero r = LinNatZero r
  linearize (Suc n) r = LinNatSuc (linearize n r)
  delinearize (LinNatZero r) = (Zero,r)
  delinearize (LinNatSuc nr) = let (n,r) = delinearize nr
                               in (Suc n, r)

But the interesting stuff happens with multiple arguments. Like our traversals, we end up representing compound data in a very weird way, where something that is "after" something else is "inside" that something else. The definition of Lin has constructors corresponding to the encoded data type, but because "after" means "inside", when we have multiple arguments, we put the "later" arguments in the r position of the "earlier" arguments. For instance, pairs:

data instance Lin (a,b) r where
  LinPair :: Lin a (Lin b r) -> Lin (a,b) r

instance (Linearizable a, Linearizable b) => Linearizable (a,b) where
  linearize (a,b) r = LinPair (linearize a (linearize b r))
  delinearize (LinPair abr) = let (a,br) = delinearize abr
                                  (b,r) = delinearize br
                              in ((a,b),r)

Or lists:

data instance Lin [a] r where
  LinNil :: r -> Lin [a] r
  LinCons :: Lin a (Lin [a] r) -> Lin [a] r

instance Linearizable a => Linearizable [a] where
  linearize [] r = LinNil r
  linearize (x:xs) r = LinCons (linearize x (linearize xs r))
  delinearize (LinNil r) = ([],r)
  delinearize (LinCons aasr) = let (a,asr) = delinearize aasr
                                   (as,r) = delinearize asr
                               in (a:as,r)

The most interesting aspect of this, however, is not that this can be done. We ought to expect this can be done, since this is just a variation on the traversal theme where instead of using maps and constructor names and so forth, we define this funky Lin data family. No, what's most interesting is the types of the Lin constructors. Let's define some auxiliary types to make things cleaner:

type (~>) f g = forall a. f a -> g a
type (.:) f g a = f (g a)
type Id a = a

The first can be seen as roughly being the arrow for natural transformations, and the second is roughly functor composition. So what happens when we rewrite the Lin constructors with these?

LinBoolTrue  :: Id ~> Lin Bool
LinBoolFalse :: Id ~> Lin Bool

LinNatZero :: Id ~> Lin Nat
LinNatSuc  :: Lin Nat ~> Lin Nat

LinPair :: Lin a .: Lin b ~> Lin (a,b)

LinNil  :: Id ~> Lin [a]
LinCons :: Lin a .: Lin [a] ~> Lin [a]

These look suspiciously like the types (in category theoretic form) of the constructors of the types being encoded. The functorial arrow ~> corresponds to the type arrow ->, and functorial composition .: corresponds to pairing (,).

I don't quite know what the right categorical treatment of this is, but it seems that Lin is a functor from |Set to |Set^|Set that somehow preserves algebras in some way?

If you have comments or questions, get in touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).