Regular Expressions

posted by Darryl on 8 May 2016

A while back, I was playing with regular expressions and derivatives of datatypes (in the sense of McBride, which you can read about here). At some point it dawned on me that given any particular regular expression, the one-hole contexts into that expression have a tight connection to the states of a non-deterministic finite automata that recognizes the expression.

Note that this notion of differentiation is at the type level, not at the value level as described in Matt Might's blog post.

A repo with the source code in this post is on Github here.

Holey Regular Expressions

First, I should be very precise about what sorts of things count as regular expressions for the sake of this post. I mean to include only true regular expressions here, not "regex" in the common sense. That is, only things that represent regular languages, whereas regex in general includes all sorts of context sensitive features. In particular, I'll define a minimal set of regular expression features, namely, sequencing, alternatives, optionality, and repetition. This is just for simplicity right now, and extending it to other features can be done later.

So, we'll define a data type that represents regular expressions as follows:

infixr 8 :>:
infixr 9 :|:

data Regex
  = Lit Char         -- individual characters are regexes
  | Regex :>: Regex  -- if E and F are regexes, so is EF
  | Regex :|: Regex  -- if E and F are regexes, so is E|F
  | Opt Regex        -- if E is a regex, so is E?
  | Rep Regex        -- if E is a regex, so is E+

We can now look at this and ask, what are the one-hole contexts that can exist for sub-regexes? Clearly there are size: the left and right arguments of Seq and Alt, and the arguments of Opt and Rep. So we need to have a datatype that represents a regular expression with one such subexpression removed. Or, alternatively, we want to talk about the positions inside a regular expression that a subexpression can exist.

Lets consider the regular expression a(b|c+)d, which corresponds to the Regex value Lit 'a' :>: (Lit 'b' :|: Rep (Lit 'c')) :>: Lit 'd'. If we imagine plucking c out from this expression, we'll get an expression with a hole in it: a(b|_+)d and we can of course plug in some other expression, such as ef to get a(b|(ef)+)d.

To represent these, we'll use the tube representation of derivatives, which flattens out the path from the root to the hole into a list of local contexts. So what we'll need is a way of representing a hold in any particular constructor. That is, we need to represent a hole in the left argument position of Seq, the right argument position of Seq, etc.

data LocalRegexContext
  = InSeqL Regex  -- a Seq with the left arg removed
  | InSeqR Regex  -- a Seq with the right arg removed
  | InAltL Regex  -- an Alt with the left arg removed
  | InAltR Regex  -- an Alt with the right arg removed
  | InOpt         -- an Opt with its only arg removed
  | InRep         -- a Rep with its only arg removed

type RegexContext = [LocalRegexContext]

A list of these things corresponds to a regular expression with a single hole, because we can plug in any expression into the first such context, to form a new regular expression, then plug that into the next context, and so on, until we've built the entire new expression.

The Regexs that are arguments to the context constructors are the parts of the regular expression that haven't been removed. That is, if we remove the left expression in EF, we're left with _F where we still have F lying around. So E :>: F would become InSeqL F, for example.

Using the example from earlier of the holey regular expression a(b|_+)d, we would represent this as the list

demoCtx :: RegexContext
demoCtx =
  [ InRep
  , InAltR (Lit 'b')
  , InSeqL (Lit 'd')
  , InSeqR (Lit 'a')

We can define a function to plug a Regexp into a RegexContext relatively easily:

plug :: Regex -> LocalRegexContext -> Regex
plug e (InSeqL f) = e :>: f  -- e in _f is ef
plug e (InSeqR f) = f :>: e  -- e in f_ is fe
plug e (InAltL f) = e :|: f  -- e in _|f is e|f
plug e (InAltR f) = f :|: e  -- e in f|_ is f|e
plug e InOpt      = Opt e    -- e in _? is e?
plug e InRep      = Rep e    -- e in _+ is e+

To plug into a whole list of these, we of course just left fold with plug:

plugContext :: Regex -> RegexContext -> Regex
plugContext = foldl plug

  plugContext e demoContext
  == Lit 'a' :>: (Lit 'b' :|: Rep e) :>: Lit 'd'

Let's now consider a simple algorithm for turning regular expressions into non-deterministic finite automata. I won't give a formal definition, but instead just a visual presentation that transforms a regular expression into a state diagram.

I'll write a regular expression inside double brackets ⟦ and ⟧, which will transform into parts of the NFA that match the regular expression. Directed edges going to and from the brackets will represent the entry and exit points of the regular expression's sub-component of the NFA, connecting to the rest of the NFA that the expression is part of. We'll construct the NFA for a whole regular expression R by beginning with R in brackets with start and finish edges coming from it, like so:

The state that the start edge connects to will be the start state of the NFA, and the state that the finish edge connects to will be the final state of the NFA.

We need to gradually transform ⟦R⟧ into the NFA, so we need rules for doing this. The rules are fairly simple. If R is just a single character x, then we replace it with a state connected to its incoming edge, a state connected to its outgoing edge, and a new edge from the incoming to outgoing states labelled x, like so:

To handle sequencing, we'll transform it into recursive bracketed expressions for the parts of the sequence, again adding states similar to before, all connected in sequence:

Alternation is similarly easy to define. We'll add two states again as before, transform into recursive bracketed expressions connected in parallel to the states:

Optionality introduces states with an edge that can skip the recursive bracketed subexpression:

And repetition introduces states with an edge that lets you go back to the beginning of the subexpression:

Now, this isn't the most efficient representation, because it'll produce lots of states that are unnecessary, but this does ensure that every state corresponds to a subexpression of the whole regular expression. And moreover, each subexpression has two associated states, one that means "about to start the subexpression" and another that means "just finished the subexpression".

As an example, consider the process of building the NFA for a(b|c+)d:

You can notice that each time the translation expands, it introduces one state pointing into the recursions, and one state pointed to by the recursions. These are the "about to state" and "just finished" states, respectively, for the subexpression that was expanded.

States and Derivatives

Each subexpression of a regular expression can be thought of as the subexpression paired with the one-hole context. We need the context because the subexpression exists inside a larger expression, which the context represents. Additionally, each subexpression is associated with two states, a starting state and a finishing state. We can therefore represent the state of a regex by a triple:

data Direction = Starting | Finishing

type RegexState = (Direction, Regex, RegexContext)

Any given RegexState (c,d,e) corresponds to the state that is starting/finishing (d) the subexpression e in the context of the total regular expression plugContext e c.

State Transitions

For any regular expression e, the start state of its NFA is (Starting,e,[]) and the final state is (Finishing,e,[]), which means we have an easy way to jump into an NFA given just a regular expression, and we have an easy way of knowing when the expression has been fully matched.

If this were the only connection between derivatives and NFAs, it'd be interesting but boring. However, we can use the RegexState type to define the transition function for NFAs in a generic fashion. That is to say, we can define a function next :: RegexState -> [RegexState] which can compute the next state of the NFA that the specified state comes from, and moreover, the function is independent of the particular NFA. It's defined as follows:

next :: RegexState -> [RegexState]

-- this group corresponds to the edges out of starting states

next (Starting, Lit x, c)   = [ (Finishing, Lit x, c) ]
next (Starting, e :>: f, c) = [ (Starting, e, InSeqL f : c) ]
next (Starting, e :|: f, c) = [ (Starting, e, InAltL f : c)
                              , (Starting, f, InAltR e : c)
next (Starting, Opt e, c)   = [ (Starting, e, InOpt : c)
                              , (Finishing, Opt e, c)
next (Starting, Rep e, c)   = [ (Starting, e, InRep : c) ]

-- this group corresponds to the edges out of finishing states

next (Finishing, e, InSeqL f : c) = [ (Starting, f, InSeqR e : c) ]
next (Finishing, f, InSeqR e : c) = [ (Finishing, e :>: f, c) ]
next (Finishing, e, InAltL f : c) = [ (Finishing, e :|: f, c) ]
next (Finishing, f, InAltR e : c) = [ (Finishing, e :|: f, c) ]
next (Finishing, e, InOpt : c)    = [ (Finishing, Opt e, c) ]
next (Finishing, e, InRep : c)    = [ (Finishing, Rep e, c)
                                    , (Starting, e, InRep : c)
next _ = []

So, given any regular expression, we can turn it into the initial state of its NFA with no effort, and then transition around its NFA using next. We know when the NFA has finished, because we can just inspect the states to look for the final state.

Generating Strings

We wouldn't want to use the next function to actually transition around a regular expression for matching, however, because it's extremely inefficient, but it's perfectly reasonable for generating random strings that match a given regular expression. Each transition gives a list of next states, and whenever there's more than one, as for alternation and repetition, we can make a choice of which path to choose.

Let's define such a function. Rather than using actual randomness, we'll just conventionally chose to explore the first state in the list returned by next. This guarantees that we take left branches of alternation, and to always finish a repetition (thus avoiding loops).

generateFirst :: Regex -> String
generateFirst e = go (Starting, e, [])
    go :: RegexState -> String
    go s = case next s of
      [] -> ""
      [ s'@(Finishing, Lit x, c) ] -> x : go s'
      s':_ -> go s'

Using our trusty friend a(b|c+)d with generate, we get the expected:

generateFirst (Lit 'a' :>: (Lit 'b' :>: Rep (Lit 'c')) :>: Lit 'd')
== "abd"

The go function above is where the bulk of the generation happens. It steps the NFA to the next state, one state at a time, and if that state happens to be finishing a literal character expression, it sticks that character on the front of the rest of the NFA's output.

Unfortunately this is really inefficient, and won't generalize. Let's ignore for a moment the fact that we're picking the first next state and pretend that somehow we explore each next state. If we wanted to generate a string that's 10 characters long, and we use the above technique, we'd need to keep stepping until at some point we've generated 10 characters, and many of the states we end up in after a single transition won't be ones that have emitted a character.

It would be nicer if we instead had a function emit which transitioned the NFA as far as necessary to emit exactly one character. Then to get 10 characters, we just repeatedly apply the emit function 10 times. Let's define such a function.

emit :: RegexState -> [(Char,RegexState)]
emit s =
  do s' <- next s
     case s' of
       (Finishing, Lit x, _) -> [ (x,s') ]
       _ -> emit s'

Next, we'll need a function that's will try to run an NFA to its final state without emitting anything. This lets us know if the NFA can complete from its current state by making only empty transitions.

finishesTrivially :: RegexState -> Bool
finishesTrivially (Starting, Lit _, _) = False
finishesTrivially (Starting, e :>: f, c) =
  finishesTrivially (Starting, e, InSeqL f : c)
finishesTrivially (Starting, e :|: f, c) =
  finishesTrivially (Starting, e, InAltL f : c)
finishesTrivially (Starting, Opt e, c) =
  finishesTrivially (Finishing, Opt e, c)
finishesTrivially (Starting, Rep e, c) =
  finishesTrivially (Starting, e, InRep : c)
finishesTrivially (Finishing, _, []) = True
finishesTrivially (Finishing, e, InSeqL f : c) =
  finishesTrivially (Starting, f, InSeqR e : c)
finishesTrivially (Finishing, f, InSeqR e : c) =
  finishesTrivially (Finishing, e :>: f, c)
finishesTrivially (Finishing, e, InAltL f : c) =
  finishesTrivially (Finishing, e :>: f, c)
finishesTrivially (Finishing, f, InAltR e : c) =
  finishesTrivially (Finishing, e :>: f, c)
finishesTrivially (Finishing, e, InOpt : c) =
  finishesTrivially (Finishing, Opt e, c)
finishesTrivially (Finishing, e, InRep : c) =
  finishesTrivially (Finishing, Rep e, c)

Finally, we can generate by repeatedly stepping until we get to a state that can finish trivially:

generateN :: Int -> Regex -> [String]
generateN n e = go n [("", (Starting, e, []))]
    go :: Int -> [(String,RegexState)] -> [String]
    go 0 ss = [ reverse rcs | (rcs,s) <- ss, finishesTrivially s ]
    go n ss = let ss' = do
                    (rcs,s) <- ss
                    (c,s') <- emit s
                    return (c:rcs, s')
              in go (n-1) ss'

Testing this out on the expression a(b|c+)d for lengths 1 and 2, we predictably get nothing, for length 3 we get "abd" and "acd", for length 4 we get "accd", and so on.

Recognizing Strings

If we have an operation like emit, we can also do recognition of strings relatively easily. We simply emit characters, and get rid of all the states that didn't result from emitting the next character in the match string. This is the Thompson NFA algorithm, which you can read about here.

To start, let's define a single step consumer that couples emit with consumption of an input character:

consume :: Char -> RegexState -> [RegexState]
consume c s = [ s' | (x,s') <- emit s, c == x ]

What we'd like to do, then, is repeatedly consume the next character off a string until either the string is empty and one of the current states can finish trivially, in which case we've succeeded, or otherwise we've failed. So then:

recognize :: String -> Regex -> Bool
recognize cs e = go cs [(Starting, e, [])]
    go :: String -> [RegexState] -> Bool
    go "" ss = any finishesTrivially ss
    go (c:cs) ss = go cs [ s' | s <- ss, s' <- consume c s

This now gives us precisely what we want. If we test it on our old trust friend a(b|c+)d and the matching strings abd, acd, accd, and acccd, we get true for all of them, and if we try it on the non-matching strings abbd and efg, we get false.

Wrap Up

In an eventual future blog post, I plan to explore a general framework for encoding this approach that will make it easier to add new features to the matched expressions. The goal will be to be able to represent things such as capture groups and back references. There are also some interesting possibilities such as representing finite state transducers instead of just recognizers.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).


The keen reader may be wondering precisely where the differentiation is. Using a the functor algebra perspective on types, we define a functor that represents the shape of the regular expression's constructors as follows:

data RegexF r
  = LitF Char
  | SeqF r r
  | AltF r r
  | OptF r
  | RepF r

The type Regex is then isomorphic to the type μ RegexF aka Fix RegexF. Using the standard technique for taking the derivative of a functor, as described in McBride's work, we get ∂ RegexF aka RegexF':

data RegexF' r
  = InSeqL r
  | InSeqR r
  | InAltL r
  | InAltR r
  | InOpt
  | InRep

Evaluating the derivative at Regex itself, we get RegexF' Regex, which is exactly the type LocalRegexContext but with a different name.

If we were to use generic operators, we could instead define RegexF as follows:

type RegexF = K Char :+ (X :* X) :+ (X :* X) :+ X :+ X

where K Char is a constant, X is the argument position of the functor, (:+) is functor sum, and (:*) is functor product. This corresponds to the mathematical equation f = c + x*x + x*x + x + x. The derivative of f is of course ∂f = f' = x + x + 1 + 1, which corresponds to the type

type RegexF' = X :+ X :+ X :+ X :+ One :+ One

where One is the type with exactly one element.

This is exactly the generic operator representation of Regex'.

If this doesn't exactly make sense, I highly recommend reading McBride's work, particularly ∂ for Data: Differentiating Data Structures. Alternatively, with less formal math, but much better intuitions, this blog post by Pavel Panchekha.