Efficient Subset
Matching with Tries

posted by Darryl on 13 July 2015

For the dispatcher in the Language Engine JS library, it was necessary to be able to efficiently test a number of pre-defined sets for subset status of an arbitrary given set. In particular, the pre-defined subsets are the keyword argument name patterns, and the arbitrary given set is the set of actual keyword argument names. To implement this efficiently, I designed a nice little algorithm using tries, which I'm going to describe below (only using Haskell, not JavaScript).

The code associated with this blogpost, which includes relevant dependencies, can be found here.

The problem

Language Engine translates the meanings of sentences into descriptions of the events that the sentences describe. For example, the sentence "Brutus stabbed Caesar" can be thought of as being translated into a description Stabbed(stabber = Brutus, stabbee = Caesar), which intentionally looks like a keyword argument function call.

A command such as "move the file to Foo" might therefore look something like Move(mover = Computer, moved = TheFile, destination = Foo), which should then be passed along to the host app to be handled. It's also useful to be able to specify different arguments. For instance, "copy the file" vs. "copy the file to the desktop", where the former implicit just means copy it right here, while the second means put a copy on the desktop. So it's necessary to handle Copy when it's specified with just the keyword code>original, but also with the keyword destination.

In general, then, the problem is, for each "function", eg. Move, Copy, etc., store a mapping from sets of keywords to handler functions to call. For Copy we might store {original} -> f, {original,destination} -> g. But if we simply stored a simple map from sets (or lists, rather) to functions, then it's plausible that in general, the lookup could get inefficient.

Additionally, we don't want to be very strict with keywords as well. If we have an event with keywords {a,b,c} and we have a pattern of keywords {a,b}, it would be good to be able to handle this, as a kind of fall-through. That is, we don't want to require the pattern keywords to be exactly the actual ones, just a subset, so that we can handle special-case commands with general-case patterns.

The number of subsets of n keywords is 2n, so it'd be good if we could have efficient lookup, and using subset matching instead of exact set matching only increases the number of possible matches, so efficiency is obviously quite important here.

An abstract solution

The solution I used can be described first abstractly, to make the gist of it clear. The first move we want to make, since we're going to implement the sets as lists, is to sort all the lists of interest and make sure their elements are unique. This creates a canonical representation of any given set, so that we only ever have to make reference to a single representation instead of managing equivalences.

The second move we're going to make is to imagine all of the possible subsets, and their associated values, forming a table. We don't make use of this table except to guide the design of the implementation, but it's useful to have. Each keyword-cell will be filled in if the keyword set has that keyword, and left empty otherwise. So for example, the above example with original and destination would correspond to the table

original destination handler
f
g

Here's a more complex, abstract collection of keywords and mappings, just to help visualize a situation with many options:

abcdhandler
H0
H1
H2
H3
H4

Just looking at a table like this, you can probably see some of where this is going. To match a set of keywords against this, let's say the keywords [a,c,d], we'll make a new row, above the label row, for these:

 
abcdhandler
H0
H1
H2
H3
H4

We'll then look at the cells of the rows from left to right. Looking at the acolumn, we see that some patterns require it (the first two), and the rest don't require it. a is present in the input, so the match so far says that each pattern is a subset of the input:

 
abcdhandler
H0
H1
H2
H3
H4

The logic of this step is: the pattern cell matches if both it and the input are on, or the pattern cell is off (independent of the input cell).

Moving to the next column, we see that the third and fourth rows have their cells on, but the input doesn't have this cell on:

 
abcdhandler
H0
H1
H2
H3
H4

We can therefore eliminate these rows entirely from the rest of the match process:

 
abcdhandler
H0
H1
H2
H3
H4

Moving on to the ccolumn now, we find every remaining row matches:

 
abcdhandler
H0
H1
H2
H3
H4

And now the last column, again, everything matches:

 
abcdhandler
H0
H1
H2
H3
H4

So we can use handlers H0, H1, and H4. We'd prefer the most specific one, so we have to pick either H1 or H4. In cases like this, the implementation for Language Engine just picks the first one arbitrarily, but in practice, you should always set up your handlers so that this never even arises.

The trie solution

In practice, the possible keywords are not easily delimited. In fact, essentially any key is a possible key. So we can never build a full table for all possibilities and just insert rows appropriately, as it would have infinitely many columns. Instead, we're given just some (sort) list of keywords.

The first observation we'll make to get to the trie solution is to realize that each pattern/row has a "first-on" column, and for any input, all the columns to its left (i.e. with keywords sorted lower than the first-on column of the pattern) are irrelevant to the pattern.

The second observation is that we can collapse rows whenever their first-on columns are the same, at least locally. For instance, we don't need to test if the a column is on twice, we should really only test it once, for both handlers H0 and H1.

What we will do then is construct a trie, with nodes labeled by handlers or nothing, and the edges labeled by the successive keywords in the pattern. That is to say, the first node will have edges labeled by the first keywords in each pattern, and second nodes will have the second keywords in the corresponding patterns, and so forth.

The trie for the above abcd table is

Let's define the type of tries in Haskell:

data Trie k a = Node (Maybe a) (M.Map k (Trie k a))

We'll need a way to build tries, as well. It's inconvenient to build them by hand. So what we'll do is define some utility functions, one that builds trivial tries, and then a function that merges tries, so we can merge all the rows into one big trie.

simpleTrie :: Ord k => [k] -> a -> Trie k a
simpleTrie []     a = Node (Just a) M.empty
simpleTrie (k:ks) a = Node Nothing (M.fromList [(k, simpleTrie ks a)])


merge :: Ord k => [Trie k a] -> Trie k a
merge = foldl' aux (Node Nothing M.empty)
  where
    aux (Node ma ns) (Node ma' ns')
      = Node (maybe ma' Just ma)
             (M.unionWith aux ns ns')

If we now code up each pattern row, and then merge them together, we get exactly what we expect:

h0 = simpleTrie ["a"]     "H0"
h1 = simpleTrie ["a","d"] "H1"
h2 = simpleTrie ["b","c"] "H2"
h3 = simpleTrie ["b","d"] "H3"
h4 = simpleTrie ["c","d"] "h4"
tableTrie = merge [h0,h1,h2,h3,h4]

Next, we'll need a way to do lookups. A normal lookup of a sequence of keys would be easy, but that's not what we need. Firstly, we need to do a lookup that implements the above subset algorithm. Secondly, it needs to collect up all of the non-empty node labels and the paths to them, so that at the end, we can see which keyword sets matched and which handlers they correspond to.

We'll approach each of these in turn, starting with just doing a boolean lookup, to see if a list of keywords matches any branch of the trie using the above algorithm.

follow :: Ord k => [k] -> Trie k a -> Bool

-- when the path is finished, but node empty, we don't match
follow [] (Node Nothing  _) = False

-- when the path is finished, and the node has a value, we match
follow [] (Node (Just _) _) = True

-- when the path isn't empty, we recurse. for each each keyword, drop the
-- initial part of the path that's definitely < the key. if the path is
-- then empty it doesnt match, otherwise we follow, the rest of the path
-- on the subtrie so long as the next keys match
follow ks (Node ma ns)
  = M.foldrWithKey
      (\k t r -> r ||
        case dropWhile (< k) ks of
          [] -> False
          k':ks'
            | k == k'   -> follow ks' t
            | otherwise -> False)
      (case ma of { Nothing -> False ; Just _ -> True })
      ns

This process of dropping all the keywords that are < the edge's label is basically just a way of handling all of the left-most empty cells in one go to skip to these cells:

 
abcdhandler
H0
H1
H2
H3
H4

The drop constructs the appropriate subset of the input that we're matching. If the first element of the particular subset is equal to the specified cell symbol, they match, as in the first, second, and fifth cases. If not, as in cases three and four, the match for that particular alternative fails.

To capture the path, and the node, we need to only make some slight modifications:

follow :: Ord k => [k] -> Trie k a -> [(a, [k])]
follow ks t = aux [] ks t
  where
    -- when the path is finished, but node empty, we don't match
    aux ps [] (Node Nothing  _) = []
    
    -- when the path is finished, and the node has a value, we match
    aux ps [] (Node (Just a) _) = [(a,reverse ps)]
    
    -- when the path isn't empty, we recurse
    aux ps ks (Node ma ns)
      = M.foldrWithKey
          (\k t r -> r ++
            case dropWhile (< k) ks of
              [] -> []
              k':ks'
                | k == k'   -> aux (k:ps) ks' t
                | otherwise -> [])
          (case ma of { Nothing -> [] ; Just a -> [(a,ps)] })
          ns

Here we've pulled the function into an auxiliary function, and added an argument that accumulates the matched keywords. At each matching node, we capture the current keywords and the node value.

We now just need to take an arbitrary found pair (assuming there is one) with maximal length of the matched keyword set, and the associated value is the found value. If we had been storing appropriate functions at the nodes, we could build up a keyword-value map and give it to the function. However these details are relatively obvious and also less interesting.

Discussion

This process could be done equivalently (and perhaps more efficiently?) using a regular expression because the trie is equivalent to a non-deterministic finite state machine. We can also of course apply various standard techniques for optimizing tries.

I'm going to discuss another way to use tries, to efficiently compile pattern matches on algebraic data.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).