*posted by Darryl on 13 July 2015*

For the dispatcher in the Language Engine JS library, it was necessary to be able to efficiently test a number of pre-defined sets for subset status of an arbitrary given set. In particular, the pre-defined subsets are the keyword argument name patterns, and the arbitrary given set is the set of actual keyword argument names. To implement this efficiently, I designed a nice little algorithm using tries, which I'm going to describe below (only using Haskell, not JavaScript).

The code associated with this blogpost, which includes relevant dependencies, can be found here.

### The problem

Language Engine translates the meanings of sentences into descriptions of the events that the sentences describe. For example, the sentence "Brutus stabbed Caesar" can be thought of as being translated into a description `Stabbed(stabber = Brutus, stabbee = Caesar)`

, which intentionally looks like a keyword argument function call.

A command such as "move the file to Foo" might therefore look something like `Move(mover = Computer, moved = TheFile, destination = Foo)`

, which should then be passed along to the host app to be handled. It's also useful to be able to specify different arguments. For instance, "copy the file" vs. "copy the file to the desktop", where the former implicit just means copy it right here, while the second means put a copy on the desktop. So it's necessary to handle `Copy`

when it's specified with just the keyword code>original, but also with the keyword `destination`

.

In general, then, the problem is, for each "function", eg. `Move`

, `Copy`

, etc., store a mapping from sets of keywords to handler functions to call. For `Copy`

we might store `{original} -> f, {original,destination} -> g`

. But if we simply stored a simple map from sets (or lists, rather) to functions, then it's plausible that in general, the lookup could get inefficient.

Additionally, we don't want to be very strict with keywords as well. If we have an event with keywords `{a,b,c}`

and we have a pattern of keywords `{a,b}`

, it would be good to be able to handle this, as a kind of fall-through. That is, we don't want to require the pattern keywords to be exactly the actual ones, just a subset, so that we can handle special-case commands with general-case patterns.

The number of subsets of n keywords is 2^{n}, so it'd be good if we could have efficient lookup, and using subset matching instead of exact set matching only increases the number of possible matches, so efficiency is obviously quite important here.

### An abstract solution

The solution I used can be described first abstractly, to make the gist of it clear. The first move we want to make, since we're going to implement the sets as lists, is to sort all the lists of interest and make sure their elements are unique. This creates a canonical representation of any given set, so that we only ever have to make reference to a single representation instead of managing equivalences.

The second move we're going to make is to imagine all of the possible subsets, and their associated values, forming a table. We don't make use of this table except to guide the design of the implementation, but it's useful to have. Each keyword-cell will be filled in if the keyword set has that keyword, and left empty otherwise. So for example, the above example with `original`

and `destination`

would correspond to the table

original | destination | handler |

f | ||

g |

Here's a more complex, abstract collection of keywords and mappings, just to help visualize a situation with many options:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

Just looking at a table like this, you can probably see some of where this is going. To match a set of keywords against this, let's say the keywords `[a,c,d]`

, we'll make a new row, above the label row, for these:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

We'll then look at the cells of the rows from left to right. Looking at the `a`

column, we see that some patterns require it (the first two), and the rest don't require it. `a`

is present in the input, so the match so far says that each pattern is a subset of the input:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

The logic of this step is: the pattern cell matches if both it and the input are on, or the pattern cell is off (independent of the input cell).

Moving to the next column, we see that the third and fourth rows have their cells on, but the input doesn't have this cell on:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

We can therefore eliminate these rows entirely from the rest of the match process:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

Moving on to the `c`

column now, we find every remaining row matches:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

And now the last column, again, everything matches:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

So we can use handlers H_{0}, H_{1}, and H_{4}. We'd prefer the most specific one, so we have to pick either H_{1} or H_{4}. In cases like this, the implementation for Language Engine just picks the first one arbitrarily, but in practice, you should always set up your handlers so that this never even arises.

### The trie solution

In practice, the possible keywords are not easily delimited. In fact, essentially any key is a possible key. So we can never build a full table for all possibilities and just insert rows appropriately, as it would have infinitely many columns. Instead, we're given just some (sort) list of keywords.

The first observation we'll make to get to the trie solution is to realize that each pattern/row has a "first-on" column, and for any input, all the columns to its left (i.e. with keywords sorted lower than the first-on column of the pattern) are irrelevant to the pattern.

The second observation is that we can collapse rows whenever their first-on columns are the same, at least locally. For instance, we don't need to test if the `a`

column is on twice, we should really only test it once, for both handlers H_{0} and H_{1}.

What we will do then is construct a trie, with nodes labeled by handlers or nothing, and the edges labeled by the

The trie for the above abcd table is

Let's define the type of tries in Haskell:

`data Trie k a = Node (Maybe a) (M.Map k (Trie k a))`

We'll need a way to build tries, as well. It's inconvenient to build them by hand. So what we'll do is define some utility functions, one that builds trivial tries, and then a function that merges tries, so we can merge all the rows into one big trie.

```
simpleTrie :: Ord k => [k] -> a -> Trie k a
simpleTrie [] a = Node (Just a) M.empty
simpleTrie (k:ks) a = Node Nothing (M.fromList [(k, simpleTrie ks a)])
merge :: Ord k => [Trie k a] -> Trie k a
merge = foldl' aux (Node Nothing M.empty)
where
aux (Node ma ns) (Node ma' ns')
= Node (maybe ma' Just ma)
(M.unionWith aux ns ns')
```

If we now code up each pattern row, and then merge them together, we get exactly what we expect:

```
h0 = simpleTrie ["a"] "H0"
h1 = simpleTrie ["a","d"] "H1"
h2 = simpleTrie ["b","c"] "H2"
h3 = simpleTrie ["b","d"] "H3"
h4 = simpleTrie ["c","d"] "h4"
tableTrie = merge [h0,h1,h2,h3,h4]
```

Next, we'll need a way to do lookups. A normal lookup of a sequence of keys would be easy, but that's not what we need. Firstly, we need to do a lookup that implements the above subset algorithm. Secondly, it needs to collect up all of the non-empty node labels and the paths to them, so that at the end, we can see which keyword sets matched and which handlers they correspond to.

We'll approach each of these in turn, starting with just doing a boolean lookup, to see if a list of keywords matches any branch of the trie using the above algorithm.

```
follow :: Ord k => [k] -> Trie k a -> Bool
-- when the path is finished, but node empty, we don't match
follow [] (Node Nothing _) = False
-- when the path is finished, and the node has a value, we match
follow [] (Node (Just _) _) = True
-- when the path isn't empty, we recurse. for each each keyword, drop the
-- initial part of the path that's definitely < the key. if the path is
-- then empty it doesnt match, otherwise we follow, the rest of the path
-- on the subtrie so long as the next keys match
follow ks (Node ma ns)
= M.foldrWithKey
(\k t r -> r ||
case dropWhile (< k) ks of
[] -> False
k':ks'
| k == k' -> follow ks' t
| otherwise -> False)
(case ma of { Nothing -> False ; Just _ -> True })
ns
```

This process of dropping all the keywords that are < the edge's label is basically just a way of handling all of the left-most empty cells in one go to skip to these cells:

a | b | c | d | handler |

H_{0} |
||||

H_{1} |
||||

H_{2} |
||||

H_{3} |
||||

H_{4} |

The drop constructs the appropriate subset of the input that we're matching. If the first element of the particular subset is equal to the specified cell symbol, they match, as in the first, second, and fifth cases. If not, as in cases three and four, the match for that particular alternative fails.

To capture the path, and the node, we need to only make some slight modifications:

```
follow :: Ord k => [k] -> Trie k a -> [(a, [k])]
follow ks t = aux [] ks t
where
-- when the path is finished, but node empty, we don't match
aux ps [] (Node Nothing _) = []
-- when the path is finished, and the node has a value, we match
aux ps [] (Node (Just a) _) = [(a,reverse ps)]
-- when the path isn't empty, we recurse
aux ps ks (Node ma ns)
= M.foldrWithKey
(\k t r -> r ++
case dropWhile (< k) ks of
[] -> []
k':ks'
| k == k' -> aux (k:ps) ks' t
| otherwise -> [])
(case ma of { Nothing -> [] ; Just a -> [(a,ps)] })
ns
```

Here we've pulled the function into an auxiliary function, and added an argument that accumulates the matched keywords. At each matching node, we capture the current keywords and the node value.

We now just need to take an arbitrary found pair (assuming there is one) with maximal length of the matched keyword set, and the associated value is the found value. If we had been storing appropriate functions at the nodes, we could build up a keyword-value map and give it to the function. However these details are relatively obvious and also less interesting.

### Discussion

This process could be done equivalently (and perhaps more efficiently?) using a regular expression because the trie is equivalent to a non-deterministic finite state machine. We can also of course apply various standard techniques for optimizing tries.

I'm going to discuss another way to use tries, to efficiently compile pattern matches on algebraic data.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell).