Symbolic Machine Learning

posted by Darryl on 25 December 2016

This post aims to provide a unifying approach to symbolic and non-symbolic techniques of artificial intelligence. Historically, symbolic techniques have been very desirable for representing richly structured domains of knowledge, but have had little impact on machine learning because of a lack of generally applicable techniques. Conversely, machine learning techniques tend to be highly numeric in nature, thus restricting their application domain, but within those domains they have excelled. This post outlines a general approach to representing symbolic problems that is completely amenable to non-symbolic techniques such as deep learning, bridging the gap once and for all. The application domains are quite extensive: from traditional symbolic AI problems such as theorem proving, robotic planning, program synthesis, inductive programming, natural language understanding, natural language generation, and game playing, to less traditional tasks such as chemical synthesis design, program refactoring, and fuzzy matching of symbolic data. This post provides overviews of how these tasks can be done in this framework. For the vast majority of these tasks, the solutions also constitute glass boxes: they are understandable rather than being black boxes.

This blog post is a synthesis of various ideas. The first is the idea to use graph representations for symbolic propositions, originating partially out of Martin Kay's work in natural language generation, but also in Neo-Davidsonian semantics and Minimal Recursion Semantics. The second is work by Rogers and Hahn on chemical fingerprints, by way of Unterthiner et al.'s work on toxicity prediction using deep learning. These two combined give the symbolic vector representations described below. The third is game playing techniques developed in recent years, which extend very well to problems in the symbolic domain.

This post also only gives an overview of the technique. Test cases showing how it can be applied to some simple domains, as a validation of the technique, will be posted in subsequent blogposts, and then linked here.

From Symbolic Data to Graphs

Symbolic data can be thought of as abstract syntax trees, and consequently as graphs. For example, the symbolic data Foo (Bar Baz) (Quux Doo Garply), which might be an element of some Haskell data type, can be seen as the tree

[Tree figure: Foo -> Bar, Foo -> Quux, Bar -> Baz, Quux -> Doo, Quux -> Garply]

This tree can then be translated into a graph that has labeled nodes and labeled directed edges, namely

[Graph figure: Foo -arg0-> Bar, Foo -arg1-> Quux, Bar -arg0-> Baz, Quux -arg0-> Doo, Quux -arg1-> Garply]

Notice we're using edge labels of the form argn for some n. We might alternatively want to use the label of the source node in place of arg, but that adds no information beyond what we already have from knowing the source label, so we use the generic version instead.
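
To make this translation concrete, here is a minimal sketch in Haskell. The names Term, Graph, and toGraph are inventions for illustration, not part of any established library; node ids are assigned in preorder, and each parent-to-child edge is labeled argn according to the child's position.

    -- A generic piece of symbolic data: a label applied to zero or more arguments.
    data Term = Term String [Term]
      deriving Show

    type NodeId = Int

    -- A graph with labeled nodes and labeled directed edges.
    data Graph = Graph
      { nodes :: [(NodeId, String)]          -- node id and node label
      , edges :: [(NodeId, String, NodeId)]  -- source, edge label ("arg0", ...), target
      } deriving Show

    -- Translate a term into a graph, assigning node ids in preorder.
    toGraph :: Term -> Graph
    toGraph t = snd (go 0 t)
      where
        go :: NodeId -> Term -> (NodeId, Graph)
        go n (Term lbl args) =
          foldl step (n + 1, Graph [(n, lbl)] []) (zip [0 :: Int ..] args)
          where
            step (next, Graph ns es) (i, arg) =
              let (next', Graph ns' es') = go next arg
              in  (next', Graph (ns ++ ns') (es ++ [(n, "arg" ++ show i, next)] ++ es'))

    -- For example:
    --   toGraph (Term "Foo" [Term "Bar" [Term "Baz" []], Term "Quux" [Term "Doo" [], Term "Garply" []]])
    -- has edges [(0,"arg0",1),(1,"arg0",2),(0,"arg1",3),(3,"arg0",4),(3,"arg1",5)].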

An important special case of symbolic data is binders. In general, symbolic data will be like the above, but representing variable binding is a place where the translation can be improved. There are two usual approaches to representing variable binding: real names, and de Bruijn indices. For example, the lambda term λx.λy.x could be represented as Lambda "x" (Lambda "y" (Variable "x")) using real names, or as Lambda (Lambda (Variable 1)) using de Bruijn indices. Translating these into graphs is straightforward, but this isn't necessarily what we want to do.

When using real names, a lambda node that has the name "x" and a variable node that has the name "x" are not guaranteed to be related to one another, because the name might be used in multiple places, for example, in the pair of functions 〈λx.x, λx.x〉. Of course, we could employ some kind of naming convention that says we only ever use each name with a single binder, but this is inelegant.

When using de Bruijn indices, a lambda node doesn't specify what variable it binds at all; variables instead count how many binders "up" in the tree their binder is located, and so neither lambdas nor variables, in isolation or even together, can tell you whether they're related. Establishing a relationship between a lambda and a variable requires the full path from the binder to the variable, and moreover some counting along that path.

The whole point of real names and de Bruijn indices is precisely to establish a relationship between variables and their binders in trees, because we can't just draw a line from the variable to its binder to indicate the relationship. But in graphs we can! In a graph representation of binders, we don't need to use real names or de Bruijn indices at all, we can simply make the lambda node point to the variable node. Therefore we can represent λx.λy.x by the graph

[Graph figure: nodes lambdax (Lambda), lambday (Lambda), varx (Variable); edges lambdax -arg0-> varx, lambdax -arg1-> lambday, lambday -arg1-> varx]

and similarly we can represent the pair 〈λx.x, λx.x〉 by

[Graph figure: nodes pair (Pair), lambdax1 (Lambda), lambdax2 (Lambda), varx1 (Variable), varx2 (Variable); edges pair -arg0-> lambdax1, pair -arg1-> lambdax2, lambdax1 -arg0-> varx1, lambdax1 -arg1-> varx1, lambdax2 -arg0-> varx2, lambdax2 -arg1-> varx2]

With no mention of real names or de Bruijn indices, there is no ambiguity about which lambda binds which variable, and the relationship between them is expressed directly by edges in the graph. These long-distance relationships between binders and variables thus become simple local relations in the graph representation, which is much cleaner and simpler.
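
As a sketch of how this might look in code, here is a conversion from de Bruijn-indexed lambda terms into graphs of this shape, reusing the Graph and NodeId types from the earlier sketch (Lam and lamToGraph are again illustrative names). Each Lambda node gets an arg0 edge to its Variable node and an arg1 edge to its body, and every occurrence of the variable reuses that same Variable node; unlike the figures above, this version allocates a Variable node even for unused binders.

    -- de Bruijn lambda terms: Var 0 refers to the nearest enclosing Lam.
    data Lam = Lam Lam | Var Int | App Lam Lam

    lamToGraph :: Lam -> Graph
    lamToGraph t = let (_, _, g) = go [] 0 t in g
      where
        -- env !! i is the Variable node allocated by the i-th enclosing Lam
        go :: [NodeId] -> NodeId -> Lam -> (NodeId, NodeId, Graph)
        go env next (Var i) =
          (env !! i, next, Graph [] [])  -- reuse the binder's Variable node
        go env next (Lam body) =
          let lamId = next
              varId = next + 1
              (bodyId, next', Graph ns es) = go (varId : env) (next + 2) body
          in  ( lamId, next'
              , Graph ([(lamId, "Lambda"), (varId, "Variable")] ++ ns)
                      ([(lamId, "arg0", varId), (lamId, "arg1", bodyId)] ++ es) )
        go env next (App f x) =
          let appId = next
              (fId, next1, Graph ns1 es1) = go env (next + 1) f
              (xId, next2, Graph ns2 es2) = go env next1 x
          in  ( appId, next2
              , Graph ((appId, "App") : ns1 ++ ns2)
                      ([(appId, "arg0", fId), (appId, "arg1", xId)] ++ es1 ++ es2) )

    -- lamToGraph (Lam (Lam (Var 1)))   -- i.e. λx.λy.x
    -- has edges [(0,"arg0",1),(0,"arg1",2),(2,"arg0",3),(2,"arg1",1)].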

For a more complete example, consider the symbolic proposition
∀x. ∃y. P x ∧ (Q y ∨ R x y), which would become the graph

[Graph figure: nodes forall (Forall), exists (Exists), and (And), or (Or), p (P), q (Q), r (R), varx (Variable), vary (Variable); edges forall -arg0-> varx, forall -arg1-> exists, exists -arg0-> vary, exists -arg1-> and, and -arg0-> p, and -arg1-> or, or -arg0-> q, or -arg1-> r, p -arg0-> varx, q -arg0-> vary, r -arg0-> varx, r -arg1-> vary]

From Graphs to Symbolic Vectors

With this translation into graphs in mind, we can now make a crucial observation: for any fixed set of node labels NL and edge labels EL, and for any natural number k, there is a finite number of graphs you can make with k nodes using those labels, ignoring whether or not these graphs correspond to complete pieces of symbolic data. For example, if k = 1, there are |NL| graphs, each consisting of a single node with some label. If k = 2, there are |NL|²(|EL| + 1) graphs, consisting of an ordered pair of labeled nodes with an optional labeled edge. By extension, there is also a finite (but somewhat larger) number of graphs with at most k nodes, and so we can enumerate them all.

There are some decision points here, it should be pointed out. For instance, we might want to count graphs by always including a node's label, or we might want to make labels optional when counting. This choice represents the ability to ignore certain information. Similarly, we might want edges to be always labeled, or optionally unlabeled, during counting. The right choice depends on the problem domain and its needs.

Using such an enumeration, we can then ask, for any given piece of symbolic data represented in graph form, how many times each graph in the enumeration occurs as a subgraph of that data. This produces a vector of counts which can be used to characterize the symbolic data.

For example, let's consider the node label set {Foo,Bar}, and the edge label set {arg0,arg1}, and let's take k = 2. The graphs that form the indices of a symbolic vector for these parameters are

[Figure: the ten index graphs: Foo; Bar; Foo -arg0-> Foo; Foo -arg1-> Foo; Foo -arg0-> Bar; Foo -arg1-> Bar; Bar -arg0-> Foo; Bar -arg1-> Foo; Bar -arg0-> Bar; Bar -arg1-> Bar]

A symbolic vector over these will therefore be a list of 10 numbers counting the occurrences of each of these little graphs. If we consider, for example, the simple graph that corresponds to the symbolic data Foo Bar Bar, namely

[Graph figure: Foo -arg0-> Bar, Foo -arg1-> Bar]

then the symbolic vector for this will be <1,2,0,0,1,1,0,0,0,0>. The similar, but distinct, symbolic data Foo (Bar Bar), which has the graph

[Graph figure: Foo -arg0-> Bar, Bar -arg0-> Bar]

gives the symbolic vector <1,2,0,0,1,0,0,0,1,0>. We can now compare these two vectors in various ways, such as cosine similarity, which in this case is 6/7, meaning they are fairly similar.
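
Here is a minimal sketch of this computation, reusing the Graph type from the first sketch (Index, symbolicVector, and cosine are illustrative names). The index set is exactly the enumeration above: one entry per node label, plus one entry per combination of source label, edge label, and target label.

    -- The subgraph "indices": single labeled nodes and single labeled edges.
    data Index = NodeIdx String | EdgeIdx String String String
      deriving (Eq, Show)

    indices :: [String] -> [String] -> [Index]
    indices nodeLabels edgeLabels =
      [ NodeIdx l | l <- nodeLabels ] ++
      [ EdgeIdx s e t | s <- nodeLabels, t <- nodeLabels, e <- edgeLabels ]

    -- How many times an index graph occurs as a subgraph.
    count :: Graph -> Index -> Int
    count (Graph ns _)  (NodeIdx l)     = length [ () | (_, l') <- ns, l' == l ]
    count (Graph ns es) (EdgeIdx s e t) =
      length [ () | (src, e', tgt) <- es
                  , e' == e
                  , lookup src ns == Just s
                  , lookup tgt ns == Just t ]

    symbolicVector :: [Index] -> Graph -> [Int]
    symbolicVector idxs g = map (count g) idxs

    cosine :: [Int] -> [Int] -> Double
    cosine u v = dot u v / (norm u * norm v)
      where
        dot a b = fromIntegral (sum (zipWith (*) a b))
        norm a  = sqrt (dot a a)

    -- With idxs = indices ["Foo","Bar"] ["arg0","arg1"], the two graphs above give
    -- [1,2,0,0,1,1,0,0,0,0] and [1,2,0,0,1,0,0,0,1,0], and their cosine is 6/7.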

This representation is lossy, however. For any fixed k, there will always be some cases where two different pieces of symbolic data are mapped to the same symbolic vector. This lossiness may have undesirable effects in the size ranges under consideration, in which case k would need to be increased until the amount of lossiness is acceptable.

Outline of Future Applications

Below are some outlines of future application domains. In particular, we discuss the various domains mentioned in the introduction. These are only outlines, though, because they're somewhat more difficult to implement than the test cases mentioned above. Either they require a lot of thought, time, and/or effort for reasons external to the machine learning portion, or they require lots of training data that doesn't currently exist.

One thing I do make note of below, however, is whether the domain in question has an extrinsic, formal concept of correctness which can be used to generate training data. This has been very useful in machine learning, especially for games and robotic control, because it means the quality of the resulting system is not limited by uncontrollable factors such as how much corpus data exists and has been adequately marked up. There is, for example, no useful quantity of naturally produced Go game training data, but because the game has fixed inputs and well-defined win conditions, we can simply play lots of (decreasingly) random games to generate new data and thereby get something like Alpha Go, which can beat world champions. Some of the domains below fit into this model, and are therefore the low-hanging fruit of experimentation.

Game Playing

Game playing with machine learning techniques is a good starting point for discussing future applications because these techniques can be generalized to many of the other domains.

Currently, there are many game playing systems that use feature vectors that encode game state as the input to machine learning algorithms. The algorithms act as classifiers of the input vectors. It's not immediately obvious how you can use this to play games, but the principle is simple: the overwhelming majority of games have an effectively finite number of inputs that can be used to control the game (e.g. mouse buttons, keyboard buttons, gamepad buttons), so we can classify any given game state by which input is best for that game state. We transform the problem of playing a game into the problem of answering questions like "should I press 'jump' right now?"

Currently, the way game state is modeled is relatively direct. For games like Chess or Go or other similar games, we can model it simply by taking the vector's indices to be the board positions, with the value at an index being the game piece at that position. A Go board, for instance, is just a vector of length 361 (19×19 board spaces) with 3 possible values for each: black, white, and unoccupied. Chess similarly has a vector of length 64 (8×8 board spaces) with 13 values: 6 piece types for each of 2 colors, plus unoccupied. Other games, such as Super Mario Bros., are represented by the actual image input for the game.
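
For the board-game case, the direct encoding is simple enough to write down as a quick sketch (the names are illustrative).

    -- A sketch of the direct Go encoding: one entry per board point.
    data Point = Unoccupied | Black | White
      deriving (Eq, Show, Enum, Bounded)

    type GoBoard = [Point]   -- length 361, row-major over the 19×19 board

    -- As a numeric feature vector for a learner:
    encode :: GoBoard -> [Int]
    encode = map fromEnum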

The symbolic approach outlined above makes it possible to represent game states in a much richer fashion. For games such as Go or Chess, probably no real change is necessary or even possible, but for games like Super Mario Bros., we can represent the state of a game by a large set of propositions that describe the game in useful ways. For instance, we might want to represent spatial relationships (above, below, in front of, behind, etc.), movement relationships (towards, away from), actions (jumping, crouching), and so forth. We might even want to represent more complicated event structures, such as running towards things or jumping over things, which involves multiple things related in complex ways.

For example, if Mario is jumping over a pipe and a Piranha Plant is coming out of the pipe, we might represent that as the set of propositions jumping(mario), over(mario,pipe_12), coming_out_of(piranha_plant_3,pipe_12), taking the conjunction to be implicit, since it adds nothing to the meaning. This can then be translated into a graph, like so:

[Graph figure: jumping -arg0-> mario, over -arg0-> mario, over -arg1-> pipe_12, coming_out_of -arg0-> piranha_plant_3, coming_out_of -arg1-> pipe_12]

The graph is then convertible to a symbolic vector as described above, and we can now use this as input to the classification algorithm to decide which move is best to make.
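
As a sketch of that pipeline, reusing the Graph, Index, and symbolicVector definitions from the earlier sketches (Prop, propsToGraph, Move, and chooseMove are illustrative inventions, not a real game API):

    import Data.List (nub, elemIndex)
    import Data.Maybe (fromJust)

    -- A proposition is a predicate applied to entities, e.g. ("over", ["mario","pipe_12"]).
    type Prop = (String, [String])

    -- Entities shared across propositions become shared nodes in the graph.
    propsToGraph :: [Prop] -> Graph
    propsToGraph props = Graph (entityNodes ++ predNodes) edgeList
      where
        entities    = nub (concatMap snd props)
        entityNodes = zip [0 ..] entities
        predNodes   = zip [length entities ..] (map fst props)
        entityId e  = fromJust (elemIndex e entities)
        edgeList    =
          concat [ [ (predId, "arg" ++ show i, entityId e) | (i, e) <- zip [0 :: Int ..] args ]
                 | ((predId, _), (_, args)) <- zip predNodes props ]

    data Move = Jump | Run | DoNothing
      deriving (Eq, Show)

    -- A trained model is treated here as just a function from symbolic vectors to moves.
    chooseMove :: ([Int] -> Move) -> [Index] -> [Prop] -> Move
    chooseMove model idxs = model . symbolicVector idxs . propsToGraph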

A further refinement of game playing can be made using techniques from robotic planning described below, treating the player character as a robot. Additionally, we can use the natural language generation techniques described below to improve the conversation component of many modern games, possibly producing real, non-trivial, non-scripted conversations, which have so far been impossible.

Since games have win conditions that are defined externally, we can automatically generate training data by playing huge numbers of games, using the same standard techniques currently in use. The only major difference is that we need sufficient access to the game to be able to construct these rich propositional representations of game state.

Theorem Proving

The structure of many/most modern theorem provers and proof assistants, such as Coq and Agda, involves representing proofs as trees, with propositions that need to be proven residing at designated leaf nodes. The act of proving is the act of applying inference rules to these leaf nodes until no unproven leaf nodes remain. Proof assistants are therefore a user-friendly interface for manipulating such proof trees using inference rules, while theorem provers perform the manipulations themselves, effectively working the controls of the proof assistant on their own.

This is obviously viewable as a special kind of game, where the proof tree, with a single node in focus, is the game state, and the inference rules are the controls. In fact, this view is already present in the literature on proof theory, in the form of dialogical games.

The precise way of representing things is subtle. For instance, it is probably very useful to have some amount of access to the surrounding proof for a given goal leaf node, as it's likely to be the case that some inference rules are obviously less useful in some contexts.

Additionally, rule use itself is not trivial, since many rules are parameterized. Consider, for instance, the left rules of the Intuitionistic Sequent Calculus. These rules apply to a whole set of propositions, requiring the prover to pick a proposition of the appropriate shape. There is therefore non-determinism in which proposition a rule applies to: you not only pick a left rule to use, you also have to specify what to use it on. We therefore need to think of inference rules not just as single decisions, but possibly as decision trees, where making one choice (such as Use A Left Rule) leads to another choice (such as Pick Proposition 3). Each of these choices may have additional context that can be used.
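
Here is a hypothetical sketch of such a two-step decision; Sequent, RuleChoice, and the scoring functions are stand-ins for a trained model's interface, not the API of any actual prover.

    data Formula = Atom String | Imp Formula Formula | And Formula Formula
      deriving Show

    -- hypotheses on the left, goal on the right
    data Sequent = Sequent [Formula] Formula
      deriving Show

    data RuleChoice
      = RightRule      -- act on the goal
      | LeftRule Int   -- act on the n-th hypothesis
      deriving Show

    -- First decide between rule families, then (for a left rule) decide which
    -- hypothesis to act on. Both decisions see the whole sequent as context.
    pickRule :: (Sequent -> (Double, Double))   -- scores for (right, left) rule families
             -> (Sequent -> Int -> Double)      -- score for using a left rule on hypothesis i
             -> Sequent -> RuleChoice
    pickRule scoreFamily scoreHyp s@(Sequent hyps _)
      | null hyps || rightScore >= leftScore = RightRule
      | otherwise                            = LeftRule best
      where
        (rightScore, leftScore) = scoreFamily s
        best = snd (maximum (zip (map (scoreHyp s) [0 .. length hyps - 1]) [0 :: Int ..]))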

Since the provability of logical propositions is determined entirely by the proof rules used for constructing proofs, and since propositions in those systems can be generated randomly, we can generate arbitrarily many propositions and attempt to prove them, thereby generating as much training data as desired. Again, we can use the standard game playing techniques mentioned above.
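
As a sketch of the data-generation side, here is a tiny random formula generator over the Formula type from the sketch above, using the random package's randomRIO; the distribution and depth bound are arbitrary choices for illustration.

    import System.Random (randomRIO)

    -- Generate a random formula of at most the given depth.
    randomFormula :: Int -> IO Formula
    randomFormula 0 = do
      n <- randomRIO (0, 2 :: Int)
      return (Atom ("p" ++ show n))
    randomFormula depth = do
      choice <- randomRIO (0, 2 :: Int)
      case choice of
        0 -> randomFormula 0
        1 -> Imp <$> randomFormula (depth - 1) <*> randomFormula (depth - 1)
        _ -> And <$> randomFormula (depth - 1) <*> randomFormula (depth - 1)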

Robotic Planning

Robotic planning has been modeled at various times as a problem of logic and theorem proving, so naturally any such technique can be immediately treated as a theorem proving problem as described above. One important example of this approach to planning can be found in the work of Uluç Saranli and Frank Pfenning, who use linear logic to represent the facts about the world, and also the possible changes of state that can occur. Via the Curry-Howard Correspondence, proofs of possible changes of state turn out to be programs which, when run (e.g. by controlling a robot), will bring about those changes of state.

These techniques can naturally be applied to games as well, since game characters can be seen as indistinguishable from robots from the perspective of the computer. Work by Chris Martens shows how to apply linear logic to representing games of all sorts. We can therefore use this work, together with theorem proving as outlined above, to provide machine learning techniques for games with very richly structured worlds.

For game planning, training data can be generated in the obvious way. For actual physical robots, however, data is limited by access to actual robots or representative simulations. Using real robots imposes a substantial cost on generating training data. In principle it's possible, as Google has shown with its use of many robotic arms performing hundreds of thousands of attempts to pick up assorted objects. In practice, however, this kind of training data generation is limited to only very well funded, or very patient, research projects.

Inductive Programming

Program construction can itself be seen through the same lens as theorem proving and game playing, only instead of constructing a proof tree, the task is to construct an abstract syntax tree for the program. When performed by a human, this is usually done with text, but one alternative is to use what's called a structural editor, where the interactions are operations on the AST of the program directly. For example, the user might invoke an operation called New Function, which creates a rough outline of a function and then presents several sub-decisions to make, such as the function's name and its argument names.

By treating a structural editor just as we treat a theorem prover above, we can control the editor with a machine learning classifier. For inductive programming, the inputs to the classifier would be examples of input-output behavior that we want the program to produce. In the simplest case, we would naturally have just a single input-output pair, requiring two symbolic vectors, one representing the input data and one representing the output data. A more practical setup would be to have a large number of such pairs, so that we can provide many examples. We can use this technique to explore the search space of possible programs, checking candidate programs against the supplied examples.
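
For instance, candidate programs explored by such a search can be filtered against the examples with something as simple as the following sketch, where Value, Example, and the evaluator are placeholders for whatever the target language actually provides.

    -- Values of the object language (a placeholder).
    data Value = VInt Int | VList [Value]
      deriving (Eq, Show)

    -- An example is an (input, expected output) pair.
    type Example = (Value, Value)

    -- A candidate program is kept only if evaluating it reproduces every example.
    -- The evaluator may return Nothing on ill-formed or partial candidates.
    consistent :: (Value -> Maybe Value) -> [Example] -> Bool
    consistent eval examples =
      all (\(input, expected) -> eval input == Just expected) examples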

This technique has been explored by Martin Hofmann in a non-machine-learning context, with good results on a variety of simple problems. It's unknown whether the technique scales to more complex problems, however. Alternatives may need to be found for more general algorithm design tasks.

The space of possible programs can be narrowed further by combining this with the type-directed synthesis approach given below.

This domain isn't entirely capable of generating its own training data. While it's possible to randomly generate input-output pairs, that seems like an unproductive approach. Instead of generating data for random functions, however, we can sample known functions: we could ingest existing standard libraries and randomly sample their functions to generate example data.
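
Here is a sketch of that sampling idea; mkExamples is an invented helper, and reverse merely stands in for any library function we might mine.

    -- Turn a known function plus some sampled inputs into training examples.
    mkExamples :: (a -> b) -> [a] -> [(a, b)]
    mkExamples f inputs = [ (x, f x) | x <- inputs ]

    -- e.g. mkExamples reverse ["", "ab", "abc"]
    --        == [("",""),("ab","ba"),("abc","cba")]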

Program Synthesis

Contrasting with inductive programming, which works from example behavior, program synthesis constructs programs satisfying a specification, typically in the form of a set of propositions that constrain the value, or a type that the value must have. For example, if we're defining a function f, we could say that it must be such that for every x > 5, it is the case that f(x) < 3. Alternatively, we might specify that f has some type (a -> b) -> [a] -> [b]. In fact, in a sufficiently rich type system, such as those used by dependently typed languages like Agda, propositions can be turned into types, thereby making the entire problem uniformly about types.

In this context, then, we can use the structural editor approach mentioned above, together with the theorem proving approach's representation of input vectors. We can more or less just treat the program synthesis problem as a case of theorem proving, thanks to the Curry-Howard correspondence, which says types are propositions and programs are proofs. So, by proving a proposition (i.e. type), we're constructing a proof (i.e. program) satisfying it. This merely requires that our programming language has types rich enough to represent the properties we want its programs to have.
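
To make the type-as-specification idea concrete: the type from the example above already constrains the program heavily, and one inhabitant a synthesizer could plausibly arrive at is the standard map function.

    -- The specification is the type; the program below is one inhabitant of it.
    synthesized :: (a -> b) -> [a] -> [b]
    synthesized = map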

This technique has been explored by the functional programming community as a whole over the last few decades, with much work on program calculation and synthesis to be found. Additionally, Hofmann's techniques are applicable here as well. There are also a number of implementations of program synthesis for languages such as Haskell, such as Magic Haskeller, but without the machine learning aspects suggested here.

This task can more readily generate training data than inductive programming, simply because it's a special case of theorem proving, and so random types and constraining propositions can be generated.

Natural Language Understanding

Many problems in natural language understanding, such as disambiguation (especially attachment ambiguities, pronominal reference ambiguities, and quantifier scope ambiguities) would benefit greatly from having statistical models that make use of the meanings of the sentences that are ambiguous. But typical representations of propositions as symbolic abstract syntax trees with binders make it somewhat difficult to do this. The symbolic vector representation presented above makes it trivial to collect statistics on the relevant portions of meanings.

Unlike other techniques mentioned here, natural language understanding is very difficult to generate training data for. This is because the standard for correctness is humans themselves, not some formal process. As a result, as with other linguistic data such as the Penn Tree Bank, trained humans are required to analyze and mark up data with symbolic meanings.

Natural Language Generation

Natural language generation techniques using graph representations of meanings have been around for a while, thanks to work by Kay. As mentioned above, this blog post is partially the result of a synthesis of this technique with others. In that setting, the goal is to encode a full proposition using grammatical rules annotated with subgraphs that indicate which parts of meaning the rule encodes. It's therefore very similar to a theorem proving problem, and we can use the above techniques for natural language generation, by driving the selection of grammar rules by the overall meaning and the portion of the overall meaning that remains to be encoded.

As with natural language understanding, this task doesn't lend itself well to generation of new training data, because the standard of correctness is humans.

Chemical Synthesis

Chemical synthesis problems, where we try to find ways to synthesize a particular molecule, are potentially addressable using these techniques as well. Since chemicals can be modeled as graphs, the vector representations that Unterthiner et al. use can serve as representations of the target molecule. Chemical reactions, in the form of reaction formulas, can be seen as a kind of inference rule. Putting these two things together, we can use the theorem proving techniques described above to search for chemical synthesis pathways. In this approach, molecules would form the analog of propositions and known chemical reactions would be the inference rules. This would be a kind of linear theorem proving, allowing us to make use of various efficiencies that are gained from linear logic proof search, as well as the ability to view linear inference as state transformation (the state here being the chemicals in the reaction vessel).

If you have comments or questions, get in touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell). If you prefer Hacker News, there's a thread here.