Blog

Regex Edit Distance

posted by Darryl on 20 May 2016

The concept of edit distance is a useful one, allowing one to calculate how much difference there is between two strings. The idea has been generalized to trees, where instead of comparing two strings, you compare two n-ary branching trees. In this post, I'll show another generalization, which permits the comparison of a string with a regular expression to calculate how well the string matches the regular expression.

This post doesn't contain any actual code, just sketches of an algorithm, because how it gets implemented ends up being somewhat boring. I do, however, have a GitHub repo here with a Haskell implementation.
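
For reference, here's the ordinary string-to-string notion being generalized, as a minimal Haskell sketch (a naive exponential-time recursion for clarity; the function name is mine, not the repo's):

editDistance :: Eq a => [a] -> [a] -> Int
editDistance [] ys = length ys
editDistance xs [] = length xs
editDistance (x:xs) (y:ys)
  | x == y    = editDistance xs ys
  | otherwise = 1 + minimum
      [ editDistance xs (y:ys)  -- delete x
      , editDistance (x:xs) ys  -- insert y
      , editDistance xs ys      -- substitute x for y
      ]

The generalization in the post replaces one of the strings with a regular expression, measuring, roughly, how cheaply the string could be edited into something the expression matches.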

More ...

Differentiating Regular Expressions

posted by Darryl on 8 May 2016

A while back, I was playing with regular expressions and derivatives of datatypes (in the sense of McBride, which you can read about here). At some point it dawned on me that, given any particular regular expression, the one-hole contexts into that expression have a tight connection to the states of a non-deterministic finite automaton that recognizes the expression.

Note that this notion of differentiation is at the type level, not at the value level as described in Matt Might's blog post.

A repo with the source code in this post is on GitHub here.
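
To fix ideas, here's a bare-bones regular expression datatype of the sort being differentiated (a sketch only; the repo's actual definitions may differ):

data Regex
  = Fail             -- matches nothing
  | Empty            -- matches the empty string
  | Lit Char         -- matches a single character
  | Alt Regex Regex  -- alternation: r1|r2
  | Seq Regex Regex  -- sequencing: r1 r2
  | Star Regex       -- repetition: r*

The observation is then that the one-hole contexts into values of a type like this line up with the states of an NFA recognizing the expression.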

More ...

Generic Evaluators

posted by Darryl on 29 September 2015

This post is somewhat tangential to what I've been doing with Language Engine, but since it's about programming language design, and since I'm designing a language for LE, I guess it's kind of related.

The GitHub repo for this blog post can be found here.

The topic of today's post is an answer to a question I posed on Twitter a few days ago about whether or not it would be possible to define evaluation in a way that was "style generic". That is to say, whether you use the standard lazy or eager evaluation functions that just recurse using the host language's mechanism, or a stack machine that builds up evaluation contexts, or whatever, the evaluated term you produce at the end is the same — the only difference is the style of the evaluator, not the meaning of evaluation. So is there a way to define the lazy evaluator, or the eager evaluator, or the lazy stack machine, and so on, and then choose one or the other as you want, for use in your language, independent of the language itself? The answer, I think, is yes, and so I'll present my solution here.
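
To make the "same meaning, different style" point concrete, here are two evaluators for a toy arithmetic language that agree on every term, one recursing via the host language and one a stack machine with explicit contexts. This is just an illustration of the two styles, not the unifying construction:

data Term = Lit Int | Add Term Term

-- Style 1: recurse using the host language's own call stack.
evalRec :: Term -> Int
evalRec (Lit n)   = n
evalRec (Add s t) = evalRec s + evalRec t

-- Style 2: a stack machine that builds up evaluation contexts.
data Frame = AddL Term  -- right argument still pending
           | AddR Int   -- left argument already evaluated

evalMach :: Term -> Int
evalMach t0 = go t0 []
  where
    go (Lit n)   ks = ret n ks
    go (Add s t) ks = go s (AddL t : ks)
    ret n []            = n
    ret n (AddL t : ks) = go t (AddR n : ks)
    ret n (AddR m : ks) = ret (m + n) ks

For every t, evalRec t == evalMach t; the question is how to define evaluation once and recover each style from that single definition.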

More ...

Compiling Pattern Matching

posted by Darryl on 21 September 2015

Pattern matching is absolutely central to modern functional programming languages, and in the type theoretically informed ones, it forms the basis of all usage of user-defined data. Inspecting some piece of data, extracting part of it, and so on, are all done by pattern matching. Since it's so central, and so all-permeating, it's a good idea to make it as efficient as possible.

More ...

Abstract Chart Parsing

posted by Darryl on 17 July 2015

About a year ago I posted a tutorial on Github about the chart parser I'm using for Language Engine's linguistic parsing. In the intervening year, I've rethought parts of it, and come up with a new, more general, abstract approach to chart parsing that I'm going to call Abstract Chart Parsing, because it abstracts over the details of the chart and the rules in very useful ways. An alternative name, owing to its design, might be State Machine Chart Parsing.

More ...

Precedence with Haskell

posted by Darryl on 17 July 2015

I want to summarize an approach to precedence (for pretty printing and parsing) that I've recently come up with as part of a series of blog posts I'm writing. It's useful enough, I think, to separate out into its own post, because precedence is such a constant thorn in people's sides.
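
The core trick, in the style of Haskell's own showsPrec (a sketch of the general idea, not necessarily the exact scheme in the post): thread the precedence of the enclosing context through the pretty printer, and parenthesize only when that context binds tighter than the current node.

data Expr = Lit Int | Add Expr Expr | Mul Expr Expr

-- p is the precedence level of the surrounding context.
pretty :: Int -> Expr -> String
pretty _ (Lit n)   = show n
pretty p (Add l r) = paren (p > 6) (pretty 6 l ++ " + " ++ pretty 7 r)
pretty p (Mul l r) = paren (p > 7) (pretty 7 l ++ " * " ++ pretty 8 r)

paren :: Bool -> String -> String
paren True  s = "(" ++ s ++ ")"
paren False s = s

With this, pretty 0 (Mul (Add (Lit 1) (Lit 2)) (Lit 3)) produces "(1 + 2) * 3", while pretty 0 (Add (Lit 1) (Mul (Lit 2) (Lit 3))) produces "1 + 2 * 3".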

More ...

Efficient Subset Matching with Tries

posted by Darryl on 13 July 2015

For the dispatcher in the Language Engine JS library, it was necessary to be able to efficiently test a number of pre-defined sets for subset status of an arbitrary given set. In particular, the pre-defined subsets are the keyword argument name patterns, and the arbitrary given set is the set of actual keyword argument names. To implement this efficiently, I designed a nice little algorithm using tries, which I'm going to describe below (only using Haskell, not JavaScript).

The code associated with this blogpost, which includes relevant dependencies, can be found here.
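
Here's a simplified reconstruction of the idea (see the linked code for the real thing): store each pattern as a sorted list of names in a trie, and walk the trie against the sorted query, descending along an edge only when its key still occurs in the remainder of the query.

-- A trie over sorted key lists; the Bool marks a stored pattern ending here.
data Trie k = Trie Bool [(k, Trie k)]

-- All stored patterns that are subsets of the (sorted) query.
matches :: Ord k => Trie k -> [k] -> [[k]]
matches (Trie end kids) query =
  [ [] | end ] ++
  concat [ map (k :) (matches t rest)
         | (k, t) <- kids
         , Just rest <- [consume k query] ]
  where
    -- Skip query elements smaller than k; succeed only if k is present.
    consume k xs = case dropWhile (< k) xs of
                     (y:ys) | y == k -> Just ys
                     _               -> Nothing

Because both the trie paths and the query are sorted, each candidate pattern is checked in a single left-to-right pass over the query.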

More ...

Grammars as Data Types

posted by Darryl on 4 July 2015

There's an interesting connection between language and computer science in the form of a correspondence between grammars and inductively defined data types. I hope the correspondence isn't too surprising, but it's still worth making as explicit as possible, to see some variations.
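
As a one-line preview of the correspondence: each nonterminal becomes a datatype and each production becomes a constructor. For instance (my example, not necessarily one from the post):

-- Grammar:    Expr ::= number  |  Expr "+" Expr
-- Datatype:   one constructor per production
data Expr
  = Num Int         -- Expr ::= number
  | Plus Expr Expr  -- Expr ::= Expr "+" Expr

A parse tree for the grammar is then exactly a value of the datatype.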

More ...

Sierpinski Triangles in Bitwise Logic

posted by Darryl on 27 May 2015

Below you can see a Sierpinski Triangle drawn in a canvas. What makes this particular Sierpinski Triangle interesting is that it's generated by setting the color of each pixel to a thresholded bitwise & on the coordinates of the pixel. Check out the source code to see the precise details of how this is drawn, or read below for a deeper discussion.
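
The essence of the trick, transposed from canvas pixels to ASCII for brevity (a Haskell sketch; the post itself draws to a canvas in JavaScript): a point (x, y) is filled exactly when x .&. y == 0, that is, when the two coordinates share no set bits.

import Data.Bits ((.&.))

sierpinski :: Int -> [String]
sierpinski n =
  [ [ if x .&. y == 0 then '*' else ' ' | x <- [0 .. n - 1] ]
  | y <- [0 .. n - 1]
  ]

main :: IO ()
main = mapM_ putStrLn (sierpinski 32)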

More ...

Making Sense of Subtypes

posted by Darryl on 16 Apr 2015

Subtypes are notoriously easy to mess up when designing a programming language, but they really shouldn't be. I think part of the reason is that most people try to set out their subtyping rules by gut instinct, because they don't really know how to do it properly. I don't know of any good tutorial on this, so fair enough. Here's that tutorial.

More ...

Error Correction with Edit Distances

posted by Darryl on 14 Apr 2015

One of the main benefits of LTSG parsing was that it could narrow the parsing problem by guiding the parse, establishing subgoals, making the problem nice and recursive. The main drawback, however, was that no work would be done if the head of a phrase (including the whole sentence) was missing. On the other hand, Unger parsing could parse in a recursive goal-directed way regardless of missing words, but was unguided by head information, and had poor parse times due to non-determinism.

If we could somehow have the benefits of both approaches, without the drawbacks, we might have a shot at a good error correcting parser. One option that could be useful is to parse bottom up with a chart parser, to produce as much partial parse information as possible, and then work in a top-down fashion to find and correct the errors. This can be done somewhat efficiently, because most of the parse is already there.

More ...

Extensible Serialization

posted by Darryl on 24 Mar 2015

More ...

Why Compute?

posted by Darryl on 24 Mar 2015

Why do we compute things? What's the point of computing? This seems like a silly question, but the answer to it is interesting and connects a lot of different things.

More ...

Unger Parsing

posted by Darryl on 6 Mar 2015

In this third post in my series on parsing inputs that might have errors (previous post here), I'm going to talk about Unger parsing. Unger parsing is a simple parsing technique that can handle ambiguous grammars quite nicely, and turns out to be rather well suited to handling errors as well. I'll introduce Unger parsing, since I suspect most people are unfamiliar with it, then discuss its application to error handling, and then discuss its drawbacks.
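
The heart of Unger parsing is blunt: to match a rule whose right-hand side has k symbols against a substring, try every way of cutting the substring into k contiguous pieces and recurse on each piece. A sketch of just that enumeration (my code, not the post's):

-- All ways to split a list into k contiguous (possibly empty) pieces.
-- Unger parsing matches piece i against right-hand-side symbol i.
splits :: Int -> [a] -> [[[a]]]
splits 1 xs = [[xs]]
splits k xs =
  [ ys : rest
  | i <- [0 .. length xs]
  , let (ys, zs) = splitAt i xs
  , rest <- splits (k - 1) zs
  ]

One of the drawbacks is already visible: the number of splits grows combinatorially with k and the length of the input.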

More ...

Parsing LTSGs

posted by Darryl on 21 Feb 2015

In this blog post, I want to follow up on my previous post about robust error handling when parsing by discussing one interesting, though imperfect, solution to the problem that we get by using a Lexicalized Tree Substitution Grammar (LTSG), which is a non-adjunction fragment of a Lexicalized Tree Adjoining Grammar. If you're not familiar with TAGs or LTAGs, this tutorial provides an excellent overview. The section on constraint-based TAGs isn't necessary, just the ones up to LTAGs.

Apologies in advance for the images.

More ...

Parsing Malformed Input

posted by Darryl on 12 Feb 2015

In this post I want to talk about parsing inputs that might be malformed in some way. That is to say, this is about handling parse errors. Why might we want to do this? Well, if you're parsing a programming language, it's helpful to be able to inform the programmer why their code is malformed. If you're parsing a natural language, then there's a good chance that the speaker or writer won't provide perfectly grammatical sentences as input.

More ...

Grokking the Yoneda Lemma

posted by Darryl on 11 Feb 2015

Much has been written about the awesomeness that is the Yoneda Lemma. For instance, Edward Kmett said "When you need something to go faster, hit it with the Yoneda lemma. Still isn't fast enough? Use it again."

But what exactly is the Yoneda Lemma? What does it mean, and why does it mean that? That's what I'm going to address in this blog post, at least for the Haskell version of the lemma. If you're not familiar with Haskell, though... sorry.
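
For orientation, the Haskell rendering of the lemma is usually stated like this:

{-# LANGUAGE RankNTypes #-}

-- For any Functor f, (forall b. (a -> b) -> f b) is isomorphic to f a.
newtype Yoneda f a = Yoneda { runYoneda :: forall b. (a -> b) -> f b }

toYoneda :: Functor f => f a -> Yoneda f a
toYoneda fa = Yoneda (\f -> fmap f fa)

fromYoneda :: Yoneda f a -> f a
fromYoneda y = runYoneda y id

The speed quip comes from the fact that mapping over a Yoneda value merely composes functions, so repeated fmaps fuse into a single traversal of f.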

More ...

"You must be nuts!"

Why I'm building Language Engine

posted by Darryl on 23 Jan 2015

Heh, yeah, maybe. You've got to be a little nuts to start a startup, and probably quite nuts to start one with such audacious goals. But they're important goals.

More ...

Type Checking in JavaScript

posted by Darryl on 16 Jan 2015

I'd like to follow up on my previous blog post on implementing a simply typed lambda calculus in JavaScript by demonstrating a more complete, more natural type system. If you haven't read it already, you should do so. The previous type system was a little unnatural because it had no type inference mechanism, and so all of the relevant type information had to be provided in the programs themselves. This time, however, we'll build a type checker that has inference, so we can write more natural programs.

More ...

What Happened to Old School NLP?

posted by Darryl on 11 Jan 2015

This is my personal take on things, not necessarily an accurate history of the field. Take it with a huge grain of salt!

For the last 25 or so years, NLP has consisted almost entirely of fairly simple statistical models of language. Rather than using carefully built grammars and fancy parsers, rather than trying to build systems that rely on linguistics, modern NLP tries to learn as much as possible from scratch. NLP research is now, essentially, just a branch of machine learning rather than a branch of linguistics. What happened?

More ...

The Conversational AI Problem Landscape

posted by Darryl on 10 Jan 2015

In this blog post, I'm going to discuss some of the major problems that I think conversational AI faces in achieving truly human-like performance. The issues I have in mind are

- Syntax / Grammar
- Non-idiomatic Meaning
- Idiomatic Meaning
- Metaphors / Analogies
- Speaker's Intentions and Goals

Some of them are, in my opinion, relatively easy to solve, at least in terms of building the necessary programs/tools (data and learning are a separate issue), but others, especially the last one, are substantially harder. But hopefully I can convince you that at least the first four are something we can seriously consider working on.

More ...

Deep Learning Won't Cure the Language Cold

posted by Darryl on 5 Jan 2015

Every once in a while I get asked about what sort of machine learning I used to build Language Engine, and when I say I didn't use any, the response is something like "But deep learning is so big these days!" While I'm tempted to just pull a Peter Thiel and say this buzzword-y nonsense is just fashion, and suggest you avoid fashions like the plague, I want to explain a bit instead. So in this post, I'm going to elaborate on how deep learning relates to Language Engine, and why we don't use it, or any other kind of machine learning, and why future versions will only use minimal amounts.

More ...

Continuations for Quantification

posted by Darryl on 27 Dec 2014

In this post, I'm going to talk about quantifiers. Not quantifiers as in the type theoretical quantifiers Pi and Sigma, but rather natural language quantifiers like "some", "all", "no", etc. One thing we need to do in implementing Language Engine is have a way of representing the meanings of these, and more specifically, the scope-taking properties that they have.

For simple sentences like "John saw Susan", it's easy to pretend like "John" and "Susan" just pick out some entities, perhaps like so:

[[John]] = john
[[Susan]] = susan
[[saw]] = \x. \y. saw(y,x)
[[saw Susan]] = \y. saw(y, susan)
[[John saw Susan]] = saw(john, susan)

But this kind of analysis becomes tricky when we start considering sentences like "everyone saw Susan" or "John saw everyone".
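
A toy model of where this is headed (my sketch, with a tiny finite domain standing in for the world): a quantified noun phrase doesn't pick out an entity, it denotes a function that consumes the rest of the sentence, i.e. its own continuation.

type Entity = String

domain :: [Entity]
domain = ["john", "susan", "vir"]

-- "everyone" takes the remainder of the sentence, as a predicate on
-- entities, and quantifies over the whole domain.
everyone :: (Entity -> Bool) -> Bool
everyone k = all k domain

someone :: (Entity -> Bool) -> Bool
someone k = any k domain

-- a stub model of the "saw" relation, subject first
saw :: Entity -> Entity -> Bool
saw _ _ = True

-- "everyone saw Susan"
example :: Bool
example = everyone (\x -> saw x "susan")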

More ...

So you want to write a type checker...

posted by Darryl on 23 Dec 2014

So you want to write a type checker, but you don't know how. Well in this post, I'm going to show you how to implement one using JavaScript, for the simply typed lambda calculus with pairs and functions. JS is not an ideal language for this, unfortunately, because it lacks a good type system of its own, and makes a bunch of stuff really ugly and noisy, but in a way, that's ideal. It'll give you a sense of what you'd have to do in other languages with similarly impoverished type systems. The way I'm going to implement it should be relatively easy to translate into idiomatic code in various languages.

You can see the full code for this type checker on GitHub. There is also a Haskell version of the same type checker, for comparison. Question info at the bottom of this post as usual.

I'm assuming you know a little bit about type theory, such as having looked at the Pfenning videos that I posted in this little guide. If not, I highly recommend you brush up, otherwise some of this might be confusing.
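
As a taste of the overall shape, here's a compressed checker for the same calculus in Haskell rather than JavaScript (a sketch for comparison, not the post's code, which builds this up in JS step by step):

data Type = TBase String | TFun Type Type | TPair Type Type
  deriving Eq

data Term
  = Var String
  | Lam String Type Term | App Term Term
  | Pair Term Term | Fst Term | Snd Term

type Ctx = [(String, Type)]

-- Maybe's MonadFail instance turns pattern-match failure into Nothing.
typeOf :: Ctx -> Term -> Maybe Type
typeOf g (Var x)     = lookup x g
typeOf g (Lam x a m) = TFun a <$> typeOf ((x, a) : g) m
typeOf g (App m n)   = do TFun a b <- typeOf g m
                          a' <- typeOf g n
                          if a == a' then Just b else Nothing
typeOf g (Pair m n)  = TPair <$> typeOf g m <*> typeOf g n
typeOf g (Fst m)     = do TPair a _ <- typeOf g m
                          Just a
typeOf g (Snd m)     = do TPair _ b <- typeOf g m
                          Just b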

More ...

Paraphrases and Lexical Decomposition

posted by Darryl on 20 Dec 2014

In this post, I want to tackle an important issue when designing a lexicon for an app that uses Language Engine, namely: how do you cope with the variety of seemingly different ways that people can express the same or nearly-same idea? For example, consider the following interaction you could imagine taking place in an Instagram-like app:

User: Take a selfie!
App: Ok! *shutter* How's that?
User: Oh, it's awful!
App: What about with one of these filters?

This conversation, especially line 3 where the user says "it's awful", can be paraphrased many different ways. For instance, the user could have used a different adjective and said "it's horrible", or they could've combined a positive adjective with a negative by saying "it's not good". They could've alternatively used a verbal construction such as "I dislike it" or "I hate it", or a negated verbal construction, as in "I don't like it".

And yet these all basically express the same attitude towards the picture. A simple approach to meaning would end up giving us one meaning for each sentence, which would be pretty bad. Can we do better? Yes!

More ...

Pluralities in Semantics (part 3)

posted by Darryl on 14 Dec 2014

As promised last time, we can now tackle the problem of representing pluralities. Once we have events and adverbial modifiers separated off as in

drink(e,stephen,coffee) & time(e,past) & time(e,today)

it becomes relatively trivial to propose that we do the very same thing to the arguments of verb predicates, as well. There are a number of alternative ways to do this, but one option is to have special relations associated with the predicate.
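
For instance, with role names invented just for illustration, pulling the arguments out into their own event relations would look something like:

drink(e) & drinker(e,stephen) & drinkee(e,coffee) & time(e,past) & time(e,today)

Each argument now holds of the event through its own relation, which is what makes room for pluralities without set-valued argument positions.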

More ...

Pluralities in Semantics (part 2)

posted by Darryl on 12 Dec 2014

I ended the previous post by suggesting that a set-based approach to pluralities, as in

[[John and Susan lifted the crate]] = lifted({john,susan}, {crate})

would not work well for at least two reasons: [...]

The task is to present an alternative representation that doesn't have these problems. But before I can do that, we need to detour a little bit to discuss adverbial modification, which I will do in this post.

More ...

Pluralities in Semantics (part 1)

posted by Darryl on 10 Dec 2014

In this post I'm going to discuss some of the design issues encountered while implementing Language Engine, specifically the issue of plural noun phrases.

First, I need to briefly introduce some basic ideas about the meanings of sentences. Suppose that we want to interpret the sentence "Vir stabbed the Emperor" into some logical form as the meaning. A very traditional approach to this would be to assign it the meaning

stabbed(vir,emperor)

More ...