*posted by Darryl on 1 Apr 2015*

It's frequently useful to serialize data, for storage or transmission over a network, or whatever other reasons. For simple, well-circumscribed problems, serialization can be done easily with a Forth-like schema, but a more flexible serialization library that supports extensibility, such as Haskell's `Data.Binary`

, can't use this technique. In this post, I'm going to explore why, and show at least one way to implement something roughly like `Data.Binary`

.

### Forth-y Serialization

Let's suppose we have a data type for binary trees with no labels, such as this:

`data TreeN = LeafN | BranchN TreeN TreeN`

Let's serialize this to a list of integers. Normally we'd use some other representation like byte strings or something of that nature, but for simplicity of the example, integers will suffice.

To do this, we need to represent each constructor — `LeafN`

and `BranchN`

— by a number, say, `0`

and `1`

. If we do an enumeration of node constructors via a post-order traversal, we can produce a Forth-like, or RPN-like, representation:

```
serializeN :: TreeN -> [Int]
serializeN LeafN = [0]
serializeN (BranchN l r) = serializeN l ++ serializeN r ++ [1]
```

So now, if we do some serializations, we get what we expect:

```
serializeN LeafN == [0]
serializeN (BranchN LeafN LeafN) == [0,0,1]
serializeN (BranchN LeafN (BranchN LeafN LeafN)) == [0,0,0,1,1]
serializeN (BranchN (BranchN LeafN LeafN) LeafN) == [0,0,1,0,1]
```

Deserializing this is relatively simple, using a stack-based approach. Reading a `0`

pushes a `LeafN`

onto the stack, reading a `1`

pops two items from the stack, combines them with `BranchN`

, and pushes the result. This process could fail, of course, so we'll account for that.

```
deserializeN :: [Int] -> Maybe TreeN
deserializeN codes = go [] codes
where go [t] [] = Just t
go stack (0:rest) = go (LeafN:stack) rest
go (r:l:stack) (1:rest) = go (BranchN l r:stack) rest
go _ _ = Nothing
```

And naturally, if we deserialize the above serialized trees, we get them right back, wrapped in `Just`

.

Now let's suppose we want to serialize binary trees with boolean values in the leaves. Here's our type:

`data TreeB = LeafB Bool | BranchB TreeB TreeB`

Serializing is easy. We'll use `0`

and `1`

as before, and since `Bool`

has two constructors, we'll use `2`

and 3 for them:

```
serializeB :: TreeB -> [Int]
serializeB (LeafB True) = [2,0]
serializeB (LeafB False) = [3,0]
serializeB (BranchB l r) = serializeB l ++ serializeB r ++ [1]
```

We can now write a deserializer easily enough, with the help of a tagging type for the stack:

```
data StackTagB = BoolTag Bool | TreeBTag TreeB
deserializeB :: [Int] -> Maybe TreeB
deserializeB codes = go [] codes
where go [TreeBTag t] [] = Just t
go stack (2:rest) = go (BoolTag True:stack) rest
go stack (3:rest) = go (BoolTag False:stack) rest
go (BoolTag b:stack) (0:rest)
= go (TreeBTag (LeafB b):stack) rest
go (TreeBTag r:TreeBTag l:stack) (1:rest)
= go (TreeBTag (BranchB l r):stack) rest
go _ _ = Nothing
```

This works, it's fine and dandy, so let's move on to another, more complex example: trees with integers in the leaves. Actually let's just say non-negative integers, for simplicity.

`data TreeI = LeafI Int | BranchI TreeI TreeI`

How might we write the serialization algorithm? Well, we need to store the integers somehow in the serialization in a way that distinguishes between the stored integers, and the integers-as-codes-for-constructors. But that's easy: we can shift them up by two, so that no conflicts are created:

```
serializeI :: TreeI -> [Int]
serializeI (LeafI n) = [n+2,0]
serializeI (BranchI l r) = serializeI l ++ serializeI r ++ [1]
```

Now deserialization, with a similar stack tag trick:

```
data StackTagI = IntTag Int | TreeITag TreeI
deserializeI :: [Int] -> Maybe TreeI
deserializeI codes = go [] codes
where go [TreeITag t] [] = Just t
go stack (n:rest)
| n > 1 = go (IntTag (n-2):stack) rest
go (IntTag n:stack) (0:rest)
= go (TreeITag (LeafI n):stack) rest
go (TreeITag r : TreeITag l : stack) (1:rest)
= go (TreeITag (BranchI l r):stack) rest
go _ _ = Nothing
```

Now we're starting to see, however, that this is requiring more and more special coding. With booleans, we could just assign codes to the boolean values since there's only two. With integers, we need a way of shifting all of them out of the way of the codes for the constructors of the trees. As things get more complex, this schema will have to get more complex. And of course, more fragile.

Now what if we wanted a parametric tree, with whatever kind of data in the leaves, like so:

`data Tree a = Leaf a | Branch (Tree a) (Tree a)`

How could we possibly serialize that? We'd need serializers for the leaf data not just the tree as a whole. So we need to define a serialization system that's *extensible*. We could make our `serialize`

and `deserialize`

functions methods of a type class:

```
class Serializable a where
serialize :: a -> [Int]
deserialize :: [Int] -> Maybe a
```

If we now try to implement an appropriate instance for `Tree`

, parallel our definition for `TreeI`

, we find it's not quite acceptable:

```
instance Serializable a => Serializable (Tree a) where
serialize (Leaf x) = serialize x ++ [0]
serialize (Branch l r) = serialize l ++ serialize r ++ [1]
deserialize codes = ...
```

When we first encounter a `0`

, how do we know it means `Leaf`

instead of something else? When were serializing a type like `TreeB`

or `TreeI`

with a closed serialization function, we could pick the codes we wanted to use for each constructor, and even if the types were complex, we could at least circumscribed them, and determine what all the constructors could be, so there wouldn't be confusion. But since `Tree`

is parametric, there's no way to force other instances of `Serializable`

to not use `0`

as a code. So maybe the first `0`

means `Leaf`

, or maybe it means the integer `0`

or maybe it means something else entirely!

Put another way, what if we also had the instance

```
instance Serializable Int
serialize n = [n]
deserialize n = ...
```

Then we would get `0`

and `1`

occurring in serializations of `Tree Int`

, but they could come either from the codes for `Leaf`

and `Branch`

, or from the codes for the integers `0`

and `1`

! The add-2 trick won't work as before, because who again, who knows how many other types need serialization? Maybe it's actually add-3, or add-a-million, or add-infinity if we're serializing real numbers?

### A Better Approach: Real Parsing

The problem we face is that, because of extensibility, the same code could mean different things depending on what type it came from. So, the only way to ensure we can deserialize is if we know what type the code should correspond to.

With a post-order serialization, this isn't easy, because the next code in the serialization corresponds to the leftmost node in the data that it comes from, which could be arbitrarily deep and of arbitrary type. But with a *pre-order* serialization, the next code in the serialization corresponds to the *topmost* node in the data that it comes from.

If we work top-down, we always know what what kind of data we're expecting, so we can use that to tell us what the code represents. If we're expecting a tree, and see the leaf code next, it corresponds to a leaf. If we're expecting a tree and see a branch code next, it corresponds to a branch and we should expect two more trees. If we're expecting an integer, and see any integer at all, *even the leave or branch codes*, we still treat it as an integer not a tree.

This does mean, however, that it's harder to know when we're done deserializing. Before, we were done when the code list was empty and there was nothing left to process, but now, it's a little harder. We'll still wait for the code list to be empty, but it has to be empty at the end of the total deserialization, so we'll instead define deserialization in terms of a more fundamental, partial-deserializer or parser, that works on difference lists.

```
class Serializable a where
serialize :: a -> [Int]
parse :: [Int] -> Maybe (a,[Int])
deserialize :: Serializable a => [Int] -> Maybe a
deserialize codes = case parse codes of
Just (x,[]) -> Just x
_ -> Nothing
```

The `parse`

function will either return as much of a deserialization as possible, together with a remainder of the input that was not part of the serialized data, or it will fail because a code is incorrect for the expected type.

Here is an implementation for booleans and integers:

```
instance Serializable Bool where
serialize True = [0]
serialize False = [1]
parse (0:rest) = Just (True,rest)
parse (1:rest) = Just (False,rest)
parse _ = Nothing
instance Serializable Int where
serialize n = [n]
parse (n:rest) = Just (n,rest)
```

Notice that the `parse`

method for integers always succeeds, while for booleans it only succeeds when the next code is either `0`

or `1`

. If we now tried to deserialize some codes for booleans or integers, we get:

```
deserialize [0] :: Maybe Bool
== Just True
deserialize [0] :: Maybe Int
== Just 0
deserialize [1] :: Maybe Bool
== Just False
deserialize [1] :: Maybe Int
== Just 1
deserialize [2] :: Maybe Bool
== Nothing
deserialize [2] :: Maybe Int
== Just 2
```

Or if we compare deserialization to parsing, on some non-codes:

```
parse [0,2] :: Maybe (Bool,[Int])
== Just (True,[2])
deserialize [0,2] :: Maybe Bool
== Nothing
```

This successfully parses because `[0,2]`

starts with a code for `True`

, and the remainder of the code might be for something else, but it fails to deserialize because the expectation of deserialization is that the code is for *exactly* a boolean, not just a boolean and maybe some other stuff.

Let's now implement trees:

```
instance Serializable a => Serializable (Tree a) where
serialize (Leaf x) = [0] ++ serialize x
serialize (Branch l r) = [1] ++ serialize l ++ serialize r
parse (0:rest) = case parse rest of
Nothing -> Nothing
Just (n,rest') -> Just (Leaf n,rest')
parse (1:rest) = case parse rest of
Nothing -> Nothing
Just (l,rest') -> case parse rest' of
Nothing -> Nothing
Just (r,rest'') -> Just (Branch l r, rest'')
parse _ = Nothing
```

Now we can serialize trees containing booleans, and deserialize them too, even tho they both use the same codes in different ways. Let's see how this unfolds in an example:

```
serialize (Leaf True)
= [0,0]
```

The first `0`

came from the `Leaf`

node, and the second came from `True`

. If we now parse:

`parse [0,0] :: Maybe (Tree Bool, [Int])`

Because we're trying to parse a tree, we use the method specified for `Tree`

, which looks at the first code and recognizes it's a code for a leaf node, so it recursively tries to compute

`parse [0] :: Maybe (Bool,[Int])`

This recursive call is for `Bool`

instead, and so uses that method. It recognizes that this is a code for `True`

, and succeeds, returning `Just (True,[])`

. The remainder is empty at this point.

We now return to the computation for `parse [0,0] :: Maybe (Tree Bool, [Int])`

, and having successfully found a boolean, it succeeds and returns `Just (Leaf True, [])`

. The parse is finished, with an empty remainder. If we had called `deserialize`

instead, we would have succeeded with `Just (Leaf True)`

.

Even tho we have `0`

used to represent different things, because we're using different instances of `Serializable`

for each, we get different results. The different implementations of `parse`

that get called treat the same code differently.

This kind of approach allows us to choose serializations wisely, too. Suppose we want to serialize lists. A naive approach might be like so:

```
instance Serializable a => Serializable [a] where
serialize [] = [0]
serialize (x:xs) = [1] ++ serialize x ++ serialize xs
parse (0:rest) = Just ([], rest)
parse (1:rest) = case parse rest of
Nothing -> Nothing
Just (x,rest') -> case parse rest' of
Nothing -> Nothing
Just (xs,rest'') -> Just (x:xs, rest'')
```

But this isn't a very economical use of space. It requires one code, a single `1`

, for *every* cons node in the list. That is to say, if we have 100 elements in the list, we have 100 `1`

codes! If we used this to serialize `[1,2,3,4]`

, we would get `[1,1,1,2,1,3,1,4,0]`

. We can be more clever. Knowing the length of the list, we can just specify upfront how many elements we have to parse:

```
instance Serializable a => Serializable [a] where
serialize xs = [length xs] ++ concat (map serialize xs)
parse (n:rest) = go n rest
where go 0 rest = Just ([],rest)
go n rest = case parse rest of
Nothing -> Nothing
Just (x,rest') -> case go (n-1) rest' of
Nothing -> Nothing
Just (xs,rest'') -> Just (x:xs,rest'')
```

If we serialize the list `[1,2,3,4]`

we'd get just `[4,1,2,3,4]`

, with the first number representing the length of the list, and each subsequent one being just the appropriate code. If we serialized a list of lists, such as `[[0,1],[2]]`

, we'd get out `[2,2,0,1,1,2]`

, with the first and second `2`

s representing the lengths of the overall list and the first sublist, respectively, but the third `2`

is the element of the sublist.

### Monadic Parsing

The cascades of case expressions are a bit obnoxious, so it would be nice to avoid them if possible. The type of the `parse`

method happens to be monadic in `a`

, so we can define a nice interface:

```
data Parser a = Parser { runParser :: [Int] -> Maybe (a,[Int]) }
instance Functor Parser where
fmap f ps = Parser $ \codes ->
case runParser ps codes of
Nothing -> Nothing
Just (x, codes') -> Just (f x, codes')
instance Monad Parser where
return x = Parser $ \codes -> Just (x,codes)
ps >>= f = Parser $ \codes ->
case runParser ps codes of
Nothing -> Nothing
Just (x, codes') -> runParser (f x) codes'
failed :: Parser a
failed = Parser $ \_ -> Nothing
getCode :: Parser Int
getCode = Parser $ \codes ->
case codes of
[] -> Nothing
(x:xs) -> Just (x,xs)
```

With this interface, you can now avoid these cascading case expressions, because the `readCode`

function does it for you. We'll redefine `Serializable`

to use this instead, and give instances for `Bool`

, `Int`

, and `Tree`

:

```
class Serializable a where
serialize :: a -> [Int]
parse :: Parser a
deserialize :: Serializable a => [Int] -> Maybe a
deserialize codes = case runParser parse codes of
Just (x,[]) -> Just x
_ -> Nothing
instance Serializable Bool where
serialize True = [0]
serialize False = [1]
parse = do c <- getCode
case c of
0 -> return True
1 -> return False
_ -> failed
instance Serializable Int where
serialize n = [n]
parse = getCode
instance Serializable a => Serializable (Tree a) where
serialize (Leaf x) = [0] ++ serialize x
serialize (Branch l r) = [1] ++ serialize l ++ serialize r
parse = do c <- getCode
case c of
0 -> do n <- parse
return (Leaf n)
1 -> do l <- parse
r <- parse
return (Branch l r)
_ -> failed
```

We can even take advantage of the `Applicative`

and `Alternative`

interfaces, once we've implemented the instances, together with some parser combinators, to make it even cleaner:

```
readCode :: Int -> Parser ()
readCode c = do c' <- getCode
if c == c'
then return ()
else failed
instance Serializable Bool where
serialize True = [0]
serialize False = [1]
parse = readCode 0 *> return True
<|> readCode 1 *> return False
instance Serializable a => Serializable (Tree a) where
serialize (Leaf x) = [0] ++ serialize x
serialize (Branch l r) = [1] ++ serialize l ++ serialize r
parse = readCode 0 *> (Leaf <$> parse)
<|> readCode 1 *> (Branch <$> parse <*> parse)
```

We can define further monadic interfaces to make the construction of serializations nicer, too. This would be preferable when using a representation such as byte strings.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell). Here's the HN thread if you prefer that mode.