Extensible Serialization

posted by Darryl on 1 Apr 2015

It's frequently useful to serialize data, for storage or transmission over a network, or whatever other reasons. For simple, well-circumscribed problems, serialization can be done easily with a Forth-like schema, but a more flexible serialization library that supports extensibility, such as Haskell's Data.Binary, can't use this technique. In this post, I'm going to explore why, and show at least one way to implement something roughly like Data.Binary.

Forth-y Serialization

Let's suppose we have a data type for binary trees with no labels, such as this:

data TreeN = LeafN | BranchN TreeN TreeN

Let's serialize this to a list of integers. Normally we'd use some other representation like byte strings or something of that nature, but for simplicity of the example, integers will suffice.

To do this, we need to represent each constructor — LeafN and BranchN — by a number, say, 0 and 1. If we do an enumeration of node constructors via a post-order traversal, we can produce a Forth-like, or RPN-like, representation:

serializeN :: TreeN -> [Int]
serializeN LeafN         = [0]
serializeN (BranchN l r) = serializeN l ++ serializeN r ++ [1]

So now, if we do some serializations, we get what we expect:

serializeN LeafN == [0]

serializeN (BranchN LeafN LeafN) == [0,0,1]

serializeN (BranchN LeafN (BranchN LeafN LeafN)) == [0,0,0,1,1]

serializeN (BranchN (BranchN LeafN LeafN) LeafN) == [0,0,1,0,1]

Deserializing this is relatively simple, using a stack-based approach. Reading a 0 pushes a LeafN onto the stack, reading a 1 pops two items from the stack, combines them with BranchN, and pushes the result. This process could fail, of course, so we'll account for that.

deserializeN :: [Int] -> Maybe TreeN
deserializeN codes = go [] codes
  where go [t]         []       = Just t
        go stack       (0:rest) = go (LeafN:stack) rest
        go (r:l:stack) (1:rest) = go (BranchN l r:stack) rest
        go _           _        = Nothing

And naturally, if we deserialize the above serialized trees, we get them right back, wrapped in Just.

Now let's suppose we want to serialize binary trees with boolean values in the leaves. Here's our type:

data TreeB = LeafB Bool | BranchB TreeB TreeB

Serializing is easy. We'll use 0 and 1 as before, and since Bool has two constructors, we'll use 2 and 3 for them:

serializeB :: TreeB -> [Int]
serializeB (LeafB True)  = [2,0]
serializeB (LeafB False) = [3,0]
serializeB (BranchB l r) = serializeB l ++ serializeB r ++ [1]

We can now write a deserializer easily enough, with the help of a tagging type for the stack:

data StackTagB = BoolTag Bool | TreeBTag TreeB

deserializeB :: [Int] -> Maybe TreeB
deserializeB codes = go [] codes
  where go [TreeBTag t] [] = Just t
        go stack (2:rest) = go (BoolTag True:stack) rest
        go stack (3:rest) = go (BoolTag False:stack) rest
        go (BoolTag b:stack) (0:rest)
          = go (TreeBTag (LeafB b):stack) rest
        go (TreeBTag r:TreeBTag l:stack) (1:rest)
          = go (TreeBTag (BranchB l r):stack) rest
        go _ _ = Nothing

This works, it's fine and dandy, so let's move on to another, more complex example: trees with integers in the leaves. Actually let's just say non-negative integers, for simplicity.

data TreeI = LeafI Int | BranchI TreeI TreeI

How might we write the serialization algorithm? Well, we need to store the integers somehow in the serialization in a way that distinguishes between the stored integers, and the integers-as-codes-for-constructors. But that's easy: we can shift them up by two, so that no conflicts are created:

serializeI :: TreeI -> [Int]
serializeI (LeafI n) = [n+2,0]
serializeI (BranchI l r) = serializeI l ++ serializeI r ++ [1]

Now deserialization, with a similar stack tag trick:

data StackTagI = IntTag Int | TreeITag TreeI

deserializeI :: [Int] -> Maybe TreeI
deserializeI codes = go [] codes
  where go [TreeITag t] [] = Just t
        go stack (n:rest)
          | n > 1 = go (IntTag (n-2):stack) rest
        go (IntTag n:stack) (0:rest)
          = go (TreeITag (LeafI n):stack) rest
        go (TreeITag r : TreeITag l : stack)  (1:rest)
          = go (TreeITag (BranchI l r):stack) rest
        go _ _ = Nothing

Now we're starting to see, however, that this is requiring more and more special coding. With booleans, we could just assign codes to the boolean values since there's only two. With integers, we need a way of shifting all of them out of the way of the codes for the constructors of the trees. As things get more complex, this schema will have to get more complex. And of course, more fragile.

Now what if we wanted a parametric tree, with whatever kind of data in the leaves, like so:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

How could we possibly serialize that? We'd need serializers for the leaf data not just the tree as a whole. So we need to define a serialization system that's extensible. We could make our serialize and deserialize functions methods of a type class:

class Serializable a where
  serialize :: a -> [Int]
  deserialize :: [Int] -> Maybe a

If we now try to implement an appropriate instance for Tree, parallel our definition for TreeI, we find it's not quite acceptable:


instance Serializable a => Serializable (Tree a) where
  serialize (Leaf x)     = serialize x ++ [0]
  serialize (Branch l r) = serialize l ++ serialize r ++ [1]
  deserialize codes = ...

When we first encounter a 0, how do we know it means Leaf instead of something else? When were serializing a type like TreeB or TreeI with a closed serialization function, we could pick the codes we wanted to use for each constructor, and even if the types were complex, we could at least circumscribed them, and determine what all the constructors could be, so there wouldn't be confusion. But since Tree is parametric, there's no way to force other instances of Serializable to not use 0 as a code. So maybe the first 0 means Leaf, or maybe it means the integer 0 or maybe it means something else entirely!

Put another way, what if we also had the instance


instance Serializable Int
  serialize n = [n]
  deserialize n = ...

Then we would get 0 and 1 occurring in serializations of Tree Int, but they could come either from the codes for Leaf and Branch, or from the codes for the integers 0 and 1! The add-2 trick won't work as before, because who again, who knows how many other types need serialization? Maybe it's actually add-3, or add-a-million, or add-infinity if we're serializing real numbers?

A Better Approach: Real Parsing

The problem we face is that, because of extensibility, the same code could mean different things depending on what type it came from. So, the only way to ensure we can deserialize is if we know what type the code should correspond to.

With a post-order serialization, this isn't easy, because the next code in the serialization corresponds to the leftmost node in the data that it comes from, which could be arbitrarily deep and of arbitrary type. But with a pre-order serialization, the next code in the serialization corresponds to the topmost node in the data that it comes from.

If we work top-down, we always know what what kind of data we're expecting, so we can use that to tell us what the code represents. If we're expecting a tree, and see the leaf code next, it corresponds to a leaf. If we're expecting a tree and see a branch code next, it corresponds to a branch and we should expect two more trees. If we're expecting an integer, and see any integer at all, even the leave or branch codes, we still treat it as an integer not a tree.

This does mean, however, that it's harder to know when we're done deserializing. Before, we were done when the code list was empty and there was nothing left to process, but now, it's a little harder. We'll still wait for the code list to be empty, but it has to be empty at the end of the total deserialization, so we'll instead define deserialization in terms of a more fundamental, partial-deserializer or parser, that works on difference lists.

class Serializable a where
  serialize :: a -> [Int]
  parse :: [Int] -> Maybe (a,[Int])

deserialize :: Serializable a => [Int] -> Maybe a
deserialize codes = case parse codes of
                      Just (x,[]) -> Just x
                      _           -> Nothing

The parse function will either return as much of a deserialization as possible, together with a remainder of the input that was not part of the serialized data, or it will fail because a code is incorrect for the expected type.

Here is an implementation for booleans and integers:

instance Serializable Bool where
  serialize True = [0]
  serialize False = [1]
  parse (0:rest) = Just (True,rest)
  parse (1:rest) = Just (False,rest)
  parse _        = Nothing

instance Serializable Int where
  serialize n = [n]
  parse (n:rest) = Just (n,rest)

Notice that the parse method for integers always succeeds, while for booleans it only succeeds when the next code is either 0 or 1. If we now tried to deserialize some codes for booleans or integers, we get:

deserialize [0] :: Maybe Bool
  == Just True

deserialize [0] :: Maybe Int
  == Just 0

deserialize [1] :: Maybe Bool
  == Just False

deserialize [1] :: Maybe Int
  == Just 1

deserialize [2] :: Maybe Bool
  == Nothing

deserialize [2] :: Maybe Int
  == Just 2

Or if we compare deserialization to parsing, on some non-codes:


parse [0,2] :: Maybe (Bool,[Int])
  == Just (True,[2])

deserialize [0,2] :: Maybe Bool
  == Nothing

This successfully parses because [0,2] starts with a code for True, and the remainder of the code might be for something else, but it fails to deserialize because the expectation of deserialization is that the code is for exactly a boolean, not just a boolean and maybe some other stuff.

Let's now implement trees:


instance Serializable a => Serializable (Tree a) where
  serialize (Leaf x)     = [0] ++ serialize x
  serialize (Branch l r) = [1] ++ serialize l ++ serialize r
  parse (0:rest) = case parse rest of
                     Nothing -> Nothing
                     Just (n,rest') -> Just (Leaf n,rest')
  parse (1:rest) = case parse rest of
                     Nothing -> Nothing
                     Just (l,rest') -> case parse rest' of
                       Nothing -> Nothing
                       Just (r,rest'') -> Just (Branch l r, rest'')
  parse _ = Nothing

Now we can serialize trees containing booleans, and deserialize them too, even tho they both use the same codes in different ways. Let's see how this unfolds in an example:


serialize (Leaf True)
  = [0,0]

The first 0 came from the Leaf node, and the second came from True. If we now parse:

parse [0,0] :: Maybe (Tree Bool, [Int])

Because we're trying to parse a tree, we use the method specified for Tree, which looks at the first code and recognizes it's a code for a leaf node, so it recursively tries to compute

parse [0] :: Maybe (Bool,[Int])

This recursive call is for Bool instead, and so uses that method. It recognizes that this is a code for True, and succeeds, returning Just (True,[]). The remainder is empty at this point.

We now return to the computation for parse [0,0] :: Maybe (Tree Bool, [Int]), and having successfully found a boolean, it succeeds and returns Just (Leaf True, []). The parse is finished, with an empty remainder. If we had called deserialize instead, we would have succeeded with Just (Leaf True).

Even tho we have 0 used to represent different things, because we're using different instances of Serializable for each, we get different results. The different implementations of parse that get called treat the same code differently.

This kind of approach allows us to choose serializations wisely, too. Suppose we want to serialize lists. A naive approach might be like so:

instance Serializable a => Serializable [a] where
  serialize []     = [0]
  serialize (x:xs) = [1] ++ serialize x ++ serialize xs
  parse (0:rest) = Just ([], rest)
  parse (1:rest) = case parse rest of
                     Nothing -> Nothing
                     Just (x,rest') -> case parse rest' of
                       Nothing -> Nothing
                       Just (xs,rest'') -> Just (x:xs, rest'')

But this isn't a very economical use of space. It requires one code, a single 1, for every cons node in the list. That is to say, if we have 100 elements in the list, we have 100 1 codes! If we used this to serialize [1,2,3,4], we would get [1,1,1,2,1,3,1,4,0]. We can be more clever. Knowing the length of the list, we can just specify upfront how many elements we have to parse:

instance Serializable a => Serializable [a] where
  serialize xs = [length xs] ++ concat (map serialize xs)
  parse (n:rest) = go n rest
    where go 0 rest = Just ([],rest)
          go n rest = case parse rest of
                        Nothing -> Nothing
                        Just (x,rest') -> case go (n-1) rest' of
                          Nothing -> Nothing
                          Just (xs,rest'') -> Just (x:xs,rest'')

If we serialize the list [1,2,3,4] we'd get just [4,1,2,3,4], with the first number representing the length of the list, and each subsequent one being just the appropriate code. If we serialized a list of lists, such as [[0,1],[2]], we'd get out [2,2,0,1,1,2], with the first and second 2s representing the lengths of the overall list and the first sublist, respectively, but the third 2 is the element of the sublist.

Monadic Parsing

The cascades of case expressions are a bit obnoxious, so it would be nice to avoid them if possible. The type of the parse method happens to be monadic in a, so we can define a nice interface:

data Parser a = Parser { runParser :: [Int] -> Maybe (a,[Int]) }

instance Functor Parser where
  fmap f ps = Parser $ \codes ->
                case runParser ps codes of
                  Nothing -> Nothing
                  Just (x, codes') -> Just (f x, codes')

instance Monad Parser where
  return x = Parser $ \codes -> Just (x,codes)
  ps >>= f = Parser $ \codes ->
               case runParser ps codes of
                 Nothing -> Nothing
                 Just (x, codes') -> runParser (f x) codes'

failed :: Parser a
failed = Parser $ \_ -> Nothing

getCode :: Parser Int
getCode = Parser $ \codes ->
            case codes of
              []     -> Nothing
              (x:xs) -> Just (x,xs)

With this interface, you can now avoid these cascading case expressions, because the readCode function does it for you. We'll redefine Serializable to use this instead, and give instances for Bool, Int, and Tree:

class Serializable a where
  serialize :: a -> [Int]
  parse :: Parser a

deserialize :: Serializable a => [Int] -> Maybe a
deserialize codes = case runParser parse codes of
                      Just (x,[]) -> Just x
                      _           -> Nothing

instance Serializable Bool where
  serialize True = [0]
  serialize False = [1]
  parse = do c <- getCode
             case c of
               0 -> return True
               1 -> return False
               _ -> failed

instance Serializable Int where
  serialize n = [n]
  parse = getCode

instance Serializable a => Serializable (Tree a) where
  serialize (Leaf x) = [0] ++ serialize x
  serialize (Branch l r) = [1] ++ serialize l ++ serialize r
  parse = do c <- getCode
             case c of
               0 -> do n <- parse
                       return (Leaf n)
               1 -> do l <- parse
                       r <- parse
                       return (Branch l r)
               _ -> failed

We can even take advantage of the Applicative and Alternative interfaces, once we've implemented the instances, together with some parser combinators, to make it even cleaner:

readCode :: Int -> Parser ()
readCode c = do c' <- getCode
                if c == c'
                then return ()
                else failed

instance Serializable Bool where
  serialize True = [0]
  serialize False = [1]
  parse = readCode 0 *> return True
      <|> readCode 1 *> return False

instance Serializable a => Serializable (Tree a) where
  serialize (Leaf x) = [0] ++ serialize x
  serialize (Branch l r) = [1] ++ serialize l ++ serialize r
  parse = readCode 0 *> (Leaf <$> parse)
      <|> readCode 1 *> (Branch <$> parse <*> parse)

We can define further monadic interfaces to make the construction of serializations nicer, too. This would be preferable when using a representation such as byte strings.

If you have comments or questions, get it touch. I'm @psygnisfive on Twitter, augur on freenode (in #languagengine and #haskell). Here's the HN thread if you prefer that mode.