Even though extprot supports arbitrarily complex data types (with structures, lists, arrays, tuples and disjoint unions), the encoding is reasonably simple. A surprisingly efficient (relative to Marshal.load, more on that below) "universal decoder" takes under 100 lines of Ruby code. It is "universal" in the sense that it can decode any extprot message without access to the protocol definition, by virtue of the format being self-describing. Being binary is not scary anymore.
A simple protocol / serialization format
This example is an adaptation of the one found in the Protocol Buffers tutorial for Python. The main difference is that there is no need to define a field ID in extprot (fields are encoded positionally). extprot's abstract syntax is more succinct, too:
type optional 'a = Unset | Set 'a
type phone_type = Mobile | Home | Work
message person = {
name : string;
id : int;
email : optional<string>;
phones : [ (string * phone_type) ]
}
message address_book = { persons : [ person ] }
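To make "encoded positionally" concrete, here is a toy Ruby sketch. It is not the extprot wire format: the `FIELDS` list and the three helper methods are hypothetical, and they only contrast two ways of identifying a field, either by an explicit ID stored next to each value or by the value's position in the message.

```ruby
# Toy illustration only: this is NOT the extprot wire format.
# FIELDS mirrors the order of the fields in the `person` message above.
FIELDS = [:name, :id, :email, :phones].freeze

# Field-ID style: every value travels together with an explicit numeric ID.
def encode_with_ids(record)
  FIELDS.each_with_index.map { |field, i| [i + 1, record.fetch(field)] }
end

# Positional style: only the values are emitted; position identifies the field.
def encode_positionally(record)
  FIELDS.map { |field| record.fetch(field) }
end

# Decoding pairs each value with the field declared at the same position.
def decode_positionally(values)
  FIELDS.zip(values).to_h
end

person = {
  name: "John Doe",
  id: 1234,
  email: "jdoe@example.com",
  phones: [["555-4321", :Home]]
}

p encode_with_ids(person)
p encode_positionally(person)
p decode_positionally(encode_positionally(person)) == person
```

The positional form carries less per-field overhead, at the price of making the declaration order part of the protocol.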
Let's create a person; I'll serialize it from OCaml and deserialize it in Ruby, using the "universal decoder". Here's how I generate the record from the OCaml toplevel (OCaml parlance for a REPL):
(The lines starting with "#" are my input, terminated by ";;"; the lines that follow each input are the REPL's response.)
# let p = { Person.name = "John Doe"; id = 1234; email = Set "jdoe@example.com";
phones = ["555-4321", Phone_type.Home] };;
val p : Person.person =
{Person.name = "John Doe"; Person.id = 1234;
Person.email = Optional.Set "jdoe@example.com";
Person.phones = [("555-4321", Phone_type.Home)]}
# Std.output_file ~filename:"person.dat" ~text:(Extprot.Conv.serialize Person.write_person p);;
- : unit = ()
This generates a 54-byte file with the serialized person:
(I have redefined the #inspect method in the classes representing tuples/structures and constants.)
Comparison to Marshal
The above structure, which took 54 bytes in the extprot serialization, weighs 145 bytes when serialized with Marshal.dump. The reason is that the latter includes the class and field names (which I argue is a bad thing to do, as it exposes implementation details) and doesn't use particularly efficient encodings of integers and other values.
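The following is a minimal sketch of how one might check the Marshal side of this comparison. The `Person` struct is a hypothetical stand-in, since the classes I actually used are not shown at this point, so the byte count it prints will not be exactly 145. It does show that the class and member names end up verbatim in the output.

```ruby
# Hypothetical stand-in for the decoded structure; the exact size printed
# here depends on the classes used and will differ from the 145 bytes above.
Person = Struct.new(:name, :id, :email, :phones)

person = Person.new("John Doe", 1234, "jdoe@example.com",
                    [["555-4321", :Home]])
dumped = Marshal.dump(person)

puts "Marshal.dump size: #{dumped.bytesize} bytes"

# The class name and the member names are stored as-is in the dump.
puts "contains class name:  #{dumped.include?('Person')}"
puts "contains member name: #{dumped.include?('phones')}"

# Sanity check: the dump round-trips to an equal structure.
puts "round trip ok: #{Marshal.load(dumped) == person}"
```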
To compare the relative speed of Marshal.load (written in C) and the naïve extprot decoder written in Ruby (deliberately left unoptimized for clarity; the code can be found below), I generated an address book with over 18000 random person records. Here's what I measured:
| Decoder | Serialization size (bytes) | Deserialization time |
|---|---|---|
| Marshal.load (from string; Ruby core method in C) | 3527449 | 1.29s |
| Marshal.load (from IO) | 3527449 | 1.65s |
| naïve extprot decoder, from IO (pure Ruby) | 1859128 | 3.03s |
I was quite surprised that Marshal.load is only a bit over twice as fast as the naïve decoder in its most favorable case (about 2.3x when loading from a string, and about 1.8x when loading from IO). Keep in mind that it's C code versus suboptimal Ruby code. A posteriori, I can think of a few possible reasons (memory allocation being slow, Marshal's rather inefficient encoding, etc.).
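For anyone who wants to reproduce the Marshal half of the measurement above, here is a rough sketch. The record generator, the `PersonRecord` struct and the `time_it` helper are all hypothetical (my own generator is not shown here), so both the sizes and the timings will differ from the table.

```ruby
require "stringio"

# Hypothetical record type and generator, for illustration only.
PersonRecord = Struct.new(:name, :id, :email, :phones)

def random_person(rng)
  name = Array.new(8) { rng.rand(97..122).chr }.join # 8 random letters a-z
  PersonRecord.new(name, rng.rand(1_000_000), "#{name}@example.com",
                   [[format("555-%04d", rng.rand(10_000)), :Home]])
end

# Runs the block and returns [result, elapsed seconds].
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

rng  = Random.new(42) # fixed seed so runs are comparable
book = Array.new(18_000) { random_person(rng) }
data = Marshal.dump(book)

from_string, t_string = time_it { Marshal.load(data) }
from_io,     t_io     = time_it { Marshal.load(StringIO.new(data)) }

puts "serialized size: #{data.bytesize} bytes"
puts format("Marshal.load (from string): %.3fs", t_string)
puts format("Marshal.load (from IO):     %.3fs", t_io)
```

A StringIO is used here only as a convenient IO-like source; reading from a real file would add disk and buffering effects to the timing.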
A universal decoder in under 100 lines of Ruby
First of all, let's create some classes for the complex data types, and a couple of exceptions for errors found while decoding: