Surprisingly efficient deserialization (vs. Marshal) in pure Ruby

Posted: Jan 23, 2009 7:50 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: Surprisingly efficient deserialization (vs. Marshal) in pure Ruby
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

Even though extprot supports arbitrarily complex data types (with structures, lists, arrays, tuples and disjoint unions), the encoding is reasonably simple. A surprisingly efficient (relative to Marshal.load, more on that below) "universal decoder" takes under 100 lines of Ruby code. It is "universal" in the sense that it can decode any extprot message without access to the protocol definition, by virtue of the format being self-describing. Being binary is not scary anymore.

A simple protocol / serialization format

This example is an adaptation of the one found in the Protocol Buffers tutorial for Python. The main difference is that there is no need to define a field ID in extprot (fields are encoded positionally). extprot's abstract syntax is more succinct, too:

type optional 'a = Unset | Set 'a
type phone_type = Mobile | Home | Work

message person = {
  name : string;
  id : int;
  email : optional<string>;
  phones : [ (string * phone_type) ]
}

message address_book = { persons : [ person ] }

Let's create a person; I'll serialize it from OCaml and deserialize it in Ruby using the "universal decoder". Here's how I generate the record from the OCaml toplevel (OCaml parlance for the REPL):

(The lines starting with "#" are my input, terminated by ";;"; the ones following them are the response by the REPL.)

# let p = { Person.name = "John Doe"; id = 1234; email = Set "jdoe@example.com";
            phones = ["555-4321", Phone_type.Home] };;
val p : Person.person =
  {Person.name = "John Doe"; Person.id = 1234;
   Person.email = Optional.Set "jdoe@example.com";
   Person.phones = [("555-4321", Phone_type.Home)]}
# Std.output_file ~filename:"person.dat" ~text:(Extprot.Conv.serialize Person.write_person p);;
- : unit = ()

This generates a 54-byte file with the serialized person:

00000000  01 34 04 03 08 4a 6f 68  6e 20 44 6f 65 00 a4 13  |.4...John Doe...|
00000010  01 13 01 03 10 6a 64 6f  65 40 65 78 61 6d 70 6c  |.....jdoe@exampl|
00000020  65 2e 63 6f 6d 05 0f 01  01 0c 02 03 08 35 35 35  |e.com........555|
00000030  2d 34 33 32 31 1a                                 |-4321.|
00000036

And here's the output by the universal decoder in Ruby:

T0 ["John Doe", 1234, T0 ["jdoe@example.com"], [T0 ["555-4321", E1]]]

(I have redefined the #inspect method in the classes representing tuples/structures and constants.)
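To make the "self-describing" claim concrete, here is a minimal sketch of a decoder that handles just the 54 bytes shown above. It is not the decoder from the original post. The layout it assumes (a prefix holding the tag in the upper bits and a wire type in the low 4 bits, with wire types 0 = integer, 1 = tuple, 3 = bytes, 5 = list, 10 = enum) is inferred from the hex dump, and the names ExtTuple, ExtEnum, read_vint and decode are hypothetical. Other wire types are deliberately left unhandled.

```ruby
require 'stringio'

# Hypothetical stand-ins for the post's tuple and constant classes,
# with #inspect redefined to print like the output shown above.
ExtTuple = Struct.new(:tag, :elems) do
  def inspect
    "T#{tag} #{elems.inspect}"
  end
end

ExtEnum = Struct.new(:tag) do
  def inspect
    "E#{tag}"
  end
end

# Variable-length integer: 7 bits per byte, least significant group first;
# the high bit is set on every byte except the last.
def read_vint(io)
  value = 0
  shift = 0
  loop do
    byte = io.readbyte
    value |= (byte & 0x7f) << shift
    return value if byte < 0x80
    shift += 7
  end
end

# Decode one value. Wire type numbers are inferred from the dump above.
def decode(io)
  prefix = read_vint(io)
  tag = prefix >> 4
  wire_type = prefix & 0xf
  case wire_type
  when 0                      # integer, zigzag-encoded (a4 13 -> 1234)
    n = read_vint(io)
    (n >> 1) ^ -(n & 1)
  when 3                      # length-prefixed bytes
    io.read(read_vint(io))
  when 10                     # constant constructor: the prefix is all there is
    ExtEnum.new(tag)
  when 1, 5                   # tuple / list: byte length, element count, elements
    read_vint(io)             # byte length, not needed when decoding everything
    elems = Array.new(read_vint(io)) { decode(io) }
    wire_type == 1 ? ExtTuple.new(tag, elems) : elems
  else
    raise ArgumentError, "wire type #{wire_type} not handled in this sketch"
  end
end

# The 54 bytes from the hex dump above.
PERSON_HEX = %w[
  01 34 04 03 08 4a 6f 68 6e 20 44 6f 65 00 a4 13
  01 13 01 03 10 6a 64 6f 65 40 65 78 61 6d 70 6c
  65 2e 63 6f 6d 05 0f 01 01 0c 02 03 08 35 35 35
  2d 34 33 32 31 1a
].join

p decode(StringIO.new([PERSON_HEX].pack("H*")))
# prints: T0 ["John Doe", 1234, T0 ["jdoe@example.com"], [T0 ["555-4321", E1]]]
```

Because every tuple and list carries its own byte length, a decoder could also skip values it does not care about; this sketch simply reads everything.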

Comparison to Marshal

The above structure, which took 54 bytes in the extprot serialization, weighs 145 bytes when serialized with Marshal.dump. The reason is that the latter includes the class and field names (which I argue is a bad thing to do, as it exposes implementation details) and doesn't use particularly efficient encodings of integers and other values.
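The point about names being embedded can be checked directly. The sketch below uses a hypothetical SketchPerson struct (the post does not show the Ruby classes behind the 145-byte figure, so the byte count printed here will differ) and looks for the class and member names inside the Marshal output.

```ruby
# Hypothetical Ruby-side record; not the class used for the 145-byte figure.
SketchPerson = Struct.new(:name, :id, :email, :phones)

person = SketchPerson.new("John Doe", 1234, "jdoe@example.com",
                          [["555-4321", :home]])
dump = Marshal.dump(person)

puts "Marshal size: #{dump.bytesize} bytes"

# The class name and every member name are stored as text in the dump.
%w[SketchPerson name id email phones].each do |name|
  puts "#{name} present in dump: #{dump.include?(name)}"
end
```

Renaming the class or one of its members changes the serialized bytes, which is the implementation-detail leak mentioned above.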

To compare the relative speed of Marshal.load (written in C) and the naïve extprot decoder written in Ruby (deliberately left unoptimized for clarity; the code can be found below), I generated an address book with over 18000 random person records. Here's what I measured:

Decoder                                             | Serialization size (bytes) | Deserialization time
Marshal.load (from string; Ruby core method in C)   | 3527449                    | 1.29s
Marshal.load (from IO)                              | 3527449                    | 1.65s
naïve extprot decoder, from IO (pure Ruby)          | 1859128                    | 3.03s

Marshal.load being only a bit over twice as fast as the naïve decoder in its most favorable case (about 2.3x when loading from a string, about 1.8x when loading from IO) was quite a surprise. Keep in mind that it's C code versus suboptimal Ruby code. In hindsight, I can think of a few possible reasons (memory allocation being slow, Marshal's rather inefficient encoding, etc.).
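For readers who want to repeat the Marshal half of the measurement, here is a rough sketch of how the two Marshal.load rows could be timed. The record generator is made up for illustration (the post does not show how its random address book was built), StringIO stands in for a real file, and the helper names sample_people and elapsed are hypothetical. Timings will vary with the machine and Ruby version.

```ruby
require 'stringio'

# Synthetic records shaped like the person message, as plain arrays.
# Field values are invented; a fixed seed keeps runs comparable.
def sample_people(count, rng = Random.new(42))
  Array.new(count) do |i|
    ["person #{i}", rng.rand(1_000_000), "p#{i}@example.com",
     [["555-%04d" % rng.rand(10_000), rng.rand(3)]]]
  end
end

# Wall-clock seconds taken by the block, measured with a monotonic clock.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

people = sample_people(18_000)
dump = Marshal.dump(people)

from_string = elapsed { Marshal.load(dump) }
from_io     = elapsed { Marshal.load(StringIO.new(dump)) }

puts "dump size: #{dump.bytesize} bytes"
printf("Marshal.load from string: %.3fs\n", from_string)
printf("Marshal.load from IO:     %.3fs\n", from_io)
```

Running each measurement several times and taking the best result would reduce noise from garbage collection.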

A universal decoder in under 100 lines of Ruby

First of all, let's create some classes for the complex data types, and a couple exceptions for errors found while decoding:

Read more...
