Ruby Buzz Forum - Whose Variable Is It Anyway?

The more I think about Ruby in relation to other object programming languages I’ve worked with, the more I realize that there’s a continuum of static vs. dynamic typing.

Ruby fits close to one end of that continuum, and understanding this can help you see how best to use the language. I recently had a quick look at Russ Olsen’s new book Design Patterns in Ruby, and in particular at his section on the observer pattern. I’d just posted to ruby-talk about this pattern, how it was implemented in Smalltalk, and a more Rubyish implementation. I’ll get to that at the end of this article, but first I really feel the urge to talk about instance variables.

If we view a type as a particular interpretation of a memory layout, I see something like this

The two columns represent how the object’s innards “look” from outside the object, and inside the object (i.e. inside a method) respectively.

In Java, subject to access modifiers, instance variables, a.k.a. fields, can be directly accessed. No accessor method is required. In Smalltalk and Ruby, instance variables of an object are only accessible while executing one of the object’s methods. Both languages do provide a bypass mechanism (instance_variable_get and instance_variable_set for Ruby, instVarAt: and instVarAt:put: for Smalltalk), but these are themselves methods, and are to be used only in “emergencies” since they break the encapsulation of the object.

Static Instance Variable Binding

By static here, I mean that the code which accesses the instance variable uses information which is statically bound by the compiler. This is a subtlety which is lost on a lot of today’s programmers who don’t understand what a compiler does, which is to take the textual, human-written and human-readable source code and turn it into bits and bytes which can be executed by some form of computer. The older, wiser guys might just want to skim this section.

That computer might be a real computer, like an Intel processor, or a virtual computer in the form of a software-implemented virtual machine or interpreter. In either case, there is an instruction set which gives the repertoire of the machine. The program is executed by moving step by step, instruction by instruction. Now, if we have a simple C statement like:

int a = b;

Then the instruction sequence for an imaginary computer might be something like:

    load  reg2, 20(reg1)
    store reg2, 40(reg1)

This loads the second machine register from a word which is at an address 20 bytes after the address contained in machine register 1, and then stores that value into another word offset 40 bytes from register 1. Here a and b are intended to be local temporary variables, and I’ve decided that my compiler is using register 1 to point to the current activation stack frame. Those magic numbers, 20 and 40, are computed by the compiler as part of its function to map variables to memory locations.
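To make those two instructions concrete, here is a minimal Ruby sketch of the imaginary machine. The register names come from the listing above; the frame address 1000 and the value 42 are arbitrary assumptions chosen only for illustration.

```ruby
# Toy model of the imaginary machine: sparse byte-addressed memory plus registers.
memory = Hash.new(0)
regs   = { reg1: 1000, reg2: 0 }  # reg1 = address of the current stack frame (assumed)

memory[regs[:reg1] + 20] = 42     # pretend b, at offset 20, already holds 42

# load  reg2, 20(reg1)
regs[:reg2] = memory[regs[:reg1] + 20]
# store reg2, 40(reg1)
memory[regs[:reg1] + 40] = regs[:reg2]

puts memory[regs[:reg1] + 40]     # a, at offset 40, now holds b's value
```

Nothing in the executed instructions mentions a or b by name; only the offsets the compiler chose survive.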

The idea that instructions might be different lengths is quite common in instruction set design. Usually some number of bits at the beginning of the instruction is used to encode an ‘op code’ like load or store, add, or subtract, etc. Other bits are used to determine the presence and format of parameters for the operation. Different instruction sets have different addressing modes, which allow memory to be addressed in various ways, such as the mode used above which addresses memory as an offset from a location held in a register. Other addressing modes might add another register used to index elements in an array, for example. Most real instruction sets have some unit of length for instructions, so for a given machine architecture, all instructions might be one or more words, or one or more bytes.

Bytecodes Are An Instruction Set Format

The term “bytecode” simply names a particular form of instruction set, or rather a family of forms. Most people associate the term with Java, and a particular instruction set, although the term predates Java, being used by Smalltalk and probably before. It really means that the ‘machine code’ instructions are represented as a series of bytes. Many instructions are encoded by a single byte, although some require additional bytes in order to form a complete instruction. The general term bytecode simply means that the length unit for the instruction set is one byte.

Although Java and Smalltalk implementations typically use bytecode instruction sets for their virtual machines, the actual set of bytecodes differs, much as the instruction set for an Intel Core 2 Duo differs from the instruction set for a PowerPC G4.

Classical Instance Variable Binding, Smalltalk Style

Now let’s look at similar code in Smalltalk. In this article, I’m using the bytecodes defined by the out-of-print “Smalltalk-80: The Language and its Implementation”, a.k.a. “The Blue Book”; other Smalltalk implementations might be slightly different.

    a := b

    push_iv_4            # push instance-variable #4 onto the stack
    store_and_pop_iv_6   # store the top of the stack in instance-variable #6

In Smalltalk, those magic index numbers used to access instance variables are determined when the class definition is saved. In this case b turned out to be the 4th instance variable, and a the 6th. Smalltalk bytecodes are optimized for small objects: the first 16 instance variables can all be pushed or popped with a single-byte instruction. If an object has more than 16 instance variables, those beyond the 16th need to be accessed via an extended push or store instruction, which allows up to 64 instance variables to be accessed.
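The effect of those two bytecodes can be sketched in Ruby as a toy stack machine. This is only an illustration of index-based access, not how any Smalltalk VM is written; the class name SlotObject and its methods are hypothetical.

```ruby
# Toy stack machine: instance variables live in numbered slots, not under names.
class SlotObject
  attr_reader :slots

  def initialize(slot_count)
    @slots = Array.new(slot_count)
    @stack = []
  end

  # push_iv_N: push the value held in slot N onto the evaluation stack.
  def push_iv(index)
    @stack.push(@slots.fetch(index))
  end

  # store_and_pop_iv_N: pop the top of the stack into slot N.
  def store_and_pop_iv(index)
    raise IndexError, "no slot #{index}" unless index < @slots.size
    @slots[index] = @stack.pop
  end
end

obj = SlotObject.new(8)
obj.slots[4] = "value of b"   # assume the compiler assigned b to slot 4
obj.push_iv(4)
obj.store_and_pop_iv(6)       # and a to slot 6
puts obj.slots[6]
```

The variable names a and b appear only in the comments; the "compiled" calls carry nothing but the slot numbers.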

In Smalltalk, although instance variables aren’t typed, they are declared in a class definition message executed when the class definition is saved. Any time a class definition is saved, the indices for the instance variables of that class, and of any subclasses, are (re)computed, and any methods in the class and its subclasses are re-compiled to adjust the offsets. The instance variables defined in the topmost class get the first n offsets, each immediate subclass’s instance variables get sequential offsets starting with the next available, and so forth.
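The numbering scheme just described can be sketched in a few lines of Ruby. The method slot_numbers, the Shape and Circle classes, and the choice of 1-based numbering are all assumptions made for the example.

```ruby
# Assign slot numbers top-down: superclass variables first, then each subclass in turn.
# The hierarchy is given as [class_name, [instance variable names]] pairs, topmost first.
def slot_numbers(hierarchy)
  numbers = {}
  hierarchy.each do |_class_name, ivar_names|
    ivar_names.each { |name| numbers[name] = numbers.size + 1 }
  end
  numbers
end

before = slot_numbers([["Shape", %w[origin]],       ["Circle", %w[radius]]])
after  = slot_numbers([["Shape", %w[origin color]], ["Circle", %w[radius]]])

puts "radius slot before: #{before["radius"]}, after: #{after["radius"]}"
```

Adding color to the superclass moves radius to a different slot, which is why every method in the subclass that touches radius has to be recompiled.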

This is why I said above that inside a Smalltalk object, i.e. within its methods, the object is mapped statically. Changing the instance variable definitions requires re-compilation to avoid ‘type-errors’ in the methods.

Note that those ‘emergency-only’ methods instVarAt: and instVarAt:put: map to the push_iv and store_and_pop_iv bytecodes; the first argument to both is the instance variable index. This also means that they need to be used with care, since you need to know the offset of the instance variable. Now, at least Smalltalk can tell you if you try to access a nonexistent instance variable slot, but it can’t tell that you’re accessing the wrong slot.

Java Field Binding

In Java, offsets are not compiled directly into the bytecodes; there’s a level of indirection. Peter Haggar, with whom I used to work at IBM, wrote an article on Java bytecodes on developerWorks. I hope he won’t mind if I borrow one of his examples. Here’s a simple accessor method:

public String employeeName()
{
    return name;
}

    Method java.lang.String employeeName()
    0 aload_0
    1 getfield #5 <Field java.lang.String name>
    4 areturn

What this code does first is to push a reference to the current object, this, onto the stack. Then the getfield instruction uses its operand to replace that reference on top of the stack with the value of the field. So these two bytecodes (actually 4 bytes in total, as the offsets in the listing show) are roughly equivalent to the Smalltalk push_iv bytecode, but for two differences:

The first difference arises because in Java, unlike in Smalltalk, you can directly get and set public fields outside of the object’s methods, so since the object in question isn’t implied, it has to be specified.

The second difference is to allow for separate compilation. The actual Java VM specification doesn’t dictate how fields are mapped within objects, but the abstraction is to allow this mapping to be adjusted at the time classes are loaded. If a subclass is compiled separately from its superclass, it might get a new starting position for its fields every time it’s loaded if one of its superclasses has added or removed fields.

So in order to access a Java field, the compiler needs to know the type of the object containing the field. This is true whether we are inside a method or outside.

Instance Variables, The Ruby Way

In Ruby, instance variables aren’t declared, so offsets can’t be assigned statically. Instead, Ruby looks up instance variables dynamically, using the instance variable name rather than an offset. Again this matches the ‘emergency use’ messages: instance_variable_get and instance_variable_set take an instance variable name, complete with the “@” sigil, where the Smalltalk instVarAt: methods take an integer.
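A short sketch of that name-based access follows. The Employee class and its values are made up for the example; instance_variable_get and instance_variable_set are the standard methods named above.

```ruby
class Employee
  def initialize(name)
    @name = name
  end
end

emp = Employee.new("Pat")

# The variable is addressed by name, sigil included, as a symbol or a string.
original = emp.instance_variable_get(:@name)
emp.instance_variable_set("@name", "Sam")
updated = emp.instance_variable_get(:@name)

# Leaving off the "@" is rejected outright.
begin
  emp.instance_variable_get(:name)
  rejected = false
rescue NameError
  rejected = true
end

puts original, updated, rejected
```

As in Smalltalk, this bypasses the object’s encapsulation, so it belongs in the same “emergencies only” category.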

In Ruby 1.8, this lookup is implemented in a fairly straightforward fashion. With a few exceptions, which I won’t take the time to go into here, a Ruby object has a pointer named iv_tbl which points to a hash table which maps the instance variable names to values. In Ruby 1.9, the implementation is a bit more clever, but the effects are the same.

So Whose Variable IS it Anyway?

Which brings us back to the title of this article. In Java and Smalltalk, every instance of a given class has the same set of instance variables, albeit each with its own value. The variables come into existence when they are declared, and the class is compiled or the class definition is saved.

One thing I didn’t mention in the discussion of Smalltalk is that, because traditional Smalltalk implementations don’t separate the development environment from the run-time environment, when a class definition changes, besides requiring method recompilation for the class and its subclasses, any existing instances need to be mutated to either add or remove the changed instance variables. Back when he was working on the language Self, which has dynamic resolution of instance variables like Ruby, Dave Ungar used to like to kill various Smalltalk implementations by adding an instance variable to the Object class. The problem is that because we are trying to operate on the running system, the system usually trips over itself during such a change. I tried this a few weeks ago with Squeak, and although it warned me twice that I shouldn’t do that, it did try when I clicked that second “Are you sure” button, and crashed pretty quickly.

Ruby handles this as a matter of course, since instance variables are only added to individual objects when they are needed, and self inside a method really is duck-typed. Actually it is more than duck-typed, since the needed instance variables appear just in time.
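A small sketch shows instance variables appearing just in time. The Point class and its label= method are hypothetical; instance_variables is the standard reflection method.

```ruby
class Point
  def initialize(x, y)
    @x = x
    @y = y
  end

  # @label does not exist on a Point until this method runs on it.
  def label=(text)
    @label = text
  end
end

a = Point.new(1, 2)
b = Point.new(3, 4)
b.label = "corner"

puts a.instance_variables.inspect
puts b.instance_variables.inspect
```

Two instances of the same class end up with different sets of instance variables, something neither the Java nor the Smalltalk model allows.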

So You Mentioned The Observer Pattern, What’s All This Have To Do With That?

One of the things which got me thinking about this again was a thread on ruby-talk some weeks ago about Ruby garbage collection and some of the things which keep objects from being considered garbage and being collected. The Ruby GC tends to have problems if you use finalization and aren’t really careful about how you define your finalizers.

One of the classic gotchas in Smalltalk in this vein is the implementation of Object dependents, a.k.a. the Observer Pattern. Smalltalk provides a mechanism to add dependent objects to any other object which, when it wants to notify its dependents that it has changed, can simply send itself the changed message, which in turn sends each dependent the message update: with the changed object as the argument.

This is the basis of the Model View Controller design in Smalltalk. Views register as dependents on Models, and when a model changes, any Views depending on it can react. This is the genesis of the Observer pattern from the well-known Gang of Four Design Patterns book, where Model and View have been generalized to Subject and Observer respectively.

In Smalltalk the ability to manage a list of dependents and notify them on changes is something that every object can do, but very few objects actually use this capability. In order to avoid having an instance variable in every Smalltalk object to reference a dependents collection which is almost always empty, the default implementation actually keeps a global hash which maps objects with dependents to their dependent collection.

The problem with this default implementation is that once an object gains a dependent, the object and its dependent objects are permanently reachable, and therefore ineligible for garbage collection, unless the dependency is explicitly removed. As a result of this, the classes of most objects which actually have dependents reimplement the default methods to refer to the dependents collection via an instance variable in the object with dependents. Squeak, for example, provides a subclass of Object called Model which provides such a GC-friendly implementation.
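The global-table approach and its reachability problem can be sketched in Ruby. This is an analogue written for illustration, not Smalltalk’s actual code; the module DependentsRegistry and its methods are hypothetical.

```ruby
# One global, identity-keyed table maps each subject to its list of dependents.
module DependentsRegistry
  @table = {}.compare_by_identity

  def self.add(subject, dependent)
    (@table[subject] ||= []) << dependent
  end

  def self.remove(subject, dependent)
    list = @table[subject]
    return unless list
    list.delete(dependent)
    @table.delete(subject) if list.empty?
  end

  # True while the table still holds a reference to the subject.
  def self.tracked?(subject)
    @table.key?(subject)
  end
end

model = Object.new
view  = Object.new

DependentsRegistry.add(model, view)
held_after_add = DependentsRegistry.tracked?(model)

DependentsRegistry.remove(model, view)
held_after_remove = DependentsRegistry.tracked?(model)

puts held_after_add, held_after_remove
```

Until remove is called, the table itself refers to both objects, so dropping every other reference to them would still not make them collectable.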

Which brings me to the implementation of the observer pattern in Ruby. In his discussion of this pattern in his book, Russ Olsen provides a module which can be mixed into an object to allow it to have dependents:

module Subject
  def initialize
    @observers = []
  end

  def add_observer(&observer)
    @observers << observer
  end

  def delete_observer(observer)
    @observers.delete(observer)
  end

  def notify_observers
    @observers.each do |observer|
      observer.call(self)
    end
  end  
end

This is a nice Ruby spin on the pattern, in that the Observers can be blocks, or any other object which responds to a call method which takes the Subject as its argument.
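Here is one way the module might be used. The Subject module is repeated from the listing above so the sketch runs on its own; the Employee class, its salary= method, and the payroll observer are assumptions made up for the example. Note the explicit super() call: a class that defines its own initialize has to invoke the module’s initialize, or @observers is never created.

```ruby
module Subject
  def initialize
    @observers = []
  end

  def add_observer(&observer)
    @observers << observer
  end

  def delete_observer(observer)
    @observers.delete(observer)
  end

  def notify_observers
    @observers.each { |observer| observer.call(self) }
  end
end

class Employee
  include Subject
  attr_reader :name, :salary

  def initialize(name, salary)
    super()              # runs Subject#initialize so @observers exists
    @name = name
    @salary = salary
  end

  def salary=(amount)
    @salary = amount
    notify_observers
  end
end

seen = []
payroll = proc { |emp| seen << "#{emp.name} now earns #{emp.salary}" }

fred = Employee.new("Fred", 30_000)
fred.add_observer(&payroll)    # keep the proc so it can be removed later
fred.salary = 35_000
fred.delete_observer(payroll)
fred.salary = 40_000           # no longer observed

puts seen
```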

Shortly before seeing the book, as a result of that GC thread, I’d written my own variation on this, which lets any object be a subject, by opening up the Object class:

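class Object
  # The observers array is created only when the first observer is added.
  def add_observer(&observer)
    (@observers ||= []) << observer
  end

  def delete_observer(observer)
    observers.delete(observer)
  end

  def notify_observers
    observers.each do |observer|
      observer.call(self)
    end
  end

  private

  # Objects which never gained an observer get an empty list, not a new instance variable.
  def observers
    @observers || []
  end
end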

Because Ruby only adds instance variables on the fly as needed, we get the benefit of universal support for objects to be Subjects without requiring an observers instance variable in those objects which never have observers. The only cost is the potential namespace collision for the four method names.
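The claim about cost can be checked with a short sketch. The relevant methods are repeated from the listing above (delete_observer is omitted for brevity) so the example runs on its own; the variable names plain and watched are made up.

```ruby
class Object
  def add_observer(&observer)
    (@observers ||= []) << observer
  end

  def notify_observers
    observers.each { |observer| observer.call(self) }
  end

  private

  def observers
    @observers || []
  end
end

plain   = Object.new
watched = Object.new
calls   = 0

watched.add_observer { |_subject| calls += 1 }

plain.notify_observers     # harmless: there is nobody to notify
watched.notify_observers

puts plain.instance_variables.inspect
puts watched.instance_variables.inspect
```

Only the object that actually gained an observer carries an @observers variable; notifying the other one did not create it.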

Another Use of Dynamic Instance Variables

Recently I wrote an article for InfoQ about James Golick’s resource_controller plugin for Rails which allows you to write Rails controllers for RESTful resources which can automatically adapt for use in different resource nesting contexts. This plugin makes good use of the dynamic nature of Ruby instance variables, automatically defining different instance variables in the controller to correspond to the end resource and each of its parent resources.
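The general technique can be sketched as follows. This is not the plugin’s code or API; the class NestedResourceContext, its load_resources method, and the sample data are all hypothetical, and only instance_variable_set is a real Ruby method.

```ruby
# Hypothetical sketch: define one instance variable per resource in a nesting chain,
# with the variable names taken from data rather than written into the source.
class NestedResourceContext
  def load_resources(chain)
    chain.each do |name, record|
      instance_variable_set("@#{name}", record)
    end
  end
end

ctx = NestedResourceContext.new
ctx.load_resources({ post: "Post #1", comment: "Comment #7" })

puts ctx.instance_variables.inspect
puts ctx.instance_variable_get(:@comment)
```

The same object could be given a different chain in another nesting context and would simply end up with a different set of instance variables.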

Whew!

This has turned out to be a rather long article, which I’ve been meaning to write for some time. I hope that someone finds it useful, or at least interesting.

