Backyard Hotrodding C++
by Walter Bright
May 23, 2006

Summary
Ever feel the need for speed? Real speed? If you're careful, you can get it without making a mess of things.
Backyard Hot-Rodding C++ Attend a seminar or read any book or article on C++, and there will be presented more or less the "correct" way to write C++ programs (or at least the way the author feels is correct). Certain practices are deprecated as "buggy", "sloppy", "non-portable", "undefined", "illegal" or just plain "bad." We are exhorted not to do such things, at the risk of bugs, being fired, or the ridicule of our peers.

But to take an analogy from drag racing, sometimes all that matters is getting across the finish line first. How you did it does not matter. Certainly, everything and anything has been tried by backyard hotrodders.

What are some things that can be done with C++, that a good backyard hotrodder with greasy overalls and a pack of cigs rolled in his shirtsleeve would do, to get a program speeding over the finish line first?

The rules of this game are simple—to be a backyard C++ hotrodding modification, it's got to make the program faster than the "correct" way.

Warning: use of these techniques is rumored to be illegal in at least 5 states. They may cause you to be fired. C++ gurus will heap contempt upon your head. You'll get an "F" from your professor. Your warranty will be voided.

So let's pop the hood and slip that bottle of nitrous in.

Pointer Hydration

Let's say we have a large, in-memory data structure. We'd like to save this data structure into a disk file, and in some other invocation of the program, read it back in and reconstitute the data structure.

The proper way to do this is to serialize the data structure, object by object, into some disk friendly format. The format is then read, parsed, and object by object, it gets recreated back in memory. Each object needs a serialize function, where one must figure out how to represent each member on disk, and a constructor that can build a replica of that object from the serialized data. When the data structure is non-trivial, with cycles, pointers, shared objects, etc., this is not a simple task. Furthermore, it is inherently slow. Each object must be broken down into its serial representation, and when reconstituting, each object must be allocated, constructed, and put back together.

But the data structure is already in memory, in a sequence of bytes. Couldn't that just be copied to disk, and read back in again? Yes, if we're willing to step outside of defined behavior.

The first thing to do is organize the data structure so it is entirely contained within a known block or sequence of blocks of memory. This means it may not contain any pointers or references to objects outside of itself. That also means no references to string literals, which are stored in the static program data, or any other static data.

Next, create a custom allocator/deallocator for the objects that go into the data structure, so that they are allocated within a known block or sequence of blocks, and not mixed up with other allocations that are not to be saved.

So, those blocks can then just be rolled out directly into disk files, and rolled back in. Right? Nope, because there are pointers. When it's rolled back in, it's probably not going to wind up at the same address, and the pointers will all be pointing to the wrong place.

We need a way to dehydrate the pointers when writing the blocks to disk, and hydrate them (just add water!) when reading a block back in, so they become real pointers again. This can be done by walking the data structure and, for each pointer in each object, converting that pointer from an address into an offset from the start of the block. Each class gets a dehydrate function, which handles all the pointers in its objects.

But there's a problem. If the data structure is cyclic, or has multiple pointers to the same object, we need a way to determine if a pointer has been dehydrated already or not. Just comparing the pointer value to see what range it lies in is not enough, as an offset into the block(s) may overlap the physical addresses the block is mapped into.

The solution lies in noticing that allocators typically allocate data aligned on 2 or 4 byte boundaries (and since we're writing a custom allocator for this, we can ensure this property). That means, for a valid pointer, the least significant bit will always be 0. We can then define a dehydrated pointer as being odd (least significant bit set to 1), and a hydrated pointer as being even. Dehydrating a pointer then becomes:

void ptr_dehydrate(void **p)
{
    if (*p)
       *(uintptr_t *)p |= 1;    // mark as dehydrated by setting the low bit
}
and hydrating it becomes:
int isdehydrated(void *p)
{
    return ((uintptr_t)p & 1) != 0;
}

void ptr_hydrate(void **p)
{
    if (isdehydrated(*p))
       *(char **)p -= ptr_adjust;   // ptr_adjust = (old base + 1) - new base
}
and for a class:
class Foo
{
    void *p;
    Bar *o;

    virtual void dehydrate()
    {
        ptr_dehydrate(&p);
        if (o && !isdehydrated(o))
        {
            o->dehydrate();
            ptr_dehydrate(&o);
        }
    }

    virtual void hydrate()
    {
        ptr_hydrate(&p);
        if (isdehydrated(o))
        {
            ptr_hydrate(&o);    // make o a real pointer before calling through it
            o->hydrate();
        }
    }
};
In practice, this works out really fast. But since it is "bad" C++ code, there are some problem areas that must be avoided or accounted for:
  • Pointers into the middle of strings may already be odd, so this technique won't work.
  • Ensure there are no references to static data or data outside of the block(s).
  • Any objects with virtual functions, or derived from a class with virtual functions, are going to have a vptr member which points to a static vtbl array. These are all part of the implementation-defined areas of any C++ class implementation. It's likely that the vptrs will not be pointing to the right vtbl when the object is reconstituted. The ways to deal with this are to 1) avoid using such classes or 2) jam the right vptr value when the object is being hydrated.
  • The code will have to be adjusted so that the test/set of the least significant bit of the pointer really is the least significant bit.

Vptr Jamming

Vptrs are one of the typical under-the-hood implementation details of a class. Virtual functions are often implemented by adding a hidden member called a vptr which points to a table of functions called the vtbl[]. A virtual function call is performed by using the vptr to find the vtbl[], and then calling the function at a specific index into that vtbl[]. Therefore, the polymorphic behavior of a class object is controlled by where the hidden vptr member is pointing. The vptr member is set when the object is constructed by some hidden code added by the compiler to every constructor for that object's class.

So, by manipulating the vptr ourselves, we can control the behavior of an object, even change its type, without calling a constructor. This technique is called vptr jamming.

For example, consider a collection class which needs to be very fast. Most of the time, it will be used in a single-threaded manner, but sometimes, in a multithreaded manner. The program can go back and forth between single and multithreaded more than once during execution. It's got to run as fast as possible, so in single-threaded mode the time spent to do locks is unaffordable. So, our class might look like:

struct Collection
{
    ... members of the collection ...

    virtual void foo() = 0;
};

struct SingleThreadedCollection : Collection
{
    void foo()
    {
        ... optimized for single threaded ...
    }
};

struct MultiThreadedCollection : Collection
{
    void foo()
    {
        ... synced for multithreaded ...
    }
};
The desire is to switch back and forth between single and multithreaded operations without having to destruct and reconstruct the object—we want to dynamically change the behavior by switching (i.e. jamming) the vptr. Here's how that would look:
struct Collection
{
    ... members of the collection ...

    virtual void foo() = 0;

    void toSingle();
    void toMulti();
};

struct SingleThreadedCollection : Collection
{
    static SingleThreadedCollection tmp;

    void foo()
    {
        ... optimized for single threaded ...
    }
};

struct MultiThreadedCollection : Collection
{
    static MultiThreadedCollection tmp;

    void foo()
    {
        ... synced for multi threaded ...
    }
};

SingleThreadedCollection SingleThreadedCollection::tmp;
MultiThreadedCollection  MultiThreadedCollection::tmp;

void Collection::toSingle()
{
    *(void **)this = *(void **)&SingleThreadedCollection::tmp;
}

void Collection::toMulti()
{
    *(void **)this = *(void **)&MultiThreadedCollection::tmp;
}
The assignment in the toXxxx() functions gets the value of the right vptr from the static temporary tmp created for just that purpose, and jams it into the vptr location in *this. For most compilers, the vptr is at offset 0 of the struct. For the rest, this code will have to be tweaked to account for it.

Naturally, there are problems that must be avoided or accounted for here as well:

  • All the class types which are to be jammed must be of the same size, or at least, the allocated size must be the size of the largest.
  • Different compilers, and even different versions of the same compiler, will store the vptr at different offsets within the object instance. A bit of simple testing and looking at the generated assembler will quickly find it.
  • Vptr jamming assumes that, for each of the class types to be jammed, the vptr will be at the same offset. This is true of all the compilers I tried it on, but there is certainly no guarantee that this is true.

RTTI Sniping

As observed with vptr jamming, the polymorphic behavior of an object instance is controlled by what vtbl[] its vptr is pointing to. It's not a big leap from that to realize that by testing the value of the vptr, the type of the object can be determined.

The usual way to determine the derived type of an object is by doing a dynamic_cast. If the dynamic_cast to a derived class succeeds, it returns a pointer to the derived class. If it fails, it returns NULL. But dynamic_cast is slooooww. For example:

struct A
{
    virtual int foo();
};

struct B : A
{
    int foo();
};

int test(A *a)
{
    return dynamic_cast<B*>(a) != 0;
}
Here's the generated assembly code for the test to see if a is really an instance of B:
mov     EAX,4[ESP]
test    EAX,EAX
je      L24
push    0
push    offset FLAT:___ti?AUB@@
push    offset FLAT:___ti?AUA@@
push    EAX
mov     ECX,[EAX]
push    dword ptr -4[ECX]
call    near ptr ?__rtti_cast@@YAPAXPAX0PBD1H@Z
add     ESP,014h
jmp short       L26
L24:    xor     EAX,EAX
L26:    neg     EAX
sbb     EAX,EAX
neg     EAX
ret
There are lots of instructions being executed, and a function call. It also relies on RTTI being generated for the class, which is bbllooaatt.

If only we could snipe the RTTI and figure out the type directly. If we've got the need for speed, we can do the following:

B tmp;

int test(A *a)
{
    return *(void**)a == *(void**)&tmp;
}
All this does is compare the vptr in a with the vptr in tmp. Most compilers put the vptr first in the class layout, so this will work. When it doesn't, adjust the offsets of a and &tmp to match.

The generated assembler code looks like:

mov     EAX,4[ESP]
mov     ECX,[EAX]
cmp     ECX,?tmp@@3UB@@A
mov     EAX,1
je      L15
xor     EAX,EAX
L15:    ret
Holy hotrod, Batman! That brought the test for the type down to two instructions. We can even do slightly better. The Digital Mars C++ compiler has special support for RTTI sniping with the __istype pseudo member function:
int test(A *a)
{
    return a->__istype(B) != 0;
}
and we're down to one instruction:
mov     EAX,4[ESP]
cmp     dword ptr [EAX],offset FLAT:??_QB@@6B@[4]
mov     EAX,1
je      L13
xor     EAX,EAX
L13:    ret
The obvious question is, why doesn't dynamic_cast produce the short, fast code? The answer is that RTTI sniping only works if the class type being tested for is the most derived class in the class hierarchy (because that determines the vtbl[]), whereas dynamic_cast needs to work for any derived class.

Once again, there are problems with RTTI sniping:

  • Some implementations are able to share vtbl[]s between classes, so the vptrs for class B and class A, where B is derived from A, point to the same value. This clever compiler optimization must be defeated; sometimes that can be done via a switch to "turn on RTTI", or by some other switch. Worst case, avoid using RTTI sniping between classes derived from one another.
  • Some (generally older) compilers can generate multiple vtbl[]s for the same class, so that two vptrs can hold different addresses but still refer to the same type. Fortunately, such compilers are rarely used these days. But the problem can still crop up if one DLL generates one copy of the vtbl[] while another DLL generates another. The moral is to have all the constructors for a particular object implemented in one source file, and have that be in only one DLL, not many.

The Counterfeit this

The two conventional methods for hiding the implementation of a class are:
  1. Declare the implementation as private:
    #include "implementation.h"
    
    class Foo
    {
    private:
        ... the implementation ...
    
    // the interface
    public:
        void bar()
        {
            ... manipulate the implementation ...
        }
    };
    
    The trouble with this, of course, is that the implementation is still there with its bare face hanging out, and in order to compile it, every irrelevant thing that the implementation needs has to be in scope, too.
  2. Use the PIMPL idiom as described by Herb Sutter, where the class contains a pointer to the implementation of the class:
    // User sees this class definition
    
    class Implementation;       // stub definition
    
    class Foo
    {
    private:
        Implementation *pimpl;
    
    // the interface
    public:
        Foo();
    
        void bar();
    };
    
    // Separate, hidden version of Foo
    
    #include "implementation.h"
    
    Foo::Foo() : pimpl(new Implementation())
    {
    }
    
    void Foo::bar()
    {
        pimpl->bar();
    }
    
This succeeds in hiding the implementation details, at the cost of another layer of allocation and an extra object instance.

But there's a way to hide the implementation completely without having an extra object. The idea is to counterfeit the this pointer, so that the user thinks it is one type, but the implementation knows it is another:

// User sees this class definition
class Foo
{
// the interface
public:
    static Foo *factory(); // create and initialize an instance

    void bar();
};

// Separate, hidden version of Foo

#include "implementation.h"

Foo *Foo::factory()
{
    return reinterpret_cast<Foo *>(new Implementation());
}

void Foo::bar()
{
    reinterpret_cast<Implementation *>(this)->bar();
}
The reinterpret_cast is doing the dirty work of counterfeiting the type of the object from Implementation to Foo and back again.

Caveats:

  • The class Foo cannot have any data members, even hidden ones like a vptr. Therefore, it cannot have any virtual functions.
  • Foo cannot have any constructors, because we aren't constructing a real Foo, only a counterfeit one.

Conclusion

C++ provides great tools for under-the-hood optimization, but their use is uniformly and actively discouraged. If you're willing to accept that you're going outside all recommended practice, there are some neat things to be done to hotrod your C++ application.

These techniques are also applicable to the D programming language[1].

Sometimes, you just feel the need for speed.

Notes and References

  1. http://www.digitalmars.com/d/index.html.
