As I explained yesterday, the caml_modify write barrier needed by OCaml's incremental + generational GC is the main reason why immutable data can often be faster than mutable state. I've played a little with it and achieved a >50% speed boost in synthetic benchmarks, which could map to up to two-digit percentage gains in actual programs (in particular, it can make Array.map, init, make and the likes over 20% faster). The mutable state part from the synthetic business entity update benchmark becomes 32% faster.
How the write barrier works
caml_modify needs to record the references from values in the major heap to the minor heap. The code is straightforward:
fp is the block slot where the (suspected) pointer is to be stored. caml_modify first checks if the destination is in the major heap with the Is_in_heap macro I'll come back to in a minute. Is_block(v) is true when v is not an immediate value. There's an interesting piece of logic that tries to avoid recording the same reference twice: if the previous reference _old_ held in fp was to a block in the minor heap, we know it's been already added to the ref table and there's no need to do it again. In the worst case, when updating N times the same slot with values in the minor and the major heap alternatingly, the reference will be recorded N/2 times --- it seems very unlikely this would happen in practice, though, as the ref table is cleared on each minor GC run.
The main problem with the above routine is that Is_in_heap is fairly slow. On 64-bit architectures it involves a lookup in a sparse hash table to see if the corresponding page is being managed by OCaml (on 32 bit, a two-level array is used). The reason why this check cannot just be implemented as a negated "is it in the minor heap" test (two pointer comparisons at most) is that it is possible to have blocks which belong neither to the minor heap nor to the major one --- iow., memory areas not managed by OCaml at all, and which are not to be traversed by the GC.
When we're not in the Phase_mark GC phase, the order of the checks can be altered so the expensive Is_in_heap test is only performed when the the value to be stored resides in the minor heap and the old one didn't before. This way, when the same slot is being updated several times, Is_in_heap will only be used once (with the exception explained above).
The resulting code, albeit quite ugly, is considerably faster. I measure the gain with two trivial programs: