Caching has become the primary way to scale data-intensive enterprise applications. When objects are distributed across caches, classes should define consistent hash functions, writes Tom White in a recent article explaining the benefits of consistent hashing.
Most distributed caching frameworks are based on some form of mapping of an object to a cache node. Such mappings are performed by a hash function. To understand how to implement efficient caching, it is important to grasp the benefits of consistent hashing, argues Tom White in a recent article, Consistent Hashing.
While such hashing algorithms are likely implemented as part of a client of a caching library, such as memcached, developers still need to understand how to write consistent hash functions for their objects, writes White:
Consistent hashing is needed to avoid swamping your servers... The need for consistent hashing arose from limitations experienced while running collections of caching machines—web caches, for example. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n.
This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location. This can be catastrophic since the originating content servers are swamped with requests from the cache machines. It's as if the cache suddenly disappeared. Which it has, in a sense.
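The scale of this remapping is easy to see empirically. The sketch below (class and method names are mine, not from White's article) counts how many of 10,000 keys change nodes under the `hash(o) mod n` scheme when a fifth cache machine is added to a cluster of four:

```java
// Illustrates the remapping problem described above: with hash(o) mod n,
// growing the cluster from 4 to 5 nodes reassigns most keys.
public class ModRemapDemo {
    public static int node(String key, int n) {
        // Math.floorMod avoids a negative index when hashCode is negative
        return Math.floorMod(key.hashCode(), n);
    }

    public static void main(String[] args) {
        int total = 10_000, moved = 0;
        for (int i = 0; i < total; i++) {
            String key = "object-" + i;
            if (node(key, 4) != node(key, 5)) {
                moved++;
            }
        }
        System.out.printf("%d of %d keys moved%n", moved, total);
    }
}
```

Roughly four out of five keys end up on a different node, which is why the originating servers see a flood of cache misses.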
Consistent hashing is used to avoid that sort of situation, and to aid in the efficient distribution of objects to caches:
Consistently maps objects to the same cache machine, as far as is possible, at least... The hash function actually maps objects and caches to a number range. This should be familiar to every Java programmer - the hashCode method on Object returns an int. For consistent hashing to be effective, it is important to have a hash function that mixes well. Most implementations of Object's hashCode do not mix well - for example, they typically produce a restricted number of small integer values... MD5 hashes are recommended here.
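White's article walks through a Java implementation of this idea; the sketch below is my own minimal version of the same technique, not his code. Caches and objects are both hashed onto an integer circle (here via the first four bytes of an MD5 digest, per the quote above), each cache is placed at several "virtual" points for smoother balance, and an object belongs to the first cache found clockwise from its point:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: adding or removing a cache only
// moves the keys adjacent to that cache's points on the circle.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual nodes per cache

    public ConsistentHashRing(int replicas) {
        this.replicas = replicas;
    }

    // MD5 mixes well, unlike many hashCode implementations
    private static int md5Point(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                 | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available
        }
    }

    public void addCache(String cache) {
        for (int i = 0; i < replicas; i++)
            ring.put(md5Point(cache + "#" + i), cache);
    }

    public void removeCache(String cache) {
        for (int i = 0; i < replicas; i++)
            ring.remove(md5Point(cache + "#" + i));
    }

    public String cacheFor(String object) {
        if (ring.isEmpty()) throw new IllegalStateException("no caches");
        SortedMap<Integer, String> tail = ring.tailMap(md5Point(object));
        // wrap around the circle if we fall past the last point
        int point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(point);
    }
}
```

The key property: removing one cache leaves every object owned by the surviving caches exactly where it was, because the nearest clockwise point for those objects is unchanged.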
In the article, White discusses a popular algorithm and its Java implementation to provide consistent hashing for objects.
What sort of guidelines do you follow when defining the hashCode() methods of your classes?
I've long followed the approach that IIRC was recommended by Josh Bloch in "Effective Java". That is, when multiple fields are part of the identity you use an arbitrary, distinct and 'reasonably large' prime number to multiply each field's hash (after the first field) and sum the results. For example:
return x + (37 * y);
The idea being that (1,2) and (2,1) will now produce different hashes, and because we used a prime number, there are fewer combinations that collide.
From the article, I think this creates hashes that "mix well". Is that correct?
On the face of it, I think this article is confusing because - as far as my little brain can tell - the "mod n" step (for e.g. web caching) has very little to do with the "hash(o)" step (for any call to Object.hashCode(), be it for web caching or just toString() or logging). Fixing your hash(o) isn't going to do anything if you don't fix your "mod n". Which one is the main point under discussion?
> I've long followed the approach that IIRC was recommended by Josh Bloch in "Effective Java". That is, when multiple fields are part of the identity you use an arbitrary, distinct and 'reasonably large' prime number to multiply each field's hash (after the first field) and sum the results.
This works well for creating a good distribution of objects into a hashed collection, but there is a danger with it as well. If any of the values of the object which contribute to the hash code change once the object is in the collection, that object will get "lost" - it will still be there, but in a different bucket than it should be, so the collection will not be able to find it.
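This danger is easy to reproduce. In the sketch below (class names are mine), a key is mutated after insertion into a `HashSet`; lookups then fail even though the object is still stored:

```java
import java.util.HashSet;
import java.util.Set;

// Shows the danger described above: mutating a field that feeds hashCode
// "loses" the object in a HashSet, even though it is still stored there.
public class LostObjectDemo {
    static final class MutableKey {
        int value;

        MutableKey(int value) { this.value = value; }

        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).value == value;
        }

        @Override public int hashCode() { return value; }
    }

    public static void main(String[] args) {
        Set<MutableKey> set = new HashSet<>();
        MutableKey k = new MutableKey(1);
        set.add(k);
        k.value = 2; // hash changes after insertion

        System.out.println(set.contains(k)); // false: probes the wrong bucket
        System.out.println(set.size());      // 1: the object is still in there
    }
}
```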
> This works well for creating a good distribution of objects into a hashed collection, but there is a danger with it as well. If any of the values of the object which contribute to the hash code change once the object is in the collection, that object will get "lost" - it will still be there, but in a different bucket than it should be, so the collection will not be able to find it.
Yes, I am aware of this problem. It's only true if you use mutable values in the hash. It seemed to me from the article that this would be really hard to deal with in the context of distributed hashing, so I assumed that the objects would need to have static identities.
I suppose you could use only an immutable subset of the values used in equals in your hash assuming there are any. Another solution I've used in the past is to make the collection an observer on the object and move it to the correct bucket automatically when the hashcode changes. Perhaps something similar could be done here.
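The immutable-subset idea might look like this (the `User` class and its fields are purely illustrative): only the final field feeds `hashCode`, so the hash can never change after insertion, while `equals` may still consult the mutable field:

```java
// Hashing only an immutable subset of the identity fields: the hash
// cannot change after insertion, so the object is never "lost".
public final class User {
    private final String id;  // immutable: safe to hash
    private String email;     // mutable: deliberately left out of hashCode

    public User(String id, String email) {
        this.id = id;
        this.email = email;
    }

    public void setEmail(String email) { this.email = email; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof User)) return false;
        User u = (User) o;
        // equal objects have equal ids, so equal hashes still holds
        return id.equals(u.id) && java.util.Objects.equals(email, u.email);
    }

    @Override public int hashCode() { return id.hashCode(); }
}
```

The distribution is only as good as the immutable fields allow, but a `HashSet` keeps finding the object even after `setEmail` is called.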
> On the face of it, I think this article is confusing because - as far as my little brain can tell - the "mod n" step (for e.g. web caching) has very little to do with the "hash(o)" step (for any call to Object.hashCode(), be it for web caching or just toString() or logging). Fixing your hash(o) isn't going to do anything if you don't fix your "mod n". Which one is the main point under discussion?
I think the main point is how the objects are assigned. The side point about hashcodes is, from what I can tell, made because the 'circular' approach to distribution will perform worse than a mod approach for poorly distributed hashes.
Consistent hashing, while a step forward from the hashing performed by Memcached (etc.), is not the state of the art. Dynamic partitioning provides all of the benefits of consistent hashing, yet allows a smooth, lossless transition from one cluster view (a specific set of members) to another, with no loss of data.