Hi Frank,
> I would be interested in how others handle heavily
> I/O-bound jobs in a clustered environment: What if most
> of your queries, for instance, result in cache misses and
> need to hit the disk?
I'd love to give you a simple answer, but the truth is that there are so many different possible answers depending on the scenario. For example, for applications that don't have to show transactionally up-to-date data, it's far better to cache common query result sets (something you can easily do without any commercial product and without any clustering).
Products like Oracle Coherence or Terracotta come into play when you have to coordinate among multiple servers, e.g. app servers. Coherence allows the app servers to talk directly to each other (peer to peer) in order to coordinate very efficiently. In Terracotta's case, each app server establishes a client/server connection to a separate "Terracotta server" (which uses Oracle Berkeley DB for its disk storage and I/O ;-), and thus those app servers all have to do the coordination through that central server.
> Also, in this interview, Ari's example was a 100GB
> database.
Depending on your budget, a 100GB database is easily cacheable. On the other hand, depending on your load, it may run fine without much caching (or tuning ;-) at all ....
> In another Artima interview, Gil Tene from Azul suggests
> that you load your 100GB dataset into the memory of a
> single JVM, which can be accomplished using their special-
> purpose JVM.
Yes, it's pretty interesting technology, and there are certain applications which basically require Azul because they have very large heaps and cannot tolerate long GCs.
> 1. How would you approach this problem if you had a TB
> dataset? And if you had petabyte of data?
Terabyte data sets can fit in memory using data grid technology (e.g. Coherence), but that will likely be an expensive choice (about 35 servers, including full redundancy), so you'd only do that if the benefits of having a terabyte in-memory data grid were worth it. Banks use this type of architecture for systems that have lots of data and have to provide fast results (e.g. risk calcs, end of day processing, trading systems, etc.)
There are many other ways to deal with TB data sets without putting it all into RAM, but obviously I spend most of my time helping customers put everything in RAM, so I'm hardly the database expert. The real question is, does the application have a data access pattern that is so scattered and so heavy that it all has to be in RAM? If not, then a database (likely coupled with a cache in front of the database) should likely be able to handle the load fairly easily.
I'll have to skip the petabyte question, because it's simply outside of my area of expertise. While I've worked with really big databases before, they often used different approaches for OLTP (e.g. Oracle) versus OLAP, and it seemed like there was still quite a bit of technical specialization and point products used to make these systems work well.
> Loading an entire dataset in RAM doesn't seem to take
> advantage of memory hierarchies, but simply pushes the
> problem into the most expensive type of memory. What
> would be the role of flash memory, for instance, in
> this sort of application (which is cheaper than RAM)?
Flash memory seems to be an effective way to speed up databases, that's for sure. I'm not sure why it isn't more widely used at this point ....
> 2. If the data is in an RDBMS, most RDBMS products are
> pretty good at caching in memory lots of data already
Yes, but that's far different than the concept of a distributed cache. In a distributed cache, for example, when you ask for an object, you probably get it 99.9% of the time locally, and only have to go over the network .1% of the time to ask for it, and then you have it. With a database cache, when you ask for the same object, it goes to an ORM, which issues a query or three, which goes to the database, which pulls data from cache, which sends it back to the ORM, which glues together objects from it. The number of steps that simply don't exist in an application that makes effective use of distributed caching is mind-boggling, so even if the database caches everything and the cache has absolutely zero latency, you will often still lose (compared to a distributed cache) if you have to go to the database.
> Currently, what are the rules of thumb in terms of when
> to consider caching data external to the database (what
> datasize), etc)?
It's not about size, it's about whether the application can cache the data at all. If the application isn't able to cache it for security, transactional, or whatever reason, then the rule of thumb is to not cache ;-)
If the application can cache, then it should, because (just like layered caches in hardware) the closer you can cache the data to the processing (where the data is being used), the better the performance and scalability.
Regarding size, it really depends on the use case. Generic caching often begins to show a benefit with small cache sizes, and increases its effectiveness as the cache size grows up to the size of the entire data set (typically with diminishing returns, of course). It's a cost/benefit analysis at that point.
On the other hand, some use cases require the entire data set to be in memory in order to work -- things like analytics or querying from the cache. In these cases, the question is whether a cache-based architecture makes sense at all (as described above).
Note that since I work in a relatively small, high-end niche (data grids) with customers that all need to cache basically everything in RAM, my sample set is badly skewed.
I hope that helps to answer your questions, although I realize that it's not a cut and dry answer. IMHO, anyone who tries to pitch a one-size-fits-all solution for this problem domain is either an idiot or a liar, and there are at least a few candidates ;-)
Peace,
Cameron Purdy } Oracle Coherence
http://coherence.oracle.com/