Angle Brackets and Curly Braces
Database Partitioning for Scale
by Bill Venners
May 1, 2006

Summary
When you design a system, you have to decide how many compromises to make in the name of future scalability. One scalability strategy is to partition data into multiple databases. When do you think the tradeoffs involved are worth the scalability gained?

Tony Hoare and Donald Knuth said that premature optimization is the root of all evil. Nevertheless, not all early optimization is premature. Although early optimization in the small is usually premature, in the large it is often prudent. For example, marking a Java method final merely because you think that will make your program run faster (optimization in the small) is premature. But designing a chunky rather than a chatty protocol between the distributed components of your system (optimization in the large) may be entirely appropriate. If a performance problem is later discovered that requires optimization in the small, you can solve it by iteratively profiling to identify bottlenecks, making in-the-small optimizations, and testing them again with the profiler to make sure the changes actually help. If, on the other hand, you need to do major architectural surgery to solve the performance problem, the process can be far more expensive in time and money.

For similar reasons, worrying about scalability up front at the architectural level is often prudent. You can easily go overboard with the scalability concern, however, because many systems will never need to scale. Sometimes it is preferable to get the system out the door quickly on one server to start getting real-world feedback, and deal with scaling later if you actually need to. Nevertheless, if you think there's a good chance you'll need to scale up eventually, worrying about how you will do so early on can make it much faster and cheaper to do so later, if that day does indeed arrive.

Database partitioning

In our new architecture at Artima, we have thought a lot about scalability, and one of our strategies is database partitioning. To the extent possible, we wanted to avoid having everything in one database. The system we are designing is a network of sites, and we felt that it would be prudent to make it possible for each site to be in its own database. On the other hand, we didn't want to be forced to put every site in a different database, because we were concerned that more databases would be more complicated to manage than fewer databases. As a result, we developed the concept of a "sites" database that can host the data for one to many sites. We can deploy multiple sites databases, each of which will be connected to a cluster of application servers that will host all the sites in that database.

We have been using Postgres as our database at Artima for many years, and would like to continue to do so. We believe that one benefit of a partitioned database architecture is that we won't have to migrate everything over to a commercial database, such as Oracle, to help us scale up. If a particular sites database becomes overwhelmed with load, we can divide it into two databases, putting half of its sites into each new database—somewhat like a cell dividing.

Another benefit that we see of our partitioned database architecture is that if a database goes down, the entire network of sites doesn't go down. Only those sites hosted by that database go down. We would like eventually to set up some kind of database mirroring, so that if a database goes down, the mirror would take over shortly thereafter. But until that day, and even after that day in a catastrophic failure when both a database and its mirror fails, we'll only have a partial outage.

We weren't able to put everything into the sites databases, however, because some data is shared among all sites. In particular, we want to provide single-sign on, such that one Artima ID will work at all sites. Therefore, the user data is shared among all sites. As a result, we have a "shared" database that contains this shared data. The shared database is currently a single point of failure. If it goes down, the entire network of sites will go down. One way to improve the situation would be to replicate the user data into each sites database, such that if the shared database goes down, only the ability to update the user data would go away across the network. Anything that only required reading the user data would still work.

Design tradeoffs

Although database partitioning does provide scalability advantages, it has some drawbacks as well. First, as I mentioned previously, more databases is more databases to maintain, upgrade, back up, etc. Another drawback to partitioning databases is that because we want to avoid distributed transactions (because of their drag on scalability), we lose some ability to use the database to enforce data integrity constraints. For example, given that users are in the shared table, we can't add a foreign key constraint to the author's user ID field of a forum post in the sites table. (Were we to replicate read-only user data to each sites database, on the other hand, we could apply the constraint.)

Another example is back links. If site B publishes a translation of an article previously published in site A, we'd like to place a link in B that points to the article in A, and a link in A that points to the article in B. If both A and B were in the same database, then we could write both ends of the link in one transaction. Given that A and B can be in different databases, however, we either need to perform these writes under a distributed transaction, or write the back link asynchronously by sending a message, and accept that occasionally the back link may not get written. Because we feel scalability trumps integrity in this case, we opted for the latter approach. Back links will be written and modified by sending asynchronous messages, some of which may occasionally not arrive.

Another cost of the database partitioning approach is complexity, because if you want to avoid distributed transations, updating multiple databases is more complicated than updating a single one. A good example of this extra complexity is the writing and updating of back links. Instead of simply updating a back link directly in the same database, we must design a messaging system that allows one site to send a message to another.

Premature or prudent?

Are the manangement, integrity, and complexity tradeoffs I mentioned worth the scalability benefits? If our network needs to scale up, then I'd say yes. If not, then no. It is too soon to say. As developers, we need to make investment decisions based on our experience and the information we have at the time. I decided to take on the tradeoffs up front so that if and when the time comes that a database becomes overwhelmed, we'll have database partitioning as one tool in our scalability arsenal.

How much do you worry about scalability up front? When do you think it is and is not appropriate to complicate your design to carve out paths for future scalability? In particular, what is your opinion on database partitioning as a scalability strategy?

Talk Back!

Have an opinion? Readers have already posted 15 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Bill Venners adds a new entry to his weblog, subscribe to his RSS feed.

Digg |

del.icio.us |

About the Blogger

Bill Venners is president of Artima, Inc., publisher of Artima Developer (www.artima.com). He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Active in the Jini Community since its inception, Bill led the Jini Community's ServiceUI project, whose ServiceUI API became the de facto standard way to associate user interfaces to Jini services. Bill is also the lead developer and designer of ScalaTest, an open source testing tool for Scala and Java developers, and coauthor with Martin Odersky and Lex Spoon of the book, Programming in Scala.


	Web Artima.com