The Artima Developer Community
Weblogs Forum
Database Partitioning for Scale

15 replies on 2 pages. Most recent reply: Nov 4, 2013 5:18 PM by Keren Blau

Bill Venners

Posts: 2248
Nickname: bv
Registered: Jan, 2002

Database Partitioning for Scale
Posted: May 1, 2006 6:59 PM
Summary
When you design a system, you have to decide how many compromises to make in the name of future scalability. One scalability strategy is to partition data into multiple databases. When do you think the tradeoffs involved are worth the scalability gained?

Tony Hoare and Donald Knuth said that premature optimization is the root of all evil. Nevertheless, not all early optimization is premature. Although early optimization in the small is usually premature, in the large it is often prudent. For example, marking a Java method final merely because you think that will make your program run faster (optimization in the small) is premature. But designing a chunky rather than a chatty protocol between the distributed components of your system (optimization in the large) may be entirely appropriate. If a performance problem is later discovered that requires optimization in the small, you can solve it by iteratively profiling to identify bottlenecks, making in-the-small optimizations, and testing them again with the profiler to make sure the changes actually help. If, on the other hand, you need to do major architectural surgery to solve the performance problem, the process can be far more expensive in time and money.
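The chunky-versus-chatty distinction can be made concrete with a toy sketch (the remote service and profile fields here are hypothetical, purely for illustration): one round trip per field versus one round trip for the whole record.

```python
class CountingRemote:
    """Stub remote component that counts round trips. A stand-in for a
    real distributed service; the profile data is made up."""
    def __init__(self):
        self.calls = 0
        self._profile = {"name": "bv", "email": "bv@example.com", "bio": "editor"}

    def get_field(self, field):
        # Chatty shape: the client asks for each field separately.
        self.calls += 1
        return self._profile[field]

    def get_profile(self):
        # Chunky shape: everything comes back in a single call.
        self.calls += 1
        return dict(self._profile)

chatty = CountingRemote()
profile1 = {f: chatty.get_field(f) for f in ("name", "email", "bio")}

chunky = CountingRemote()
profile2 = chunky.get_profile()

# Same data either way, but the chatty protocol cost three round trips
# to the chunky protocol's one -- a difference that is architectural,
# not something a profiler-driven micro-optimization can later remove.
```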

For similar reasons, worrying about scalability up front at the architectural level is often prudent. You can easily go overboard with the scalability concern, however, because many systems will never need to scale. Sometimes it is preferable to get the system out the door quickly on one server to start getting real-world feedback, and deal with scaling later if you actually need to. Nevertheless, if you think there's a good chance you'll need to scale up eventually, worrying about how you will do so early on can make it much faster and cheaper to do so later, if that day does indeed arrive.

Database partitioning

In our new architecture at Artima, we have thought a lot about scalability, and one of our strategies is database partitioning. To the extent possible, we wanted to avoid having everything in one database. The system we are designing is a network of sites, and we felt that it would be prudent to make it possible for each site to be in its own database. On the other hand, we didn't want to be forced to put every site in a different database, because we were concerned that more databases would be more complicated to manage than fewer databases. As a result, we developed the concept of a "sites" database that can host the data for one to many sites. We can deploy multiple sites databases, each of which will be connected to a cluster of application servers that will host all the sites in that database.
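A minimal sketch of the idea, with hypothetical site names and connection strings (not Artima's actual configuration): the application resolves a site to the sites database that hosts it, so sites can later be rehosted by editing a mapping rather than the code.

```python
# Hypothetical mapping from each site to the "sites" database hosting it.
SITE_TO_DB = {
    "scala-zone": "sites_db_1",
    "java-zone":  "sites_db_1",
    "ruby-zone":  "sites_db_2",
}

# Placeholder connection strings, one per sites database.
DB_DSN = {
    "sites_db_1": "postgresql://app@db1/sites1",
    "sites_db_2": "postgresql://app@db2/sites2",
}

def dsn_for_site(site_id):
    """Resolve a site to the connection string of its sites database."""
    return DB_DSN[SITE_TO_DB[site_id]]

def sites_hosted_by(db_name):
    """All sites living in one database (useful when splitting it later)."""
    return sorted(s for s, db in SITE_TO_DB.items() if db == db_name)
```

Each application-server cluster would then hold a connection pool per sites database it serves, keyed the same way.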

We have been using Postgres as our database at Artima for many years, and would like to continue to do so. We believe that one benefit of a partitioned database architecture is that we won't have to migrate everything over to a commercial database, such as Oracle, to help us scale up. If a particular sites database becomes overwhelmed with load, we can divide it into two databases, putting half of its sites into each new database—somewhat like a cell dividing.
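The cell-division step can be sketched as little more than repartitioning a site-to-database mapping (hypothetical names throughout; the actual data copy, e.g. via pg_dump and restore, is outside this sketch):

```python
def split_database(site_to_db, overloaded_db, new_db):
    """Divide an overloaded sites database in two, like a cell dividing:
    half of its sites are reassigned to a freshly provisioned database."""
    hosted = sorted(s for s, db in site_to_db.items() if db == overloaded_db)
    moving = hosted[len(hosted) // 2:]      # the second half moves out
    updated = dict(site_to_db)
    for site in moving:
        updated[site] = new_db
    return updated

mapping = {"a": "db1", "b": "db1", "c": "db1", "d": "db1"}
mapping = split_database(mapping, "db1", "db2")
# Two sites now live in each database.
```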

Another benefit we see in our partitioned database architecture is that if a database goes down, the entire network of sites doesn't go down; only those sites hosted by that database do. We would eventually like to set up some kind of database mirroring, so that if a database goes down, its mirror would take over shortly thereafter. But until that day, and even after it in the catastrophic case where both a database and its mirror fail, we'll only have a partial outage.

We weren't able to put everything into the sites databases, however, because some data is shared among all sites. In particular, we want to provide single sign-on, such that one Artima ID will work at all sites. Therefore, the user data is shared among all sites, and we have a "shared" database that contains this shared data. The shared database is currently a single point of failure: if it goes down, the entire network of sites goes down. One way to improve the situation would be to replicate the user data into each sites database, such that if the shared database went down, only the ability to update user data would be lost across the network. Anything that only required reading the user data would still work.
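The replication idea can be sketched like this (the storage callables are hypothetical stand-ins, not Artima's code): reads of user data fall back to the read-only copy in the local sites database when the shared database is unreachable, while updates would simply fail during the outage.

```python
class SharedDbUnavailable(Exception):
    """Raised when the shared database cannot be reached."""

def read_user(user_id, read_shared, read_local_replica):
    """Prefer the authoritative shared database; if it is down, serve
    the (possibly slightly stale) replica in this sites database."""
    try:
        return read_shared(user_id)
    except SharedDbUnavailable:
        return read_local_replica(user_id)

# Simulated outage: the shared database raises, the replica still answers.
def shared_down(user_id):
    raise SharedDbUnavailable()

replica = {42: {"name": "bv"}}
user = read_user(42, shared_down, replica.get)
```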

Design tradeoffs

Although database partitioning does provide scalability advantages, it has some drawbacks as well. First, as I mentioned previously, more databases means more databases to maintain, upgrade, back up, and so on. Another drawback is that because we want to avoid distributed transactions (because of their drag on scalability), we lose some ability to use the database to enforce data integrity constraints. For example, given that users live in the shared database, we can't add a foreign key constraint to the author's user ID field of a forum post in a sites database. (Were we to replicate read-only user data to each sites database, on the other hand, we could apply the constraint.)
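Without a cross-database foreign key, the check has to move into the application. A sketch of the weaker, best-effort guarantee (the data structures are illustrative, not Artima's schema):

```python
def insert_forum_post(sites_posts, known_user_ids, author_id, body):
    """The sites database cannot declare FOREIGN KEY (author_id)
    REFERENCES users, because users live in the shared database.
    So the application validates the reference itself before inserting.
    This is best-effort only: the user could in principle disappear
    between the check and the insert."""
    if author_id not in known_user_ids:
        raise ValueError("author %r is not a known user" % author_id)
    sites_posts.append({"author_id": author_id, "body": body})

posts = []
insert_forum_post(posts, {1, 2, 3}, 2, "Nice article!")
```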

Another example is back links. If site B publishes a translation of an article previously published in site A, we'd like to place a link in B that points to the article in A, and a link in A that points to the article in B. If both A and B were in the same database, then we could write both ends of the link in one transaction. Given that A and B can be in different databases, however, we either need to perform these writes under a distributed transaction, or write the back link asynchronously by sending a message, and accept that occasionally the back link may not get written. Because we feel scalability trumps integrity in this case, we opted for the latter approach. Back links will be written and modified by sending asynchronous messages, some of which may occasionally not arrive.

Another cost of the database partitioning approach is complexity, because if you want to avoid distributed transactions, updating multiple databases is more complicated than updating a single one. A good example of this extra complexity is the writing and updating of back links. Instead of simply updating a back link directly in the same database, we must design a messaging system that allows one site to send a message to another.
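A toy sketch of that messaging design (the queue and naming are hypothetical): the forward link is written in a local transaction, and the back link is requested via an asynchronous message, so a dropped message costs only a missing back link, never a corrupted transaction.

```python
import queue

class BacklinkMessenger:
    """Asynchronous back-link writes between sites databases.
    Delivery is best-effort: a dropped message leaves a back link
    missing, to be repaired later (e.g. by a crawler)."""
    def __init__(self):
        self.outbox = queue.Queue()
        self.backlinks = {}   # target article -> articles linking to it

    def request_backlink(self, target_article, source_article):
        # Called on site B's side after writing the forward link B -> A
        # in B's own local transaction; no distributed transaction.
        self.outbox.put((target_article, source_article))

    def deliver_pending(self):
        # A consumer running against site A's database applies messages.
        while not self.outbox.empty():
            target, source = self.outbox.get()
            self.backlinks.setdefault(target, []).append(source)

m = BacklinkMessenger()
m.request_backlink("siteA/article", "siteB/translation")
m.deliver_pending()
```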

Premature or prudent?

Are the management, integrity, and complexity tradeoffs I mentioned worth the scalability benefits? If our network needs to scale up, then I'd say yes. If not, then no. It is too soon to say. As developers, we need to make investment decisions based on our experience and the information we have at the time. I decided to take on the tradeoffs up front so that if and when the time comes that a database becomes overwhelmed, we'll have database partitioning as one tool in our scalability arsenal.

How much do you worry about scalability up front? When do you think it is and is not appropriate to complicate your design to carve out paths for future scalability? In particular, what is your opinion on database partitioning as a scalability strategy?


Carsten Saager

Posts: 17
Nickname: csar
Registered: Apr, 2006

Re: Database Partitioning for Scale Posted: May 2, 2006 6:19 AM
Any optimization relies on exploiting invariants; an optimization is premature if the invariant is merely assumed rather than established.

Making a method final because you think the method will never need to be overridden is premature, but making it final because you want it to be safe to call from the constructor is prudent.

What you describe is more the clash between experienced architects and everything-can-be-done-agile architects. The first group has already built enterprise systems and has experienced the typical problems when these systems evolve and the load and requirements surge. The second group believes that everything can be done in incremental steps, and they are right 80% of the time for 80% of the system - do the math: that leaves a 36% chance that it does not work out.

The pivot is what you called the "design tradeoff". At some point you have to give up an assumption among your invariants (#DB==1) that will hit the codebase as a whole. Such a severe change usually breaks your neck unless you have planned for it; otherwise you'll have delays, quality problems, and so on.

When you know your domain, you can foresee common problems, and it is simply stupid not to include them in the design. This is sometimes done as a (premature) optimization because the team would not fully understand the architecture otherwise, although the actual effort of implementing it is smaller than one might think; more importantly, the workload is more likely to be constant over time, so the project stays manageable.

Dinesh Bhat

Posts: 5
Nickname: dinesh
Registered: Nov, 2002

Re: Database Partitioning for Scale Posted: May 2, 2006 10:28 AM
Optimization-related issues have always been complicated, and lately have become more so with the emergence of agile methodologies (or rather their incorrect interpretation) and the attendant tendency to code first, think later, and claim that the result is the simplest solution.

Premature optimization, I feel, is just another facet of over-engineering/over-designing. It increases the cost of maintenance.
Having said that, I think it applies only at the implementation level. Performance and scalability have to be thought through at the architectural level in the same way we think about security, data integrity, and other non-functional requirements. The technologies and implementation needed cannot be an afterthought, as they are very expensive to change later in the development cycle (architectural impact). Also, policies and coding standards have to be put in place to make sure that bad practices cannot silently creep in and eventually blow up the application.

For example, I would like to ensure that every transaction initiated in the middle tier makes only one database call (there are exceptions to this rule), using stored procedures if necessary to accomplish this; multiple calls to the database hurt performance.
Another obvious rule is to use non-synchronized collections wherever concurrency is not needed.
My experience is that these bad practices can build up, but it is also true that it is (relatively) easy to correct such issues later in the development cycle.
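The one-database-call rule of thumb above can be illustrated with SQLite standing in for the middle tier's database (a sketch under that assumption, not the poster's actual setup): N row inserts batched into a single executemany round trip inside one transaction, instead of N separate calls.

```python
import sqlite3

def save_line_items_batched(conn, items):
    """One batched call per transaction rather than one call per row --
    the chunky, not chatty, shape of database access."""
    with conn:  # a single transaction, committed on exit
        conn.executemany(
            "INSERT INTO line_item(sku, qty) VALUES (?, ?)", items)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_item(sku TEXT, qty INTEGER)")
save_line_items_batched(conn, [("widget", 1), ("gadget", 2)])
```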

Regarding database partitioning, I have always felt that vertical scaling, though expensive in the short term, is cheaper in the long term. Horizontal scaling involves a lot of maintenance and is inefficient performance-wise. Of course, we get better availability as a result.
A typical example in the earlier days of J2EE was to use the database as a data dump, implement all the logic in the middle tier, and then rely on the panacea of horizontal scaling; as it turned out, that did not work. To the rescue emerged (clustered) caching solutions, and now there are indexing solutions for the caches. I get the feeling that they are trying to use another database-like system to cache a database.

I think about horizontal partitioning only in terms of availability, if possible, and try to ensure that the application can scale vertically (efficiently).
Similarly, partitioning across databases may work in the short term given a lack of resources, but maintenance and enhancements will be more expensive, though it may also buy time to acquire more resources. Not using distributed transactions may help performance, but it will incur additional development and support costs to maintain data integrity.

As with other aspects of software development, you are just implementing the best solution available given the resources. I do agree that we need to think through these issues to make the correct trade-offs. Obviously, architectural experience comes in handy here. Lacking that, prototyping up front helps a little; without experience the variables considered will not be adequate, but the exercise eventually helps in gaining that experience.

Bill Venners

Posts: 2248
Nickname: bv
Registered: Jan, 2002

Re: Database Partitioning for Scale Posted: May 2, 2006 12:38 PM
Dinesh Bhat wrote:

> Regarding database partioning, I have always felt that
> vertical scaling though expensive in the short term is
> cheaper in the long term. Horizontal scaling involves lot
> of maintenance and is inefficent ( related to
> performance). Of course we have better availabililty as a
> result.
>
> A typical example in the earlier days of J2EE was to use
> the database as a data dump and implement all the logic in
> the middle tier and then use the panacea of horizontal
> scaling as it turned out, that did not work out. To the
> rescue emerged (clustered) caching solutons and now there
> are indexing solutions for the caches. I get the feeling
> that they are trying to use another (like) database to
> cache a database.
>
> I think about horizontal partioning only in terms of
> availability if possible and try to ensure that the
> application can scale vertically ( efficiently).
> Similarly partioning across databases may work in the
> short term given the lack of resources, but maintenance
> and enhancements would be more expensive, though it may
> also buy more time to acquire more resources. Not using
> distributed transactions may help performance, but would
> incur additional development/support costs to maintain
> data integrity.
>
That's really great feedback, thanks. I don't have a lot of experience designing scalable enterprise systems, so in the end I hedged my bets. Our design gives us the option of partitioning different sites or groups of sites into different databases, but doesn't force us to do so. We could also have every site in one database. I figured I'd initially spread our sites across two databases, just to make sure the partitioning worked, and then see how it goes. I am concerned about the added maintenance costs of database partitioning, but figure it is good to give us choice in the future. We can easily split up a sites database. We can also combine them, though that's a bit trickier and not something I figured we'd do. But it will be possible.

The main complexity cost is that I need to define some kind of way for one site to send a message to another, so I'm going to do that. That will allow us to use partitioning to help us scale, to the extent that turns out to be an issue, and to manage the impact of outages. And the real potential for loss of data integrity is back links. In our case, that's quite acceptable. I'd like intra-Artima links to always work, but if occasionally they don't, users can click the back button. And we can also have a spider slowly crawling the site anyway, looking for broken links inside the network as well as to the outside.

On the subject of outages, Frank Sommers pointed me to a Wall Street Journal article about the outages at salesforce.com. Their customers were put in quite a bind by the outages, because their sales efforts came to a standstill until the site came back up. One of the things the article mentioned was using database partitioning to minimize the impact of outages. (The article is titled "Web Services Face Reliability Challenges" and was published February 23, 2006; I can't link to it because it is behind a pay wall.)

"[Salesforce.com] has long combined customer information for major geographic areas into large databases, but is considering dividing up those databases so that an outage would affect fewer customers at a time..."

I really see such design decisions as similar to making investment decisions with money. You have incomplete information about the future, and yet you must make decisions today on how to apply your development resources. There's no right answer that always works. You just have to place your bets today and live with those decisions in the future.

Dinesh Bhat

Posts: 5
Nickname: dinesh
Registered: Nov, 2002

Re: Database Partitioning for Scale Posted: May 2, 2006 10:34 PM
I cannot consider myself an expert on this issue by any stretch of one's imagination, though I have been dealing with performance and scalability issues for some time and have some knowledge of the failover solutions used by our customers.

>The main complexity cost is that I need to define some kind of
>way for one site to send a message to another, so I'm going to
> do that. That will allow us to use partitioning to help us
>scale, to the extent that turns out to be an issue, and to
>manage the impact of outages. And the real potential for loss
>of data integrity is back links. In our case, that's quite
>acceptable. I'd like intra-Artima links to always work, but if
> occasionally they don't, users can click the back button. And
> we can also have a spider slowly crawling the anyway, looking
> for broken links inside the network as well as to the
>outside.
The other choice is to have a single database, replicate it, and run a cold/hot failover solution. The replication can be real-time, or a backup at the end of the day (if you are willing to lose data). It can be a cold failover if you can allow the site to be down for the time the database needs to fail over (manually or automatically).
Your choice depends on the cost of developing and maintaining the database partitions versus the cost of buying and setting up third-party solutions to support failover of a single database.
Of course, you still need to replicate your partitioned databases, so that cost is common to both solutions.
Also, it seems your database server cannot handle your current/future data needs. If it is more cost-effective for you to buy additional (smaller-capacity) hardware than to scale the hardware vertically, then the choice is easily made for you.
>I really see such design decisions as similar to making
>investment decisions with money. You have incomplete
>information about the future, and yet you must make
>decisions today on how to apply your development resources.
> There's no right answer that always works. You just have
>to place your bets today and live with those decisions in
>the future.
Working at a startup (Panayca Inc) myself, we have to make these decisions every day, trying to spend our resource dollars on the most important features and to use technologies that keep the cost of our product low. I am certain that I get it right less than half the time, given the nature of the applications we monitor, but I try to include a workaround. "Manual Override" is my favourite mantra; it can save one from many a sticky situation :).

David Vydra

Posts: 60
Nickname: dvydra
Registered: Feb, 2004

Re: Database Partitioning for Scale Posted: May 3, 2006 12:18 PM
Bill,

Very interesting and timely topic for me. Take a look at this paper: http://www.rhic.bnl.gov/RCF/UserInfo/Meetings/Technology/Irina_Sourikova_techMeet.pdf

Regards,
David
www.testdriven.com

David Vydra

Posts: 60
Nickname: dvydra
Registered: Feb, 2004

Re: Database Partitioning for Scale Posted: May 3, 2006 12:24 PM
On the subject of outages, let's start with your DNS, because that's how anyone finds you and connects to you in the first place. Here is what dnsreport.com has to say about the artima.com domain:

WARNING: All of your nameservers (listed at the parent nameservers) are in the same Class C (technically, /24) address space, which means that they are probably at the same physical location. Your nameservers should be at geographically dispersed locations. You should not have all of your nameservers at the same location. RFC2182 3.1 goes into more detail about secondary nameserver location.

Cheers,
David

> On the subject of outages, Frank Sommers pointed me to a
> Wall Street Journal article that was talking about the
> outages at salesforce.com. Their customers were put in a
> quite a bind by the outages, because their sales efforts
> came to a standstill until the site came back up. One of
> the things they mentioned was using database partitioning
> to minimize the impact of outages (It is called "Web
> Services Face Reliability Challenges and was published
> February 23, 2006. I can't link to it because it is behind
> a pay wall.)":
>
> [Salesforce.com] has long combined customer information
> for major geographic areas into large databases, but is
> considering dividing up those databases so that an outage
> would affect fewer customers at a time..."

Dinesh Bhat

Posts: 5
Nickname: dinesh
Registered: Nov, 2002

Re: Database Partitioning for Scale Posted: May 3, 2006 3:09 PM
I primarily deal with Oracle Database. The presented solution for Postgres is pretty good.

Bill Venners

Posts: 2248
Nickname: bv
Registered: Jan, 2002

Re: Database Partitioning for Scale Posted: May 3, 2006 3:40 PM
David Vydra wrote:
>
> Very interesting and timely topic for me. Take a look at
> this paper:
> http://www.rhic.bnl.gov/RCF/UserInfo/Meetings/Technology/Ir
> ina_Sourikova_techMeet.pdf
>
Thanks, David. Reading this reminded me of one other scenario that I wanted to make possible by this early architectural decision, but which I forgot to mention in my post: a geographically dispersed cluster. Were we to host community sites for Japanese folks, for example, we might someday want to put some servers close by in Japan and host those sites there to make them more responsive. Distributing servers geographically is certainly not something we're planning to do in the near future, but by making it possible to have different sites in different databases on the same LAN, I figured it would be easier to put some of those databases at other data centers located elsewhere as well.

I think the main benefit of architecting things this way is choice. No matter what comes, we should hopefully be able to adjust without drastic architectural surgery.

Rodrigo Senra

Posts: 1
Nickname: rsenra
Registered: Mar, 2006

Re: Database Partitioning for Scale Posted: May 3, 2006 8:53 PM
It is a very interesting perspective: design for scalability as optimization in the large. That makes me think about the blurry frontier between efficiency and optimization.

Philosophy aside, coincidentally (or not) database partitioning was the recurring theme amongst my clients this week. At least, it might fulfill their needs.

Perhaps what we are lacking is for "reliable group communication" to become a commodity, just like {TCP,UDP}/IP. Provided we had this reliable, totally ordered, efficient layer, replicated databases would be one step away, though this is easier said than done.

I have seen a lot of academic gossip about scalable, cost-effective replicated databases. Yet I have not seen that Grail in my FLOSS toolbox.

David Vydra

Posts: 60
Nickname: dvydra
Registered: Feb, 2004

Re: Database Partitioning for Scale Posted: May 4, 2006 12:38 AM
Good points. I also think that OS virtualization will work well with replicated databases. Does Artima really want to be in the business of managing physical servers? I suspect not. Imagine a scenario where a certain service level can be purchased from a hosting company - let's say we want a constant response time - and virtual servers are started automatically in appropriate geographic locations based on real-time usage analysis. I know folks are working on such technology; the question is when it will be cost-effective for a small company.

-d


> Perhaps, what we are lacking is for "reliable group
> communication" to become a commodity, just like
> {TCP,UDP}/IP. Provided we had this reliable, totally
> ordered, efficient layer, then replicated database would
> be one step away. Although, this is easier said than
> done.
>
> I have seen a lot of academical gossip about scalable,
> cost-effective replicated databases. Yet, I have not seen
> that Grail in my FLOSS toolbox.

John Lopez

Posts: 2
Nickname: godeke
Registered: Jan, 2004

Re: Database Partitioning for Scale Posted: May 4, 2006 1:23 PM
I think that calling it "optimization" (premature or not) does a disservice to your goal. The goal isn't to speed up your system via arbitrary changes to the algorithm or data structures; it is to plan ahead for scalability. In fact, you point out a few places where you are losing reliability (thus requiring background processing to "fix up" such issues) and adding communications overhead to the system: clearly not your typical premature optimization choices.

Calling it "premature scalability" would be a closer match to your concern, but again I see that the choices appear to be sound (the scalability tools don't require multiple databases to exist, just support such configurations).

Although I believe in test driven design and emergent architecture in the small, I believe that the concerns you are having are best served by thought up front. Scalability is probably the biggest downfall of agile methods when used too literally: agile methods within the framework of a larger architecture avoid the worst outcomes of being agile.

Bill Venners

Posts: 2248
Nickname: bv
Registered: Jan, 2002

Re: Database Partitioning for Scale Posted: May 4, 2006 4:03 PM
John Lopez wrote:

> I think that calling it "optimization" (premature or not)
> does a disservice to your goal. The goal isn't to speed up
> your system via arbitrary changes to the algorithm or data
> structures, it is to plan ahead for scalability. In fact,
> you point out a few places where you are losing
> reliability (thus requring background processing to "fix
> up" such issues) and adding communications overhead to the
> system: clearly not your typical premature optimization
> choices.
>
> Calling it "premature scalability" would be a closer match
> to your concern, but again I see that the choices appear
> to be sound (the scalability tools don't require multiple
> databases to exist, just support such configurations).
>
I believe I only referred to optimization in the context of performance optimization (in the first paragraph of the blog). To me, "optimization" implies you are trying to improve performance. Scalability is a different concern that is often at odds with raw performance; i.e., I expect that making things more scalable may slow down the application from the client's perspective. I called designing for scalability "worrying about scalability" because I don't know of a good catchy term, like optimization, for making changes to improve scalability. Scalation?

> Although I believe in test driven design and emergent
> architecture in the small, I believe that the concerns you
> are having are best served by thought up front.
> Scalability is probably the biggest downfall of agile
> methods when used too literally: agile methods within the
> framework of a larger architecture avoid the worst
> outcomes of being agile.
>
I think what I am trying to do is minimize the cost of future change for certain scenarios that may not happen (but that I hope will happen), while minimizing the present and near-term costs of doing so. I think the agile message of being suspicious of your ability to predict the future is very relevant to early scalability concerns, because 1) I may never need to scale, and 2) if I do need to scale someday, it may be in ways I can't foresee today.

So I'm investing a bit today, but an amount I feel I can afford and is reasonable given the circumstances. The near term cost is that I have to deal with a bit more complexity, with coming up with a way for one site to send a message to another. That's a real cost, but I consider it an investment, and all investment implies risk. Maybe it won't pay off, but hopefully it will.

The notion that many XP enthusiasts seem to promote, which says that you should *never* do anything to fulfill requirements you don't have today, is to me too extreme. The reason is that I don't believe XP practices will really make the cost-of-change curve flat. Major architectural surgery is expensive, so the architecture is the most important part to get right, and getting it right is in my experience what makes the cost of change low. When your architecture anticipates the kinds of changes you'll need to make, they are cheap and fast to make. I think that whether or not to partition our sites data into different databases here at Artima is a fairly fundamental architectural decision that would be costly for us to change later.

Bill Venners

Posts: 2248
Nickname: bv
Registered: Jan, 2002

Re: Database Partitioning for Scale Posted: May 4, 2006 4:51 PM
> Good points. I also think that OS virtualization will work
> well with replicated databases. Does Artima really wants
> to be in the business of managing physical servers? I
> suspect not. Imagine a scenario where a certain service
> level can be purchased from a hosting company - lets say
> we want a constant response time - and virtual servers
> will be started automatically in appropriate geographical
> locations based on real-time usage analysis.
>
It sounds good, but I'd really need to be convinced they could deliver on the promise. There is a cost to managing physical servers, but I would be afraid that delegating too much would leave me at the mercy of the kindness of vendors for uptime, response time, and the like - things that are rather important to user happiness.

David Vydra

Posts: 60
Nickname: dvydra
Registered: Feb, 2004

Re: Database Partitioning for Scale Posted: May 4, 2006 7:33 PM
Agreed. I don't have a specific company to recommend to you right now, but the good news is that you can mix and match your servers with virtual ones and thus incrementally gain trust.

-d

> It sounds good, but I'd really need to be convinced they
> could really deliver on the promise. There is cost to
> manage physical servers, but I would be afraid that
> delegating too much would leave me at the mercy of the
> kindness of vendors for uptime, response time,
> etc.--things that are rather important to user happiness.
