Artima Developer Spotlight Forum
Seven Lessons on Scalability from Reddit

51 replies on 4 pages. Most recent reply: Jun 29, 2010 5:15 PM by John Zabroski


John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 19, 2010 1:55 PM

http://www.google.com/search?q=adding+columns+to+large+tables

I really don't know how on earth that could be a hard problem to solve.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Seven Lessons on Scalability from Reddit

Posted: May 19, 2010 2:10 PM

> Well there's going to be state somewhere. Do you mean
> you'd keep all state in the database initially?

Sorry. I'm not making myself clear. I mean stateless in the REST sense. That is, no state is shared across requests. And if you need a client state (say a login session) there are a number of options for client maintained state.

You could store user session information in the DB, but I would want a good reason for doing that. Anyway, if you keep things stateless, it makes things so much easier, especially caching.

I'm not saying you should never have stateful web sites. I'm just saying that if you can avoid it, it makes things a lot simpler. For something like reddit, I would guess stateless is the way to go.
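One of the "client maintained state" options mentioned above can be sketched as a signed token: the server hands the state to the client and only checks a signature when it comes back, so nothing has to be shared across requests. This is a minimal sketch, not anything reddit is known to use; `SECRET_KEY`, `sign_state`, and `verify_state` are hypothetical names chosen for illustration.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret; in practice it would come from configuration.
SECRET_KEY = b"change-me"


def sign_state(state: dict) -> str:
    """Serialize the state and append an HMAC so the client cannot alter it."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def verify_state(token: str):
    """Return the state dict if the signature checks out, otherwise None."""
    payload, sep, signature = token.rpartition(".")
    if not sep:
        return None  # no signature part at all
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes to avoid leaking how much matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode("ascii")))
```

The token would typically travel in a cookie or header. Note that the state is signed but not encrypted, so it should not carry anything the client must not read.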

Morgan Conrad

Posts: 307
Nickname: miata71
Registered: Mar, 2006

Re: Seven Lessons on Scalability from Reddit

Posted: May 19, 2010 6:12 PM

Having worked on an early-development phase project where SQL tables were frequently changing, my experience says that John is glossing over many of the problems. Sure, it is "easy" to write script after script, update_v1.0_to_v1.1.sql, update_v1.1_to_v1.2.sql, update_v1.17btest3_tov1.17fc1, but it quickly becomes non-easy to keep up to date. Also, you collect some preliminary data that you rely on for use-case studies, a standard test run, or similar; then there's an update, and you may no longer be able to run with the old data.
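For readers unfamiliar with the "script after script" approach, here is a minimal sketch of how such update scripts are often tracked, assuming SQLite and a hypothetical `schema_version` table. The `MIGRATIONS` list and the `sample` table are invented for illustration; this is not the process used on the project described above.

```python
import sqlite3

# Hypothetical ordered migrations; each entry is (target_version, SQL).
MIGRATIONS = [
    (1, "CREATE TABLE sample (id INTEGER PRIMARY KEY, reading REAL)"),
    (2, "ALTER TABLE sample ADD COLUMN instrument TEXT"),
]


def current_version(conn):
    """Read the schema version recorded in the database (0 if none yet)."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn):
    """Apply every migration newer than the recorded version, in order."""
    version = current_version(conn)
    for target, sql in MIGRATIONS:
        if target > version:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (target,))
            conn.commit()
            version = target
    return version
```

Calling `migrate(sqlite3.connect("some.db"))` twice is harmless, since the second call finds nothing newer than the recorded version. The sketch does not address the harder problem raised here: what to do with old data when a new version needs values that were never collected.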

IMO, a "big sloppy hash table" database, or, perhaps better, an OODB, is superior during the early development phase. Note that, even though we used an SQL database, several of our tables were effectively big key-value maps anyway because the underlying structure was fluid.
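A table that is "effectively a big key-value map" might look like the following sketch, assuming SQLite. The `attribute` table and the helper names are hypothetical, and values are stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One generic table instead of a fixed set of columns; all names are hypothetical.
conn.execute(
    "CREATE TABLE attribute ("
    " entity_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (entity_id, name))"
)


def set_attr(entity_id, name, value):
    """Insert or overwrite one attribute; no schema change is needed for new names."""
    conn.execute(
        "INSERT OR REPLACE INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        (entity_id, name, str(value)),
    )


def get_entity(entity_id):
    """Collect all attributes of one entity into a plain dict of strings."""
    rows = conn.execute(
        "SELECT name, value FROM attribute WHERE entity_id = ?", (entity_id,)
    )
    return dict(rows.fetchall())
```

The trade-off is the usual one: new fields cost nothing to add, but the database no longer enforces types or required fields for you.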

Also, later in the project, we considered various Map-Reduce systems such as Hadoop, which seem to be designed for big sloppy hash tables. Here I'm far from an expert, but if non-SQL databases work better with massively parallel distributed "cloudy" systems (and I have no idea if they do), that's something to discuss here at Artima.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 8:16 AM

I believe the phrase "glossing over" suggests intentional deceit.

All the issues you appear to struggle with really come down to the lack of a deterministic, reliable build process.

Sorry, I am not glossing over anything here. You can mathematically do semi-automatic script generation using program invariants.

If you need me to explain this, I will.

What most programmers lack is insight, or data mining, on how they design systems. That is why I pressed James to tell me WHY reddit switched to CassandraDB. What STATISTICAL MEASURES did they make to evaluate the effectiveness of the move? HOW EFFECTIVE was it? WHAT configuration options and partitioning strategies did they try while using RDBMSes? WHAT was their time/cost expenditure while using a RDBMS? WHICH RDBMS? WHAT are the read/write characteristics? WHAT was the mean-time-to-failure? You know, a REAL engineering discussion. Not a superfluous list.

If you think I am "glossing over" anything, I am not. It is simply that I am providing only as much detail as perhaps is necessary to debunk a poor presentation that will likely infect the minds of poor college souls.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 8:20 AM

> Having worked on an early-development phase project where
> SQL tables were frequently changing, my experience says
> that John is glossing over many of the problems.

I currently work on a similar project, so I am not sure how our requirements differ.

Perhaps you need to be very explicit in what problems you experience. I can then explain in very deterministic terms how to solve those problems, based on proven experience.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 8:41 AM

> That is why I pressed James to
> tell me WHY reddit switched to CassandraDB. What
> STATISTICAL MEASURES did they make to evaluate the
> effectiveness of the move? HOW EFFECTIVE was it? WHAT
> configuration options and partitioning strategies did they
> try while using RDBMSes? WHAT was their time/cost
> expenditure while using a RDBMS? WHICH RDBMS? WHAT are
> the read/write characteristics? WHAT was the
> mean-time-to-failure? You know, a REAL engineering
> discussion. Not a superfluous list.

I don't understand why you would press me for this information. I don't work for reddit nor do I use any "no-sql" databases. Unless they publish their reasons, how would I know? I don't really have an option or a need to switch to a different type of DB. I don't need the specific pieces of information to consider this as a solution in the future though. I would do that analysis myself.

Here's a link if you are interested:

http://www.infoq.com/news/2010/03/Digg-Reddit-NoSQL-Cassandra

On the Cassandra website there's a bunch of links to company blogs about how they use Cassandra. If this is the kind of information you are looking for, I suggest you do some research. Perhaps you could provide this kind of information on why you think RDBMS is better.

http://cassandra.apache.org/

Daniel Jimenez

Posts: 40
Nickname: djimenez
Registered: Dec, 2004

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 9:36 AM

> You know, a REAL engineering
> discussion. Not a superfluous list.

Can I just ask, when and where has this ever happened? I'd like to hope it does at, say, NASA, or for biomedical devices, places where people's lives depend on software, but I doubt it does. (I have some knowledge to the contrary in the latter field, for example.)

It certainly hasn't occurred at any of the various corporate businesses in which I've worked, much to my chagrin.

How would you justify the cost of doing such an analysis to a leadership group that knows nothing about the difficulties of writing software and therefore thinks every delivery time is too late?

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 9:51 AM

Daniel,

In my experience, building a track record is the only way you can go beyond the simple, typical agile model of buzzwords (TDD/BDD, DDD, unit tests, refactoring, etc.)

Sometimes you have to bootstrap yourself with those weaker methods in order for you to make a case for better methods.

James,

I was not demanding you answer for Steve Huffman or whoever works at reddit now. I was simply saying "who cares"; iff they do not measure these things, then it is pseudo-science and cargo cult programming. I realize that bar might be too high for most programmers to agree with me.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 10:06 AM

> James,
>
> I was not demanding you answer for Steve Huffman or
> whoever works at reddit now. I was simply saying "who
> cares"; iff they do not measure these things, then it is
> pseudo-science and cargo cult programming. I realize that
> bar might be too high for most programmers to agree with
> me.

I get it. It would be very interesting to see that kind of analysis. From what I can tell, the advantage of using something like Cassandra really comes down to its decentralized architecture. I think the advantages of this are obvious, but I can attempt to add more detail if you need it.

The question I have is: can a relational DB be decentralized in the same way? If it can, then why are so many companies using tools like Cassandra?

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 10:07 AM

> Here's a link if you are interested:
>
> http://www.infoq.com/news/2010/03/Digg-Reddit-NoSQL-Cassandra

From there: "The main culprit was MySQL because, as any other SQL database, it is optimized for reads and cannot handle writes properly:"

In short, MySQL is a difficult tool to use for a site like reddit. In general, I agree, but I think this is so obvious as not worth stating; this is programming 102. However, not all SQL databases are optimized for reads -- so InfoQ is simply wrong and the editors for InfoQ should be ashamed for allowing such sloppy content to enter their site. In particular, I can think of a Person XYZ who should be embarrassed. MySQL's success on the web for CMSes mostly has to do with the fact CMSes have read-heavy, write-infrequent data access patterns. This propelled MySQL's popularity when commingled with spaghetti coding languages like PHP3, e.g. look at Drupal and Joomla as two examples.

PostgreSQL is optimized for writes, and uses a different transaction control theory model. This makes writing rollbacks in triggers and putting data validation logic in triggers surprisingly cheap compared to MySQL (you shouldn't do this, anyway, despite it being cheap, since the whole purpose of fast writes is to have writes succeed and not block the CPU for contention with reads).

A practical design technique -- called wait-based performance tuning -- suggests business logic is best placed outside the database, but also the business logic should guarantee any data into or out of the database must've come from a database 'as if' the database was enforcing all integrity.

Also from the article: "This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead. …"

I don't understand the mind of somebody who writes such uninformative statements as these. It just sweeps all design issues under the rug and excuses it as, "look, it didn't work, who really cares why, so here are some generic reasons to pacify you, because you are so stupid you'll gobble up any explanation I give you".

> Perhaps you could provide this kind of
> information on why you think RDBMS is better.

I would never make such a blanket statement.

Morgan Conrad

Posts: 307
Nickname: miata71
Registered: Mar, 2006

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 12:46 PM

>Perhaps you need to be very explicit in what problems you experienced

1) creating the scripts was not a problem. Keeping up to date on them was. Some of it can be blamed on an unreliable build process. But see #2.

2) actual data came from a prototype machine, and was rare and "valuable". I was developing code to analyze the data. In addition to passing typical unit tests, I was working on speed improvements, while preserving (or improving) accuracy and adding features. There was also "let's click through the GUI and make sure nothing breaks".

After many of the schema updates, previously acquired data was no longer valid. Some field was added and not-null, some table was seriously refactored, etc. Suddenly, I could no longer work. So I'd keep an older DB around. Or else, if I saw a new updateV3_to_v4.sql script appear, I'd deliberately NOT check it out. As would other developers and testers. Which made issue #1 a real big deal.

So the real issue is #2, and I don't know how older well-factored SQL data can be automagically updated if it requires new data that isn't there. Even if the data is there, writing the script may be tricky? (Don't know, I wasn't the DB/SQL expert!) And this problem fed into #1. I'd appreciate your thoughts if you see something we were missing.
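For the specific case of "some field was added and not-null", one common way to keep previously acquired rows loadable is to give the new column an explicit placeholder default. The sketch below assumes SQLite; the `run` table, the `operator` column, and the `'unknown'` marker are hypothetical. It only keeps old data valid against the new schema and does nothing for analyses that truly need the missing values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Older schema: data acquired from the prototype before the change.
conn.execute("CREATE TABLE run (id INTEGER PRIMARY KEY, signal REAL NOT NULL)")
conn.executemany("INSERT INTO run (signal) VALUES (?)", [(0.5,), (0.7,)])

# Newer schema: a new required field. The DEFAULT acts as an explicit
# "not recorded" marker, so rows that predate the column stay valid.
conn.execute("ALTER TABLE run ADD COLUMN operator TEXT NOT NULL DEFAULT 'unknown'")

# New rows supply the real value; old rows report the placeholder.
conn.execute("INSERT INTO run (signal, operator) VALUES (?, ?)", (0.9, "alice"))
rows = conn.execute("SELECT id, signal, operator FROM run ORDER BY id").fetchall()
```

Code that consumes `rows` then has to decide what `'unknown'` means for its analysis, which is a design question the schema change cannot answer by itself.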

Kay Schluehr

Posts: 302
Nickname: schluehk
Registered: Jan, 2005

Re: Seven Lessons on Scalability from Reddit

Posted: May 20, 2010 9:46 PM

> What most programmers lack is insight, or data mining, on
> how they design systems. That is why I pressed James to
> tell me WHY reddit switched to CassandraDB. What
> STATISTICAL MEASURES did they make to evaluate the
> effectiveness of the move? HOW EFFECTIVE was it? WHAT
> configuration options and partitioning strategies did they
> try while using RDBMSes? WHAT was their time/cost
> expenditure while using a RDBMS? WHICH RDBMS? WHAT are
> the read/write characteristics? WHAT was the
> mean-time-to-failure? You know, a REAL engineering
> discussion. Not a superfluous list.

When you'd expose your company's database architecture on the level of detail that measurements can be reproduced you'd probably get fired, but we'd surely have a better discussion ;)

Fred Finkelstein

Posts: 48
Nickname: marsilya
Registered: Jun, 2008

Re: Seven Lessons on Scalability from Reddit

Posted: May 21, 2010 7:50 AM

I agree with Zabroski.

If you don't know what circumstances led to the rules, then it is just like magic - do this and be happy. Also it is dangerous - because someone might think he should not only read these rules but really apply them! If the conditions of his project differ this can lead to a disaster.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 21, 2010 8:12 AM

> > You know, a REAL engineering
> > discussion. Not a superfluous list.
>
> When you'd expose your company's database architecture on
> the level of detail that measurements can be reproduced
> you'd probably get fired, but we'd surely have a better
> discussion ;)

Uh...

I don't get it.

These social networking sites are all using CassandraDB now anyway. How on earth can you possibly tell me that BEFORE vs. AFTER benchmarks are proprietary?

If your company is protecting its failures, it will never move forward. Moreover, the only reason to share its successes is so that others may help improve the success. Otherwise, it makes no sense for Facebook to open source Cassandra. So to actually improve things, it is very important that you have great benchmarks and an understanding of the engineering trade-offs.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Seven Lessons on Scalability from Reddit

Posted: May 21, 2010 8:14 AM

Let me just add that the raison d'etre for social networking sites to share scaling code is that none of these sites appear to make much money. It is stupid to have an inhouse staff paid millions of kronor, dollars, yen, pounds, euro, whatever. -- You'll never make your money back unless Google or Microsoft or Rupert Murdoch wants your customer base.


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use