
Artima Developer Spotlight Forum
Database Denormalization and the NoSQL Movement

53 replies on 4 pages. Most recent reply: Sep 28, 2009 8:56 AM by John Zabroski

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 14, 2009 1:27 PM
> I love the querying power of SQL databases; nothing IMO
> comes close to SQL queries on a modern DB. That said, SQL
> DBs are not speed demons: table scans on single columns,
> multiple joins and aggregates can be slow. But unless
> speed is an issue I would go SQL.
>
> I remember creating a denormalized table with the current
> total balance per client, because calculating the current
> balance based on all transactions for a client was just
> too slow.

As I have been saying for some years, and as the last six months have borne out for those who pay attention, SSD-based multi-core, multi-processor machines remove any perceived join penalty. BCNF on such machines is faster than un-/de-normalized data on any machine.

STEC, Intel, Fusion-io, SanDisk are supplying drives to EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about every vendor on the planet. The game has changed; there are New Rules (to quote Brother Maher).

Mark Thornton

Posts: 275
Nickname: mthornton
Registered: Oct, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 1:09 AM
> As I have been saying for some years, and as the last six
> months have borne out for those who pay attention,
> SSD-based multi-core, multi-processor machines remove any
> perceived join penalty. BCNF on such machines is faster
> than un-/de-normalized data on any machine.

One of my awkward cases is storing matrices. The normalised form has rows containing a row key, a column key, and a data value. Where the data value is small (a float or two), specialised structures are many times faster than any database I have yet seen.
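
A minimal sketch (mine, and made up) of that contrast in Python, with a dict standing in for the database's index on (row key, column key):

import array

# Normalised form: one (row_key, col_key, value) record per cell, the way
# the rows of a matrix table look in a relational database.
triples = [(r, c, float(r * 4 + c)) for r in range(3) for c in range(4)]
lookup = {(r, c): v for r, c, v in triples}   # stands in for a B-tree index

# Specialised form: a flat array of floats plus arithmetic addressing.
n_cols = 4
dense = array.array('d', (float(i) for i in range(12)))

def dense_get(r, c):
    # One multiply, one add, one fetch: no key comparisons at all.
    return dense[r * n_cols + c]

assert lookup[(2, 3)] == dense_get(2, 3)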

nes

Posts: 137
Nickname: nn
Registered: Jul, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 7:54 AM
> > multiple joins and aggregates can be slow. But unless
> > speed is an issue I would go SQL.
>
> Doesn't using proper indexes and foreign keys address
> this?

To some degree, but you still have to walk the index for each record, in O(log n) time. If you have everything on the same row, access is constant time.
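
A rough sketch in Python of what I mean (toy data; bisect stands in for the B-tree probe):

import bisect

# Denormalized: the balance sits on the same row -> constant-time access.
row = {'client_id': 42, 'name': 'Acme', 'balance': 1234.5}
balance = row['balance']                           # O(1)

# Normalized: total the transactions via a sorted index -> an O(log n)
# descent to find the first match, then a fetch per matching record.
txn_index = [(7, 0), (42, 1), (42, 2), (99, 3)]    # (client_id, rowid), sorted
transactions = [(-100.0,), (1000.0,), (234.5,), (5.0,)]

pos = bisect.bisect_left(txn_index, (42, -1))      # the "log time" part
total = 0.0
while pos < len(txn_index) and txn_index[pos][0] == 42:
    total += transactions[txn_index[pos][1]][0]    # per-record fetch
    pos += 1
assert total == row['balance']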

> my experience is that straight queries using JDBC
> are an order of magnitude faster than Hibernate.

I don't have much experience with ORMs. I always thought they were used more for convenience than for speed.

nes

Posts: 137
Nickname: nn
Registered: Jul, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 8:08 AM
> > multiple joins and aggregates can be slow. But unless
> > speed is an issue I would go SQL.
> >
> As I have been saying for some years, and as the last six
> months have borne out for those who pay attention,
> SSD-based multi-core, multi-processor machines remove any
> perceived join penalty. BCNF on such machines is faster
> than un-/de-normalized data on any machine.
>
> STEC, Intel, Fusion-io, SanDisk are supplying drives to
> EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about
> every vendor on the planet. The game has changed; there
> are New Rules (to quote Brother Maher).

I would love that, but unfortunately the future hasn't arrived for me yet. Databases measured in TB are still a bit expensive on SSDs. Regardless, reading precalculated data will always be faster than calculating it on the fly. The hope is that in the future the speed difference will be small enough to be irrelevant in 99% of cases.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 8:34 AM
> > > multiple joins and aggregates can be slow. But unless
> > > speed is an issue I would go SQL.
> >
> > Doesn't using proper indexes and foreign keys address
> > this?
>
> To some degree, but you still have to walk the index for
> each record in "log" time. If you have everything on the
> same row it is constant time.

I'm not so sure about that. You are assuming that the DB must do a binary search. I see no reason why foreign keys couldn't be stored as direct memory locations in the DBMS.
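
A toy sketch in Python of what that would look like (this is how network-model and object databases resolve references; whether any particular SQL engine does it internally is implementation-specific):

# The "foreign key" is held as a direct object reference, so following it
# is a pointer dereference rather than an index search.
class Client:
    def __init__(self, name):
        self.name = name

class Order:
    def __init__(self, client, amount):
        self.client = client       # direct reference, not a key value
        self.amount = amount

acme = Client('Acme')
order = Order(acme, 99.0)
print(order.client.name)           # O(1) dereference; no B-tree walk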

Slava Imeshev

Posts: 114
Nickname: imeshev
Registered: Sep, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 1:27 PM
> Obasanjo claims the driving force behind this trend is the
> social features of today's web sites:
>
> Today, lots of Web applications have "social" features.

Half a dozen sites dealing with a particular problem is not "lots". I think the author needs to support his statement.

On a general note, it is our responsibility as engineers to pick the right tools for the job. Dismissing relational technology as a whole just because there is a very narrow use case that doesn't need it doesn't seem right.

Regards,

Slava Imeshev

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 2:57 PM
@What do you think of this idea of a NoSQL movement? To what extent do you think this trend is overhyped because the large web sites that need to denormalize are well known, successful sites? Do you plan to use a key/value store instead of a relational database in the near future? If so, why did you decide to go that route?

Ahhaha.

Love the hints - "overhyped", Bill.

I agree with Robert Young, although I would not build directly from a DBMS vendor's catalog - that is tight coupling and will result in multiple levels of bad design consequences. Tools like Andromeda (which I've read the full code for) are really a step away from the catalog and toward defining a canonical model based on Codd's relational theory. The Andromeda mailing list has mentioned adding support for Date/Darwen style semantics as well.

Another approach is Jonathan Edwards' Coherence project; another is P.J. Eby's Trellis project.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 2:59 PM
@STEC, Intel, Fusion-io, SanDisk are supplying drives to EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about every vendor on the planet. The game has changed; there are New Rules (to quote Brother Maher).

Fusion-io is indeed ridiculous in its performance improvements of SQL Server databases.

Dare's article is him parroting (without understanding) Microsoft researcher Pat Helland's comments about denormalization. This is Pat's fault, though: his presentation sucked.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:03 PM
Let me clarify... if you follow Dare's blog, you'd know why I said he was parroting Pat.

What you get is people repeating dumb things like: http://blogs.msdn.com/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx

Also, it is theoretically possible and desirable to tune the physical schema of an RDBMS to support such hierarchical and "navigation-oriented" read/write patterns. See Adam Marcus's recent MIT master's thesis for a navigation-oriented physical DBMS implementation.

This is not new stuff. What Pat is advocating are short-term solutions for companies that need an answer "right now" because of scaling costs "right now". In exchange, they are willing to be stuck with a mess of code "later" in order to compete "right now".

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:08 PM
Also, let me emphasize that Pat's presentation conflates two concepts:

denormalization and memoization/precomputation. In particular, his point that "normalization only matters if you do updates" is only true if you're talking about caching via precomputed query results. Such techniques aren't unique to RDBMSes; OODBMSes like Caché use them as well (except that the concept of an "index" in an OODBMS can be quite different, often surprisingly so, from one in a SQL DBMS).
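
To make the distinction concrete, a minimal sqlite3 sketch (table names invented): the schema stays normalized, while a separate table memoizes a query result:

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE txn (client_id INTEGER, amount REAL);        -- normalized
    CREATE TABLE balance_cache (client_id INTEGER PRIMARY KEY,
                                balance REAL);                -- memoized
""")
db.executemany("INSERT INTO txn VALUES (?, ?)",
               [(1, 100.0), (1, 25.0), (2, 7.0)])
# Precompute the aggregate; this is caching, not a denormalized schema.
db.execute("""INSERT INTO balance_cache
              SELECT client_id, SUM(amount) FROM txn GROUP BY client_id""")
row = db.execute("SELECT balance FROM balance_cache"
                 " WHERE client_id = 1").fetchone()
print(row)   # (125.0,)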

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:14 PM
> > STEC, Intel, Fusion-io, SanDisk are supplying drives to
> > EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about
> > every vendor on the planet. The game has changed; there
> > are New Rules (to quote Brother Maher).

> I would love that, but unfortunately the future hasn't
> arrived for me yet. Databases measured in TB are still a
> bit expensive on SSDs. Regardless, reading precalculated
> data will always be faster than calculating it on the fly.
> The hope is that in the future the speed difference will
> be small enough to be irrelevant in 99% of cases.

In what way are they still expensive? Cursors on TB-sized databases will certainly kill your performance. As Date mentions in What Not How, a rules engine should sit in front of the database, as it plays no role in actually storing and retrieving the data. Instead, it serves as a policy filter.

If you need to do things like precalculation, then you should explore other options as well, such as event streams captured from a service bus. These give you the ability to know how often your data becomes out of date. You can then reshuffle the physical database schema so that your writes are lumped together; to do this in a self-tuning manner, you'd want to reshuffle the physical storage details at least daily.

Now, if you have a single table constantly being written to, and a cursor that depends on both heavily written and heavily read values, you will get some slowdowns and suchlike.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:34 PM
> I love the querying power of SQL databases; nothing IMO
> comes close to SQL queries on a modern DB. That said, SQL
> DBs are not speed demons: table scans on single columns,
> multiple joins and aggregates can be slow. But unless speed
> is an issue I would go SQL.

Full table scans will be slow, but it is not the number of columns that is the issue. The real constraint is the amount of main memory you've likely got available: a full table scan on a large table will blow away whatever contents of the DB are cached in memory. That said, with most RDBMSes you'll see performance degradation for tables with many columns, mainly due to improper design or a data model that truly requires a lot of sparsely populated fields. In response, vendors are providing physical storage schemes with better characteristics for sparsely populated tables. However, that is neither here nor there, since the "denormalizing for social features" scenario Dare is talking about does not have such characteristics.

> I remember creating a denormalized table with the current
> total balance per client, because calculating the current
> balance based on all transactions for a client was just
> too slow.

Did you check the underlying physical data storage of the table? Why wasn't this problem diagnosable, and then solvable, using your DBMS's profiling tools? If you've done this, then a proper cost-based optimizer (CBO) can avoid a full table scan, or you can optimize the query with access-path hints so that it avoids one. The real problem then becomes writing maintainable SQL, which in my experience isn't actually hard.
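
To make the CBO point concrete, a hypothetical sqlite3 sketch (invented table and index names) where a covering index lets the optimizer satisfy the balance aggregate without touching the base table:

import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE txn (client_id INTEGER, amount REAL, memo TEXT)")
db.execute("CREATE INDEX txn_client_amount ON txn (client_id, amount)")
db.executemany("INSERT INTO txn VALUES (?, ?, '')",
               [(1, 100.0), (1, 25.0), (2, 7.0)])
plan = db.execute("EXPLAIN QUERY PLAN "
                  "SELECT SUM(amount) FROM txn WHERE client_id = ?",
                  (1,)).fetchall()
# In current SQLite builds the plan detail reads something like
# 'SEARCH txn USING COVERING INDEX txn_client_amount (client_id=?)'.
print(plan)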

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:48 PM
> To answer the question, if the data is BCNF, with proper
> constraints, then any application can, optionally, provide
> client editing. This can be done fully dynamically, reading
> the catalog and shipping constraints; more commonly the
> client side code (html/javascript/ajax/whatever) is
> generated from the catalog. Andromeda and sprox are two
> examples.

Thanks for the mention of sprox - never heard of it before. There are other examples of this, too. Zoho Creator actually allows you to build your own DSL for manipulating the contents of the database.

sprox, though, has some pretty clear flaws, in my humble opinion, that I do not want in my architecture. Observe:


from sprox.formbase import AddRecordForm
from formencode import Schema
from formencode.validators import FieldsMatch
from tw.forms import PasswordField, TextField

form_validator = Schema(
    chained_validators=(
        FieldsMatch('password', 'verify_password',
                    messages={'invalidNoMatch': 'Passwords do not match'}),
    )
)

class RegistrationForm(AddRecordForm):
    __model__ = User
    __require_fields__ = ['password', 'user_name', 'email_address']
    __omit_fields__ = ['_password', 'groups', 'created', 'user_id', 'town_id']
    __field_order__ = ['user_name', 'email_address', 'display_name',
                       'password', 'verify_password']
    __base_validator__ = form_validator
    email_address = TextField
    display_name = TextField
    verify_password = PasswordField('verify_password')


__omit_fields__ is pretty nasty, if it is what I think it is. I'd rather see this constructed by a computation than as an entity-attribute-value list (represented in this code as a literal list of field names).

Yes, you want to omit fields, but usually the reason WHY you omit them matters more than the list of what to omit itself. The list should therefore be constructed declaratively from the WHY, as in the sketch below.
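
A sketch of the kind of thing I mean, with invented field metadata (none of this is sprox API):

# Derive the omit list from the reason for omission; the WHY drives the
# list, and the list itself is never written out by hand.
FIELDS = {
    '_password':     {'internal': True},
    'groups':        {'managed_elsewhere': True},
    'created':       {'auto': True},
    'user_id':       {'auto': True},
    'town_id':       {'managed_elsewhere': True},
    'user_name':     {},
    'email_address': {},
}

def omit_because(*reasons):
    return [name for name, meta in FIELDS.items()
            if any(meta.get(r) for r in reasons)]

omit_fields = omit_because('internal', 'auto', 'managed_elsewhere')
print(omit_fields)   # same list as above, but the intent is explicit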

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 7:47 PM
> I would love that, but unfortunately the future hasn't
> arrived for me yet. Databases measured in TB are still a
> bit expensive on SSDs. Regardless, reading precalculated
> data will always be faster than calculating it on the fly.
> The hope is that in the future the speed difference will
> be small enough to be irrelevant in 99% of cases.

The point of the SSD/multi-core machine implementation is that de-/un-normalized data is *the reason* data stores go to TB. Refactor to BCNF, and throw away an order or two of magnitude of datastore. That is the point of these New Rules; it's NOT just being sequentially faster, it's being randomly faster. That's the win which motivates proper design. It's what removes the Chinese Wall between "logical design" and "physical design". Those who haven't figured that out should ruminate further on why two versions of "design" were ever defined.
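
A back-of-envelope illustration in Python (made-up rows; sys.getsizeof is crude, but the direction is the point):

import sys

# Denormalized: every order row repeats the customer's name and address.
orders_denorm = [
    ('Acme Corp', '1 Main St, Springfield', 'widget', 2),
    ('Acme Corp', '1 Main St, Springfield', 'gadget', 5),
    ('Acme Corp', '1 Main St, Springfield', 'sprocket', 1),
]

# BCNF: the customer facts are stored once and referenced by key.
customers = {1: ('Acme Corp', '1 Main St, Springfield')}
orders = [(1, 'widget', 2), (1, 'gadget', 5), (1, 'sprocket', 1)]

flat = sum(sys.getsizeof(x) for row in orders_denorm for x in row)
norm = (sum(sys.getsizeof(x) for v in customers.values() for x in v)
        + sum(sys.getsizeof(x) for row in orders for x in row))
print(flat, '>', norm)   # redundancy, not raw data, inflates the store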

art src

Posts: 33
Nickname: articulate
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 21, 2009 4:34 PM
> Refactor to BCNF, and throw away an order or two of
> magnitude of datastore. That is the point of these New
> Rules; it's NOT just being sequentially faster, it's being
> randomly faster. That's the win which motivates proper
> design.

Since computers are so fast, you can (and often should) use a solution that is not optimal from a performance perspective. As in your example, sometimes the 'right' solution is even faster than the 'wrong' one.

New rules might move the bar, but they don't take away the overhead. Storing Order Items physically with Orders will be faster when they are accessed together. One access of an SSD is still faster than the 2 or 10 or however many it takes to reassemble the parts.


> It's what removes the Chinese Wall between
> "logical design" and "physical design". Those who haven't
> figured that out should ruminate further on the very
> notion of why two versions of "design" have been defined.

The 1s and 0s on the disk are different from the logical model of relations, tuples, domains, etc. That is a good thing. A key point of the relational model is to present the database user with a single logical model, together with freedom of choice in the physical implementation.

SQL databases have failed to implement the relational model. They have failed to allow efficient physical clustering of objects while maintaining the relational logical model. This is clear in "A Relational Model of Data for Large Shared Data Banks", the original paper, where Codd enumerates the possible clusterings for parts and projects and says the logical model should be independent of the physical model.

Those with large personal investments in their SQL skills claim that decades of SQL dominion show that the relational model is an essential technology, and they berate whippersnappers with their new-fangled NoSQL tools. Ironically, those decades of SQL tyranny actually show that relational ideas, while attractive to me, are not a prerequisite for data-management ascendancy.
