
Artima Developer Spotlight Forum
Database Denormalization and the NoSQL Movement

53 replies on 4 pages. Most recent reply: Sep 28, 2009 8:56 AM by John Zabroski

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 14, 2009 1:27 PM
> I love the querying power of SQL databases; nothing IMO
> comes close to SQL queries on a modern DB. That said, SQL
> DBs are not speed demons: table scans on single columns,
> multiple joins and aggregates can be slow. But unless
> speed is an issue I would go SQL.
>
> I remember creating a denormalized table with the current
> total balance per client, because calculating the current
> balance based on all transactions for a client was just
> too slow.

As I have been saying for some years, and as the last six months have borne out for those who pay attention, SSD-based multi-core, multi-processor machines remove any perceived join penalty. BCNF on such machines is faster than un-/de-normalized data on any machine.

STEC, Intel, Fusion-io, SanDisk are supplying drives to EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about every vendor on the planet. The game has changed; there are New Rules (to quote Brother Maher).

Mark Thornton

Posts: 275
Nickname: mthornton
Registered: Oct, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 1:09 AM
> As I have been saying for some years, and as the last six
> months have borne out for those who pay attention,
> SSD-based multi-core, multi-processor machines remove any
> perceived join penalty. BCNF on such machines is faster
> than un-/de-normalized data on any machine.

One of my awkward cases is storing matrices. The normalised form has rows containing a row key, a column key, and a data value. Where the data value is small (a float or two), specialised structures are many times faster than any database I have yet seen.
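
A minimal sketch (mine, and made up) of that contrast in Python, with a dict standing in for the database's index on (row key, column key):

import array

# Normalised form: one (row_key, col_key, value) record per cell, the way
# the rows of a matrix table look in a relational database.
triples = [(r, c, float(r * 4 + c)) for r in range(3) for c in range(4)]
lookup = {(r, c): v for r, c, v in triples}   # stands in for a B-tree index

# Specialised form: a flat array of floats plus arithmetic addressing.
n_cols = 4
dense = array.array('d', (float(i) for i in range(12)))

def dense_get(r, c):
    # One multiply, one add, one fetch: no key comparisons at all.
    return dense[r * n_cols + c]

assert lookup[(2, 3)] == dense_get(2, 3)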

nes

Posts: 137
Nickname: nn
Registered: Jul, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 7:54 AM
> > multiple joins and aggregates can be slow. But unless
> > speed is an issue I would go SQL.
>
> Doesn't using proper indexes and foreign keys address
> this?

To some degree, but you still have to walk the index for each record, in O(log n) time. If you have everything on the same row, access is constant time.
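
A rough sketch in Python of what I mean (toy data; bisect stands in for the B-tree probe):

import bisect

# Denormalized: the balance sits on the same row -> constant-time access.
row = {'client_id': 42, 'name': 'Acme', 'balance': 1234.5}
balance = row['balance']                           # O(1)

# Normalized: total the transactions via a sorted index -> an O(log n)
# descent to find the first match, then a fetch per matching record.
txn_index = [(7, 0), (42, 1), (42, 2), (99, 3)]    # (client_id, rowid), sorted
transactions = [(-100.0,), (1000.0,), (234.5,), (5.0,)]

pos = bisect.bisect_left(txn_index, (42, -1))      # the "log time" part
total = 0.0
while pos < len(txn_index) and txn_index[pos][0] == 42:
    total += transactions[txn_index[pos][1]][0]    # per-record fetch
    pos += 1
assert total == row['balance']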

> my experience is that straight queries using JDBC
> are an order of magnitude faster than Hibernate.

I don't have much experience with ORMs. I always thought they were used more for convenience than for speed.

nes

Posts: 137
Nickname: nn
Registered: Jul, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 8:08 AM
> > multiple joins and aggregates can be slow. But unless
> > speed is an issue I would go SQL.
> >
> As I have been saying for some years, and as the last six
> months have borne out for those who pay attention,
> SSD-based multi-core, multi-processor machines remove any
> perceived join penalty. BCNF on such machines is faster
> than un-/de-normalized data on any machine.
>
> STEC, Intel, Fusion-io, SanDisk are supplying drives to
> EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about
> every vendor on the planet. The game has changed; there
> are New Rules (to quote Brother Maher).

I would love that, but unfortunately the future hasn't arrived for me yet. Databases measured in TB are still a bit expensive on SSDs. Regardless, reading precalculated data will always be faster than calculating it on the fly. The hope is that in the future the speed difference will be small enough to be irrelevant in 99% of cases.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 8:34 AM
> > > multiple joins and aggregates can be slow. But unless
> > > speed is an issue I would go SQL.
> >
> > Doesn't using proper indexes and foreign keys address
> > this?
>
> To some degree, but you still have to walk the index for
> each record in "log" time. If you have everything on the
> same row it is constant time.

I'm not so sure about that. You are assuming that the DB must do a binary search. I see no reason why foreign keys couldn't be stored as direct memory locations in the DBMS.
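
A toy sketch in Python of what that would look like (this is how network-model and object databases resolve references; whether any particular SQL engine does it internally is implementation-specific):

# The "foreign key" is held as a direct object reference, so following it
# is a pointer dereference rather than an index search.
class Client:
    def __init__(self, name):
        self.name = name

class Order:
    def __init__(self, client, amount):
        self.client = client       # direct reference, not a key value
        self.amount = amount

acme = Client('Acme')
order = Order(acme, 99.0)
print(order.client.name)           # O(1) dereference; no B-tree walk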

Slava Imeshev

Posts: 114
Nickname: imeshev
Registered: Sep, 2004

Re: Database Denormalization and the NoSQL Movement Posted: Sep 15, 2009 1:27 PM
> Obasanjo claims the driving force behind this trend is the
> social features of today's web sites:
>
> Today, lots of Web applications have "social" features.

Half a dozen sites dealing with a particular problem is not "lots". I think the author needs to support his statement.

On a general note, it is our responsibility as engineers to pick the right tools for the job. Dismissing relational technology as a whole just because there is a very narrow use case that doesn't need it doesn't seem right.

Regards,

Slava Imeshev

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 2:57 PM
@What do you think of this idea of a NoSQL movement? To what extent do you think this trend is overhyped because the large web sites that need to denormalize are well known, successful sites? Do you plan to use a key/value store instead of a relational database in the near future? If so, why did you decide to go that route?

Ahhaha.

Love the hints - "overhyped", Bill.

I agree with Robert Young, although I would not build directly from a DBMS vendor's catalog - that is tight coupling and will result in multiple levels of bad design consequences. Tools like Andromeda (which I've read the full code for) are really a step away from the catalog and toward defining a canonical model based on Codd's relational theory. The Andromeda mailing list has mentioned adding support for Date/Darwen style semantics as well.

Another approach is Jonathan Edwards' Coherence project; another is P.J. Eby's Trellis project.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 2:59 PM
@STEC, Intel, Fusion-io, SanDisk are supplying drives to EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about every vendor on the planet. The game has changed; there are New Rules (to quote Brother Maher).

Fusion-io is indeed ridiculous in its performance improvements of SQL Server databases.

Dare's article is him parroting (without understanding) Microsoft researcher Pat Helland's comments about denormalization. This is Pat's fault, though: his presentation sucked.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:03 PM
Let me clarify... if you follow Dare's blog, you'd know why I said he was parroting Pat.

What you get is people repeating dumb things like: http://blogs.msdn.com/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx

Also, it is theoretically possible and desirable to tune the physical schema of an RDBMS to support such hierarchical and "navigation-oriented" read/write patterns. See Adam Marcus's recent MIT master's thesis for a navigation-oriented physical DBMS implementation.

This is not new stuff. What Pat is advocating are short-term solutions for companies that need an answer "right now" because of scaling costs "right now". In exchange, they are willing to be stuck with a mess of code "later" in order to compete "right now".

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:08 PM
Also, let me emphasize that Pat's presentation conflates two concepts:

denormalization and memoization/precomputation. In particular, his point that "normalization only matters if you do updates" is only true if you're talking about caching via precomputed query results. Such techniques aren't unique to RDBMSes; OODBMSes like Caché use them as well (except that the concept of an "index" in an OODBMS can be quite different, often surprisingly so, from one in a SQL DBMS).
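
To make the distinction concrete, a minimal sqlite3 sketch (table names invented): the schema stays normalized, while a separate table memoizes a query result:

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE txn (client_id INTEGER, amount REAL);        -- normalized
    CREATE TABLE balance_cache (client_id INTEGER PRIMARY KEY,
                                balance REAL);                -- memoized
""")
db.executemany("INSERT INTO txn VALUES (?, ?)",
               [(1, 100.0), (1, 25.0), (2, 7.0)])
# Precompute the aggregate; this is caching, not a denormalized schema.
db.execute("""INSERT INTO balance_cache
              SELECT client_id, SUM(amount) FROM txn GROUP BY client_id""")
row = db.execute("SELECT balance FROM balance_cache"
                 " WHERE client_id = 1").fetchone()
print(row)   # (125.0,)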

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:14 PM
> > STEC, Intel, Fusion-io, SanDisk are supplying drives to
> > EMC, IBM, HP, Compellent, NetApp, 3Par, Sun and just about
> > every vendor on the planet. The game has changed; there
> > are New Rules (to quote Brother Maher).

> I would love that, but unfortunately the future hasn't
> arrived for me yet. Databases measured in TB are still a
> bit expensive on SSDs. Regardless, reading precalculated
> data will always be faster than calculating it on the fly.
> The hope is that in the future the speed difference will
> be small enough to be irrelevant in 99% of cases.

In what way are they still expensive? Cursors on TB-sized databases will certainly kill your performance. As Date mentions in What Not How, a rules engine should sit in front of the database, as it plays no role in actually storing and retrieving the data. Instead, it serves as a policy filter.

If you need to do things like precalculation, then you should explore other options as well, such as event streams captured from a service bus. These give you the ability to know how often your data becomes out of date. You can then reshuffle the physical database schema so that your writes are lumped together; to do this in a self-tuning manner, you'd want to reshuffle the physical storage details at least daily.

Now, if you have a single table constantly being written to, and a cursor that depends on both heavily written and heavily read values, you will get some slowdowns and suchlike.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:34 PM
> I love the querying power of SQL databases; nothing IMO
> comes close to SQL queries on a modern DB. That said, SQL
> DBs are not speed demons: table scans on single columns,
> multiple joins and aggregates can be slow. But unless speed
> is an issue I would go SQL.

Full table scans will be slow, but it is not the number of columns that is the issue. The real constraint is the amount of main memory you've likely got available: a full table scan on a large table will blow away whatever contents of the DB are cached in memory. That said, with most RDBMSes you'll see performance degradation for tables with many columns, mainly due to improper design or a data model that truly requires a lot of sparsely populated fields. In response, vendors are providing physical storage schemes with better characteristics for sparsely populated tables. However, that is neither here nor there, since the "denormalizing for social features" scenario Dare is talking about does not have such characteristics.

> I remember creating a denormalized table with the current
> total balance per client, because calculating the current
> balance based on all transactions for a client was just
> too slow.

Did you check the underlying physical data storage of the table? Why wasn't this problem diagnosable, and then solvable, using your DBMS's profiling tools? If you've done this, then a proper cost-based optimizer (CBO) can avoid a full table scan, or you can optimize the query with access-path hints so that it avoids one. The real problem then becomes writing maintainable SQL, which in my experience isn't actually hard.
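
To make the CBO point concrete, a hypothetical sqlite3 sketch (invented table and index names) where a covering index lets the optimizer satisfy the balance aggregate without touching the base table:

import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE txn (client_id INTEGER, amount REAL, memo TEXT)")
db.execute("CREATE INDEX txn_client_amount ON txn (client_id, amount)")
db.executemany("INSERT INTO txn VALUES (?, ?, '')",
               [(1, 100.0), (1, 25.0), (2, 7.0)])
plan = db.execute("EXPLAIN QUERY PLAN "
                  "SELECT SUM(amount) FROM txn WHERE client_id = ?",
                  (1,)).fetchall()
# In current SQLite builds the plan detail reads something like
# 'SEARCH txn USING COVERING INDEX txn_client_amount (client_id=?)'.
print(plan)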

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 3:48 PM
> To answer the question, if the data is BCNF, with proper
> constraints, then any application can, optionally, provide
> client editing. This can be done fully dynamically, reading
> the catalog and shipping constraints; more commonly the
> client side code (html/javascript/ajax/whatever) is
> generated from the catalog. Andromeda and sprox are two
> examples.

Thanks for the mention of sprox - never heard of it before. There are other examples of this, too. Zoho Creator actually allows you to build your own DSL for manipulating the contents of the database.

sprox, though, has some pretty clear flaws, in my humble opinion, that I do not want in my architecture. Observe:


from sprox.formbase import AddRecordForm
from formencode import Schema
from formencode.validators import FieldsMatch
from tw.forms import PasswordField, TextField

form_validator = Schema(
    chained_validators=(
        FieldsMatch('password', 'verify_password',
                    messages={'invalidNoMatch': 'Passwords do not match'}),
    )
)

class RegistrationForm(AddRecordForm):
    __model__ = User
    __require_fields__ = ['password', 'user_name', 'email_address']
    __omit_fields__ = ['_password', 'groups', 'created', 'user_id', 'town_id']
    __field_order__ = ['user_name', 'email_address', 'display_name',
                       'password', 'verify_password']
    __base_validator__ = form_validator
    email_address = TextField
    display_name = TextField
    verify_password = PasswordField('verify_password')


__omit_fields__ is pretty nasty, if it is what I think it is. I'd rather see this constructed by a computation than as an entity-attribute-value list (represented in this code as a literal list of field names).

Yes, you want to omit fields, but usually the reason WHY you omit them matters more than the list of what to omit itself. The list should therefore be constructed declaratively from the WHY, as in the sketch below.
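
A sketch of the kind of thing I mean, with invented field metadata (none of this is sprox API):

# Derive the omit list from the reason for omission; the WHY drives the
# list, and the list itself is never written out by hand.
FIELDS = {
    '_password':     {'internal': True},
    'groups':        {'managed_elsewhere': True},
    'created':       {'auto': True},
    'user_id':       {'auto': True},
    'town_id':       {'managed_elsewhere': True},
    'user_name':     {},
    'email_address': {},
}

def omit_because(*reasons):
    return [name for name, meta in FIELDS.items()
            if any(meta.get(r) for r in reasons)]

omit_fields = omit_because('internal', 'auto', 'managed_elsewhere')
print(omit_fields)   # same list as above, but the intent is explicit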

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 18, 2009 7:47 PM
> I would love that, but unfortunately the future hasn't
> arrived for me yet. Databases measured in TB are still a
> bit expensive on SSDs. Regardless, reading precalculated
> data will always be faster than calculating it on the fly.
> The hope is that in the future the speed difference will
> be small enough to be irrelevant in 99% of cases.

The point of the SSD/multi-core machine implementation is that de-/un-normalized data is *the reason* data stores go to TB. Refactor to BCNF, and throw away an order or two of magnitude of datastore. That is the point of these New Rules; it's NOT just being sequentially faster, it's being randomly faster. That's the win which motivates proper design. It's what removes the Chinese Wall between "logical design" and "physical design". Those who haven't figured that out should ruminate further on why two versions of "design" were ever defined.
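
A back-of-envelope illustration in Python (made-up rows; sys.getsizeof is crude, but the direction is the point):

import sys

# Denormalized: every order row repeats the customer's name and address.
orders_denorm = [
    ('Acme Corp', '1 Main St, Springfield', 'widget', 2),
    ('Acme Corp', '1 Main St, Springfield', 'gadget', 5),
    ('Acme Corp', '1 Main St, Springfield', 'sprocket', 1),
]

# BCNF: the customer facts are stored once and referenced by key.
customers = {1: ('Acme Corp', '1 Main St, Springfield')}
orders = [(1, 'widget', 2), (1, 'gadget', 5), (1, 'sprocket', 1)]

flat = sum(sys.getsizeof(x) for row in orders_denorm for x in row)
norm = (sum(sys.getsizeof(x) for v in customers.values() for x in v)
        + sum(sys.getsizeof(x) for row in orders for x in row))
print(flat, '>', norm)   # redundancy, not raw data, inflates the store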

art src

Posts: 33
Nickname: articulate
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 21, 2009 4:34 PM
> Refactor to BCNF, and throw away an order or two of
> magnitude of datastore. That is the point of these New
> Rules; it's NOT just being sequentially faster, it's being
> randomly faster. That's the win which motivates proper
> design.

Since computers are so fast, you can (and often should) use a solution that is not optimal from a performance perspective. As in your example, sometimes the 'right' solution is even faster than the 'wrong' one.

New rules might move the bar, but they don't take away the overhead. Storing Order Items physically with Orders will be faster when they are accessed together. One access of an SSD is still faster than the 2 or 10 or however many it takes to reassemble the parts.


> It's what removes the Chinese Wall between
> "logical design" and "physical design". Those who haven't
> figured that out should ruminate further on the very
> notion of why two versions of "design" have been defined.

The 1s and 0s on the disk are different from the logical model of relations, tuples, domains, etc. That is a good thing. A key point of the relational model is to present the database user with a single logical model, together with freedom of choice in the physical implementation.

SQL databases have failed to implement the relational model. They have failed to allow efficient physical clustering of objects while maintaining the relational logical model. This is clear in "A Relational Model of Data for Large Shared Data Banks", the original paper, where Codd enumerates the possible clusterings for parts and projects and says the logical model should be independent of the physical model.

Those with large personal investments in their SQL skills claim that decades of SQL dominion show that the relational model is an essential technology, and they berate whippersnappers with their new-fangled NoSQL tools. Ironically, those decades of SQL tyranny actually show that relational ideas, while attractive to me, are not a prerequisite for data-management ascendancy.
