The Artima Developer Community

Artima Developer Spotlight Forum
Database Denormalization and the NoSQL Movement

53 replies on 4 pages. Most recent reply: Sep 28, 2009 8:56 AM by John Zabroski

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 7:14 AM
For an extreme transactional system, you would be right: storing Orders and Order Items together is a matter of life or death for a system that has to complete a complex securities trade as fast as possible. Locality of reference will definitely be the bottleneck in that sort of system. However, most websites are using MySQL, which, compared to other open-source DBMSs like PostgreSQL, is optimized for reads. These websites are not processing securities exchange transactions.

Storing Orders with Order Items turns out to be a physical navigation problem, and so it is just a matter of clustering data together. You may want to read Adam Marcus's recent master's thesis on BlendDB for more details on navigation-friendly RDBMSes. Of course, people wrote similar things in the '80s, but back then the focus was more on query optimization based on pure logical models of data.
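As a concrete illustration of clustering being a purely physical choice, here is a minimal sketch using SQLite (standing in for any RDBMS; table and column names are invented for the example). A WITHOUT ROWID table is stored as a B-tree keyed on its primary key, so items sharing an order_id sit adjacent on disk, yet the query never mentions that layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Logical model: Orders and OrderItems are separate relations.
# Physical choice: a WITHOUT ROWID table is stored as a B-tree keyed on
# its primary key, so rows with the same order_id are physically adjacent.
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_items (
    order_id INTEGER,
    line_no  INTEGER,
    sku      TEXT,
    PRIMARY KEY (order_id, line_no)
) WITHOUT ROWID;
""")
conn.execute("INSERT INTO orders VALUES (1, 'alice')")
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                 [(1, 1, 'widget'), (1, 2, 'gadget')])
# The query is written against the logical model only; the clustered
# storage is invisible to it.
rows = conn.execute("""
    SELECT o.order_id, i.sku
    FROM orders o JOIN order_items i ON o.order_id = i.order_id
    ORDER BY i.line_no
""").fetchall()
print(rows)
```

The same query would run unchanged against a non-clustered layout, which is the logical/physical separation being argued for.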

By fully separating physical design from logical design, you are doing One Fact, One Place, One Time - all the way down.

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 4:10 PM
> New rules might move the bar, but they don't take away the
> overhead. Storing Order Items physically with Orders will
> be faster when they are accessed together. One access of
> an SSD is still faster than 2 or 10 or whatever it takes
> to re-assemble parts.

No. It's still one access, but rather than having to scan the de/un-normalized flat-file image to find the rows, the engine scans only the subset needed for the query. To put it concretely, a 1,000,000-row flat-file image can be stored (assuming for the moment that a cross join is sufficient) in two 1,000-row tables. For the flat file, you would have to scan the entire index (assuming access has an index) each time, and scan the entire row(s) to find the columns. With the two-table implementation, you pick only the rows you need. Since this is SSD, running on a multi-core/processor machine, the engine retrieves the Order and Order Items in parallel, merging on output.
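The cardinality arithmetic in that paragraph can be checked directly. A hedged sketch in SQLite (table names are arbitrary): two 1,000-row tables reproduce the 1,000,000-row flat image via a cross join, while the normalized store holds only 2,000 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY);
CREATE TABLE b (id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1000)])

# The flat-file image materializes every combination: 1000 x 1000 rows.
flat_rows = conn.execute(
    "SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
# The normalized store keeps only the two base tables: 2,000 rows.
stored = conn.execute(
    "SELECT (SELECT COUNT(*) FROM a) + (SELECT COUNT(*) FROM b)"
).fetchone()[0]
print(flat_rows, stored)
```

The 500x difference in stored rows is the "minimal cover" point made elsewhere in the thread: nothing is duplicated in the normalized image.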


> They have failed to allow efficient physical clustering of
> objects while maintaining relational logical model.

Not even close to the truth. Nothing in the Relational Model demands any physical implementation. That choice was intentional. Codd had seen what IMS did: implement a simple hierarchical structure (presaging sgml/xml as data store) for which a specification was created post hoc. This post-hoc approach is manifest in the current fascination with xml data stores. They also mix data with metadata, but that's another story.


> Those with large personal investments in their SQL skills
> claim that decades of SQL dominion show that the
> relational model is an essential technology.

Only some SQL database vendors have the temerity to claim that such engines are Relational. Factoid: Chamberlin of XQuery fame is also the perpetrator of SQL; Codd was explicitly not allowed to implement his data language.

> They berate
> whippersnappers with their fangled NoSQL tools.

Be specific: Olde Fangled. If you like Olde Fangled, fine, but that doesn't mean that NoSql (et al) represent progress in data modeling. Sorry.


> Ironically the decades of the SQL tyranny actually show
> that the relational ideas, while attractive to me, are
> not a prerequisite for data management ascendancy.

Why is it tyranny? Other than you sound annoyed? If you want a technical treatment of data modeling, I suggest "Applied Mathematics for Database Professionals", de Haan and Koppelaars.

OTOH, I have found (virtually) universally that those taking the xml/flat approach to data are inherently coders. Which makes perfect sense: the xml/flat approach requires gobs of lines of application code to manage all those bytes, and since there is all that application code managing bytes, such applications are nicely siloed from the outside world (and from other applications which might want to use the data), thus necessitating gobs more code in order to communicate the data between any needful applications. Which is *precisely* the problem Dr. Codd recognized and set out to solve. He did. Those who turn their noses up at what he invented are reactionary, in the political/historical sense.

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 4:28 PM
> For an extreme transactional system, you would be right -
> storing Orders and Order Items together is a matter of
> life-or-death for a system that has to complete a complex
> securities trade as fast as possible.

No. For a Trust in the Rust(tm) data store, there may be such cases. But for the SSD machine, quite explicitly, no. For the SSD machine, the BCNF image will be the most parsimonious (one might call it the minimal cover) number of bytes; nothing is duplicated. It will write the smallest number of bytes, whether stored on rust or in silicon. It will implement the shortest transaction duration, on SSD. It will lock any indexes for the shortest amount of time, on SSD. It will lock the fewest index entries, on SSD. And so on. (Most of those statements are also true even when one Trusts in the Rust(tm), only slower.)


> Locality of
> reference will definitely be the bottleneck in this sort
> of system.

Only for Trust in the Rust(tm) stores. Not for SSD; locality doesn't matter, that's the whole point.


> Storing Orders with Order Items turns out to be a physical
> navigation problem, and so it is just a matter of
> clustering data together.

Again, only when one Trusts in the Rust(tm).


> You may want to read Adam
> Marcus's recent master's thesis on BlendDB for more details
> on navigation-friendly RDBMSes.

Again, not a concern with the SSD multi-machines. IMS and other such hierarchical data stores are only more efficient, when one Trusts in the Rust(tm), for one access path (which is baked into the hierarchy definition; and it is merely a definition, not a model; again, an error Dr. Codd recognized and set out to solve).

> Of course, people wrote
> similar things in the '80s,

No. The 1960s: IMS was developed starting in 1966 for the Apollo program.

> but back then the focus was
> more on query optimization based on pure logical models of
> data.

No. It was based on one-off physical definitions.

>
> By fully separating physical design from logical design,
> you are doing One Fact, One Place, One Time - all the way
> down.

Now you're talkin' my talk. The advantage of SSD multi-machines is that the logical and physical models are identical. This strips out lots of duplicated data, and all that application code needed for its management. The engine does the management.

I'm not kidding when I say that these machines are completely disruptive. They will make gobs of coders redundant, and they're already whining.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 4:51 PM
Robert,

CBOs (cost-based optimizers) came around in the '80s. People were realizing the advantages of RDBMSes but were still stuck with CODASYL and other similarly designed systems, so researchers like Gray figured out ways to let DB admins write purely relational queries on a hierarchical DBMS. It required setting up substantial plumbing, much more so than SQL physical-schema definition constructs require today.

My comments on physical schema, if you notice, were largely targeted towards recommending temporal constraints on how the data comes into the system.

I wasn't aware, though, of SQL DBMS engines *not* building indexes and relying only on the management engine to decide access to data. Is that what you mean?

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 5:02 PM
By the way, something not mentioned so far in this thread,

is that these days it is very rare that we are I/O bound. We can't get enough processors or enough network bandwidth. The constant speed of light on the network is probably our biggest bottleneck. I wish we had the $$$ to distribute out and do localized caching, but we don't. However, low defect rates keep customer satisfaction at 100%. That is something I don't think NoSql folks are prepared to back up.

The other bottleneck I face today is modeling. Changing the logical design around is a bit of a pain, mainly because first-rate modeling tools integrate poorly with source languages.

So these are the problems I need solved, in order to keep my customers happy. Sorry, NoSql.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 5:04 PM
Er, I might add: when the network is your bottleneck, improper protocol design will be THE dominant source of delay. Efficient exchange of data, whether client-server or client-gateway-gateway-client, is a big thing.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 5:15 PM
BTW, Robert, one last question...

If coders are less valuable... then what is your argument for why the DBA is more valuable? Seems like ever since the .COM-Bust you guys have been making much less, taking much bigger paycuts than developers. Do you see SSD further cutting into your salary?

The only thing I can think of is that you'd argue proper BCNF design is even more important now, so it filters out bad DBAs from good DBAs?

art src

Posts: 33
Nickname: articulate
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 7:01 PM
> No. It's still one access, but rather than having to scan
> the de/un normalized flat file image to find the rows, the
> engine scans only the subset needed for the query.

You say one then...


> To put
> it concretely, a 1,000,000 row flat file image can be
> stored (assuming for the moment that a cross join is
> sufficient) in two 1,000 row tables. For the flat file,
> you would have to scan the entire index (assuming access
> has an index) each time. You have to scan the entire
> row(s) to find the columns. With the two table
> implementation, you pick only the rows you need.

But there are two tables, so I say two.


> Since
> this is SSD, running an a multi-core/processor machine,
> the engine retrieves the Order and Order Items in
> parallel, merging on output.

This sounds like two physical retrieves: one for the order and one for the items. And that is if you are lucky and the items for the order are clustered together. Plus an operation to connect the two. Of course, since this is an SQL database, this is a B-tree, and you are jumping around index pages too. Plus there is the merge operation. SQL databases are good at all this, but there is a reason they are the slowest things in most applications.


> > They have failed to allow efficient physical clustering
> of
> > objects while maintaining relational logical model.
>
> Not even close to the truth. Nothing in the Relational
> Model demands any physical implementation. That choice
> was intentional. Codd had seen what IMS did: implement a
> simple hierarchical structure (pre-saging sgml/xml as data
> store) for which a specification was created post-hoc.

I think the author of the relational model wanted some physical implementation latitude. He said systems must maintain the logical model when the physical model changes, which implies a physical model that can change in some ways.

The relational model requires that queries not break when the physical model changes. Solving this problem by banning any changes to the physical model would give any system this property. So I argue that a truly relational system allows a variety of physical models with one logical model. SQL systems do this well with indexes, where new indexes are often used transparently.

I hope you now agree that storing Orders and Order Items independently requires two retrieves, rather than the one required if they were stored together.

Why can't I have the single-retrieve physical model share the same logical model as other physical implementations, where, for example, the order items are stored separately from orders?

People do de-normalize the physical model for read performance. The promise of the relational model is that you can optimize the physical model, and clients don't change because the logical model does not change. The SQL products don't support this kind of thing well. What you could do is create a materialized view with the de-normalized model; if the SQL optimizer would use it, that would solve the problem. I don't think the SQL optimizers do that.
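The materialized-view idea can be sketched by hand. SQLite has no MATERIALIZED VIEW, so the following simulates one with a denormalized read copy kept in sync by a trigger (all table, trigger, and column names are invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_items (order_id INTEGER, sku TEXT);

-- Hand-rolled 'materialized view': a denormalized read copy,
-- kept in sync by a trigger on the base table.
CREATE TABLE order_flat (order_id INTEGER, customer TEXT, sku TEXT);
CREATE TRIGGER items_ai AFTER INSERT ON order_items
BEGIN
    INSERT INTO order_flat
    SELECT o.order_id, o.customer, NEW.sku
    FROM orders o WHERE o.order_id = NEW.order_id;
END;
""")
conn.execute("INSERT INTO orders VALUES (1, 'alice')")
conn.execute("INSERT INTO order_items VALUES (1, 'widget')")
flat = conn.execute("SELECT * FROM order_flat").fetchall()
print(flat)
```

Note the gap being complained about: a reader must query `order_flat` explicitly; the optimizer will not transparently rewrite a join over the BCNF tables to use the denormalized copy.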


> This post-hoc approach is manifest in the current
> fascinating xml data stores. They also mix data with
> meta data, but that's another story.
>
>
> > Those with large personal investments in their SQL
> skills
> > claim that decades of SQL dominion show that the
> > relational model is an essential technology.
>
> Only some SQL database vendors have the temerity to claim
> that such engines are Relational. Factoid: Chamberlin of
> XQuery fame is also the perpetrator of SQL; Codd was
> explicitly not allowed to implement his data language.

The inventor of relational theory stated on a number of occasions that SQL databases are not relational. This does not prevent people from talking about relational databases as though they exist:

http://www.reddit.com/r/programming/comments/9ijmr/help_pick_a_better_name_for_nosql_movement_at/c0cwvf6

I am saying:

1. SQL databases are not relational in significant ways.
2. SQL databases are useful.
3. Therefore some non-relational databases are useful.

art src

Posts: 33
Nickname: articulate
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 22, 2009 9:52 PM
> For an extreme transactional system, you would be right -
> storing Orders and Order Items together is a matter of
> life-or-death for a system that has to complete a complex
> securities trade as fast as possible. Locality of
> reference will definitely be the bottleneck in this sort
> of system. However, most websites are using MySQL, which
> compared to other open source DBMSs like PostreSQL, is
> optimized for reads. These websites are not processing
> security exchange transactions.

If you are in a situation where you can use 20 machines for a feature with locality, and 10 without, you can save quite a bit of power, etc. This can make the difference between a feature being economically worthwhile and not.

I have heard enterprise developers suggest that web sites use enterprise technology like big SQL servers or even mainframes. While those systems might be able to handle the load web sites have, they are not very efficient and often don't fit the cost structures of online properties for some use cases.


> By fully separating physical design from logical design,
> you are doing One Fact, One Place, One Time - all the way
> down.

By using SQL and an application language, I tend to do one fact, 2+ places, and a mapping.
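The "one fact, 2+ places, and a mapping" complaint can be made concrete. A hedged sketch (class and table names are invented): the shape of an order is declared once in the schema, again in the application language, and a third piece of code keeps them in step by hand.

```python
import sqlite3
from dataclasses import dataclass

# Place 1: the fact's shape lives in the SQL schema ...
DDL = "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)"

# Place 2: ... and again in the application language ...
@dataclass
class Order:
    order_id: int
    customer: str

# ... plus the mapping, maintained by hand, that ties the two together.
def load_order(conn, order_id):
    row = conn.execute(
        "SELECT order_id, customer FROM orders WHERE order_id = ?",
        (order_id,)).fetchone()
    return Order(*row)

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute("INSERT INTO orders VALUES (1, 'alice')")
print(load_order(conn, 1))
```

Rename a column in either place and the other two must be edited too; that duplication is what the poster is pointing at.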

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 23, 2009 6:22 AM
> BTW, Robert, one last question...
>
> If coders are less valuable... then what is your argument
> for why the DBA is more valuable?

The term DBA is ambiguous, and always has been. It ranges from a guy who changes backup tapes, to one who grants permissions, to another who designs the schema. The ones I'm concerned with are the latter. Some places use the explicit term Database Developer, but most still lump the responsibilities into the DBA title. In the Brave New World with New Rules, BCNF data and a screen-code generator become the development paradigm. In that paradigm, the database designer has most of the responsibility; the coder tweaks the generated code to, say, change the list box to a radio set or whatever. Pixel-dust maintenance, basically.


> Seems like ever since
> the .COM-Bust you guys have been making much less, taking
> much bigger paycuts than developers. Do you see SSD
> further cutting into your salary?

On the contrary. The backup-oriented DBAs may feel some heat, what with storage being just a dozen SSDs in RAID 1 rather than 5,000 HDDs in RAID 10 or RAID 5. The management functions, grants and reorgs and the like, are still needed.


> The only thing I can think of is that you'd argue proper
> BCNF design is even more important now, so it filters out
> bad DBAs from good DBAs?

Filters out repetitive work, both in code and in DBA work. BCNF design drives the application: code from the screen generator doesn't require lots of hand work, the database doesn't require lots of triggers for maintenance, etc. IIRC, Hibernate is beginning to go that way, too. It is logical, to paraphrase a Vulcan of some repute.

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 23, 2009 7:00 AM
> But there are two tables, so I say two.

Again, when stored on SSD, no matter what the join type (nested loop, sort/merge, etc.) retrieving the parts of the Full Order happens all at once for each Order; it's as if the Order is stored "together" since physical location of bytes on the dies is irrelevant. In fact, the location of bytes is always changing as the controller moves data for wear leveling.


> > Since
> > this is SSD, running an a multi-core/processor machine,
> > the engine retrieves the Order and Order Items in
> > parallel, merging on output.
>
> This sounds like two physical retrieves, one for the order
> and one for the items. This is if you are lucky and the
> items for the order clustered together.

Clustering is not relevant on the SSD; all accesses are equally timed, random or sequential doesn't matter. This is one of the advantages to the SSD. The controller routinely moves data about on the dies for wear leveling purposes; you never know from one microsecond to the next where the physical location of bytes are, and it doesn't matter.


> Plus an operation
> to connect the two. Of course since this is an SQL
> database this is a BTree and you are jumping around index
> pages too.
> Plus there is the merge operation. And SQL databases are
> good at all this, but there is a reason they are the
> slowest things in most applications.

They are slow due to access when you Trust only in the Rust(tm). Data retrieval in SSD changes the equation.


> I hope you now agree that storing Orders and Order Items
> independently requires two retrieves, rather than one, if
> they were stored together.

I don't agree when run on an SSD multi-machine with an industrial strength engine (MySql is not). For such a machine and such an engine, a query plan for a join picks up the Items with each Order simultaneously. Again, with SSD, there is no guarantee that data will be stored in a clustered fashion in the silicon. What the controller presents to the OS and application is not what's on the silicon. The redirection of bad blocks in HDD is a simplified version of what goes on with SSD. With SSD, the controller moves data as needed.


> Why can't I have the single retrieve physical model share
> the same logical model as other physical implementations,
> where for example the order items are stored separately
> from orders.

If you implement "single retrieve physical model", by which I assume you mean a de/un normalized "row" with Order/Order Item columns for each Item, you get all the problems of flat files. So no, you can't have it both ways.


> People do de-normalize the physical model for read
> performance. The promise of the relation model is that
> you can do optimization of the physical model and clients
> don't change because the logical model does not change.
> The SQL products don't support this kind of thing well.
> . You could do is create a materialized view with the
> de-normalized model and if SQL optimizer would use it that
> would solve the problem. I don't think the SQL optimizers
> do that.
>

All of this is based on the assumption of the Trust Only in the Rust(tm) storage paradigm. The SSD paradigm is different. There is no join penalty. So there is nothing to be gained for read-only access of Order/Order Item, whether as base table or materialized view.


> I am saying:
>
> 1. SQL database are not relational in significant ways.
> 2. SQL database are useful
> 3. Therefore some non relational databases are useful

Your "therefore" does not follow from your premises. What follows from your premises: build a truly relational language to replace SQL (there are such efforts).

What coders find useful about xml/flat structures is the same as COBOL coders did with IMS in 1966: they can embed ONE specific access path useful to the code in the data structure. That is not a reason to abandon an 85% relational solution (SQL). With SSD (and CTE supportive databases), there is no longer any reason to bow to xml/flat data storage. And this is apples and oranges, anyway.

The failure of SQL to be faithful to Codd is not what drives coders to xml, or what motivates the join penalty (so called) with Trust Only the Rust(tm) storage. Not writing joins (SQL language or otherwise) is a reaction to a physical implementation failure, which no longer applies with SSD multi-machines. Coders flocked to xml because it meant they could wrest control of the data into each application's own little silo, just as COBOL coders had it in the 1960's with IMS. There was no intellectual decision, only a hegemonic one. That, and few if any coders have ever read Codd or Date or Gray, much less understood them. Try Weikum and Vossen for a true mind bender.

John Zabroski

Posts: 272
Nickname: zbo
Registered: Jan, 2007

Re: Database Denormalization and the NoSQL Movement Posted: Sep 23, 2009 8:38 AM
> > By fully separating physical design from logical design,
> > you are doing One Fact, One Place, One Time - all the way
> > down.

> By using SQL and an application language I tend to do one
> fact, 2+ places and a mapping.

I don't think you understand time, or the importance of it, then.

How are you synchronizing deployment upgrades? Oh, wait, I get it, you are doing the "Fortune 500 quality controls" model where it takes 6 weeks to make a change "Safely" and then FUBAR! something still goes wrong?

That is not only unnecessary it is downright counterproductive.

Yesterday I read a Linux Mag article about MongoDB. The lead developer, Eliot Horowitz, completely misused the word "schema-free". What he was actually saying was that MongoDB does not support an online data dictionary, and he was trying to argue that this was somehow a virtue. Too funny.

Alpha Dog

Posts: 7
Nickname: alphadog
Registered: Dec, 2008

Re: Database Denormalization and the NoSQL Movement Posted: Sep 23, 2009 1:31 PM
> > > By fully separating physical design from logical
> design,
> > > you are doing One Fact, One Place, One Time - all the
> way
> > > down.
>
> > By using SQL and an application language I tend to do
> one
> > fact, 2+ places and a mapping.
>
> I don't think you understand time, or the importance of
> it, then.
>
> How are you synchronizing deployment upgrades? Oh, wait,
> I get it, you are doing the "Fortune 500 quality controls"
> model where it takes 6 weeks to make a change "Safely" and
> then FUBAR! something still goes wrong?
>
> That is not only unnecessary it is downright
> counterproductive.
>
> Yesterday I read a Linux Mag article about MongoDB. The
> lead developer, Eliot Horowitz, completely misused the
> word "schema-free". What he was actually saying was that
> MongoDB does not support an online data dictionary, and he
> was trying to argue that this was somehow a virtue. Too
> funny.

Yes! That article is a shame. Lines like "While often called the NoSQL movement, the need for new technologies is caused by the relational model, rather than SQL" always make me bow my head in shame for my profession.

It's at best the complete opposite of reality.

robert young

Posts: 361
Nickname: funbunny
Registered: Sep, 2003

Re: Database Denormalization and the NoSQL Movement Posted: Sep 23, 2009 6:06 PM
> Yes! That article is a shame. Lines like "While often
> called the NoSQL movement, the need for new technologies
> is caused by the relational model, rather than SQL" always
> make me bow my head in shame for my profession.

Man the barricades, take up arms against the oppressors, and so on. Sometimes a mob of knuckleheads, like lemmings, get what they wish for.

art src

Posts: 33
Nickname: articulate
Registered: Sep, 2005

Re: Database Denormalization and the NoSQL Movement Posted: Sep 24, 2009 5:43 AM
> Not even close to the truth. Nothing in the Relational
> Model demands any physical implementation.

But the relational model does demand that I have the relational logical view on top of a variety of physical implementations, and that the relational view must not change when the physical layer does. Developers have been telling us what physical layer they want by the way they de-normalize. Good support for a relational model would allow that physical model and would hide it properly from the queries, which should use the BCNF model.


> If you implement "single retrieve physical model", by
> which I assume you mean a de/un normalized "row" with
> Order/Order Item columns for each Item, you get all the
> problems of flat files. So no, you can't have it both
> ways.

How about I create a materialized view that contains the join of orders and items. Then I create an optimizer smart enough that, when I write the join against the BCNF model, it uses the materialized view. Is that having it both ways? It does not sound very difficult at all.



> All of this is based on the assumption of the Trust Only
> on Rust(tm) stored paradigm. The SSD paradigm is
> different. There is no join penalty. So, nothing to be
> gained for Read Only Access of Order/Order Item, whether
> as base table or materialized view.
>
>
> > I am saying:
> >
> > 1. SQL database are not relational in significant ways.
> > 2. SQL database are useful
> > 3. Therefore some non relational databases are useful
>
> Your therefore does not follow from your axioms. What
> follows from your axioms: build a truly relational
> language to replace SQL (there are).

Building a truly relational system has been an obvious idea for a long time. Since it does not solve any significant problems, it is a route to failure. Building a non-relational scalable system like BigTable does solve real problems, so it has been important.

> The failure of SQL to be faithful to Codd is not what
> drives coders to xml,
> or what motivates the join penalty
> (so called) with Trust Only the Rust(tm) storage. Not
> writing joins (SQL language or otherwise) is a reaction to
> a physical implementation failure, which no longer applies
> with SSD multi- machines.

Many joins have been entirely in memory for a long time.

This stuff is from the last century: "You can speed up SQL SELECTs simply by buying so much RAM that the entire database is in memory."

http://philip.greenspun.com/wtr/dead-trees/53011.htm

The durability of SSDs has some advantages, but in terms of access time they don't beat memory. I don't see many people caching their entire database and then re-normalizing, and they have had many years to do it. Compared to disc, SSDs look good. Compared to memory cache + disc? Ho-hum, nothing disruptive there.
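The "buy enough RAM" tactic in that quote can be sketched with SQLite's backup API: snapshot an on-disk database into a `:memory:` connection and serve reads entirely from RAM (the file name and table are invented for the example).

```python
import os
import sqlite3
import tempfile

# Stand-in for a real on-disk database file.
path = os.path.join(tempfile.mkdtemp(), "orders.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE t (x INTEGER)")
disk.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
disk.commit()

# Copy the entire database into a RAM-resident connection.
ram = sqlite3.connect(":memory:")
disk.backup(ram)

# All subsequent reads touch no disk (or SSD) at all.
total = ram.execute("SELECT SUM(x) FROM t").fetchone()[0]
print(total)
```

This is the memory-cache-plus-disc configuration the poster is comparing SSDs against: once the working set is RAM-resident, SSD access times are no longer on the read path.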


> Coders flocked to xml because
> it meant they could wrest control of the data into each
> application's own little silo, just as COBOL coders had it
> in the 1960's with IMS. There was no intellectual
> decision, only a hegemonic one. That, and few if any
> coders have ever read Codd or Date or Gray, much less
> understood them. Try Weikum and Vossen for a true mind
> bender.

I think more people have read Codd than most other computer writing.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use