Robert McIntosh wrote a thought-provoking piece on designing a scalable Web application without a database. I share three reasons why such a notion has merit.
In questioning widely shared wisdom about building data-driven Web sites, Robert McIntosh penned a thought-provoking, albeit controversial, piece in Building a high volume app without a RDMS or Domain objects.
McIntosh's thesis centers on three observations. The first is that true scalability can best be achieved in a shared-nothing architecture. Not all applications can be designed in a completely shared-nothing fashion—for instance, most consumer-facing Web sites that need the sort of scaling McIntosh envisions require access to a centralized user database (unless a single sign-on solution is used). But a surprising number of sites could be partitioned into sections with little shared data between the various site areas.
McIntosh's second observation is that a new class of ready-to-use infrastructure is becoming available that makes horizontal scaling an economically feasible option. While amassing an array of lightly-used servers would have been considered a waste of resources just a few years ago, OS-level virtualization techniques have turned such seeming waste into an economic advantage: instead of having to architect, configure, develop, and maintain a scalable software architecture, one can possibly build out, or use, a scalable (virtual) hardware architecture. The conceptual difference is merely that scaling is pushed into a lower infrastructure layer.
The example he mentions is Amazon's EC2 compute cloud:
The hardware architecture would be very similar and based on Amazon’s Elastic Cloud and S3 services. The idea being that the data would reside in S3 in a text format (we’ll use XML for sake of argument), with the actual site work running off of Elastic Cloud instances.
McIntosh's final observation is that although modern Web frameworks speed up development already, a new level of rapid development can possibly be reached by managing data in plain files, such as XML:
Rapid development. Isn’t that why we have Ruby on Rails, Grails and other frameworks like these? True, but how valuable are the domain driven OO frameworks that have dumb domain objects? I will concede that with some applications, working with the data is easier using objects that say a tabular model as the data is naturally hierarchal in nature. Then again, that is where the XML/JSON data models can fit. Also, frameworks like Ruby on Rails, Grails and ORMs like Hibernate, JPA, entity beans, etc. are most valuable when you need a full CRUD application. While both of these scenarios have CRUD operations, they aren’t data entry CRUD apps in the traditional sense.
McIntosh then outlines a system architecture that relies on possibly many server instances serving up and managing plain files, most likely in some structured format, such as XML or JSON. The system does not have a domain objects layer—instead, the controller layer presumably translates incoming requests to some file-related operation, such as reading a file or changing the contents of a file. And the outgoing operations are simply a matter of transforming file-bound data to a format needed by the presentation layer, such as XML, XHTML or, again, JSON.
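The controller-to-file translation McIntosh describes could be sketched as follows. This is a minimal illustration, not his implementation: the `data` directory stands in for S3, and the `handle_get`/`handle_update` names and JSON format are my own assumptions.

```python
import json
from pathlib import Path

# Hypothetical local directory standing in for a bucket of plain data files.
DATA_DIR = Path("data")

def handle_get(resource_id: str) -> str:
    # No domain-object layer: the raw file contents go straight
    # to the presentation layer for transformation.
    return (DATA_DIR / f"{resource_id}.json").read_text()

def handle_update(resource_id: str, fields: dict) -> None:
    # An incoming request maps directly to a file operation:
    # read the file, merge the changed fields, write it back.
    path = DATA_DIR / f"{resource_id}.json"
    record = json.loads(path.read_text())
    record.update(fields)
    path.write_text(json.dumps(record))
```

The point of the sketch is that the entire "model" is the file itself; the controller owns the read-modify-write cycle that an ORM would otherwise mediate.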
McIntosh's suggestion could be quickly dismissed as simplistic, ignoring many decades of data management and application development wisdom, but for three reasons:
First, scalable data management has increasingly come to mean in-memory databases. Oracle's recent purchase of Tangosol, a leading distributed cache vendor, and earlier purchase of another in-memory database shop, TimesTen, are but two indications that in-memory data management is here to stay. With falling RAM prices, it's possible to load several GB of data into main memory. Once that data is in memory, it is perhaps less important whether the data is accessed via a relational database layer or by application-level code that co-habits the same memory space. Tangosol's Coherence product, for instance, provides its own API for accessing cache-resident data. Other in-memory databases provide SQL or some other data-access API.
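To make the contrast concrete, here is a toy in-memory store in which application code, not a SQL layer, answers queries over memory-resident records. The class and field names are purely illustrative, not taken from Coherence or any product:

```python
from collections import defaultdict

class MemoryStore:
    """Records loaded once into main memory, then queried by plain
    application code rather than through a database layer."""

    def __init__(self, records):
        # Primary lookup structure, keyed by record id.
        self.by_id = {r["id"]: r for r in records}
        # A hand-rolled secondary index, built for one access pattern.
        self.by_city = defaultdict(list)
        for r in records:
            self.by_city[r["city"]].append(r)

    def get(self, rid):
        return self.by_id[rid]

    def find_by_city(self, city):
        return self.by_city[city]
```

What a relational engine does generically—maintain indexes and execute queries—the application here does with structures tailored to exactly the lookups it performs.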
If an application can de-cluster its data across many server nodes, each node can load portions of the data into main memory and manipulate that data with application-specific code. To be sure, one of the key benefits of the relational model is exactly that it abstracts data storage and access away from application code. Yet, many applications, instead of providing direct database access, expose their data through an API, as in service-oriented architectures. Indeed, shared database access (when one database is shared by several applications) is increasingly the exception.
Another reason to entertain some of McIntosh's notions is that quick access to large amounts of data occurs through indexes—be those indexes managed by a relational database or indexes created ex-database, such as with Lucene. An application relying on, say, XML-based files for data storage could generate the exact indexes it needs in order to provide query execution capabilities over the files. And, in general, ex-database indexes have proven more scalable than database-specific indexes: Not only can such indexes be maintained in a distributed fashion, they can also be custom-tailored to the exact retrieval patterns of an application.
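An inverted index is the basic structure behind such ex-database retrieval—it is what Lucene maintains, though real systems add ranking, persistence, and incremental updates. A minimal sketch, with document ids standing in for XML file names:

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each token to the set of document ids containing it.
    `docs` maps a document id (e.g., a file name) to its text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def query(index: dict, term: str) -> set:
    # Retrieval is a dictionary lookup, independent of any database.
    return index.get(term.lower(), set())
```

Because the application builds the index itself, it can choose exactly which fields to tokenize and how—the custom-tailoring to retrieval patterns mentioned above.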
The final reason to ponder some of McIntosh's thoughts is that, after short access times, the most important requirement for a data-driven site is data availability. As more business-style applications migrate to the Web, the ability to keep data alive at all times is sure to become a central concern of enterprise application development. There is but one sure way to ensure high data availability, and that is replication. But data whose identity is tied in some way to a database management system—by database-specific IDs, for instance—is harder to replicate. Many database products provide replication solutions, but none equal the scalability of simply copying files around a vast distributed filesystem, such as Amazon's S3. If the data represented by such files has globally-unique identifiers, then, theoretically, any node could take over management of those files (keeping in mind some cardinal rules of replication, though).
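The role of globally-unique identifiers can be sketched as below. Here "replicas" are just local directories standing in for S3 buckets, and the `store`/`fetch` functions are my own illustrative names; a real system would also handle concurrent writes and consistency, which this sketch ignores:

```python
import shutil
import uuid
from pathlib import Path

def store(record_text: str, replicas: list) -> str:
    # The record's identity is a UUID, independent of any one DBMS
    # or node—so every replica can refer to it by the same name.
    rid = str(uuid.uuid4())
    primary, *others = replicas
    (primary / f"{rid}.xml").write_text(record_text)
    # Replication is nothing more than copying the file around.
    for r in others:
        shutil.copy(primary / f"{rid}.xml", r / f"{rid}.xml")
    return rid

def fetch(rid: str, replicas: list) -> str:
    # Any node holding a copy can serve the record.
    for r in replicas:
        p = r / f"{rid}.xml"
        if p.exists():
            return p.read_text()
    raise FileNotFoundError(rid)
```

Because nothing about the identifier is tied to a particular server or database, losing the node that originally wrote the file does not make the record unreachable.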
I don't agree with all of McIntosh's ideas, but I do find them interesting, especially as we are confronting new challenges (e.g., the mandate to keep data alive at all times) and presented with new opportunities (Amazon's EC2, inexpensive RAM, in-memory databases). At some point, application architecture will have to change to take those new realities into account. I'm not sure McIntosh is right that file-based shared-nothing design is the path to the future, but real-world data management practices have greatly evolved from the days of the classic, centralized relational database and three-tier application design, and his ideas merit consideration.
What do you think of McIntosh's notions of scaling a Web site without a database? If you don't agree with his ideas, then how would you scale an application to the extent the biggest consumer-facing sites require?
Frank Sommers is a Senior Editor with Artima Developer. Prior to joining Artima, Frank wrote the Jiniology and Web services columns for JavaWorld. Frank also serves as chief editor of the Web zine ClusterComputing.org, the IEEE Technical Committee on Scalable Computing's newsletter. Prior to that, he edited the Newsletter of the IEEE Task Force on Cluster Computing. Frank is also founder and president of Autospaces, a company dedicated to bringing service-oriented computing to the automotive software market.
Prior to Autospaces, Frank was vice president of technology and chief software architect at a Los Angeles system integration firm. In that capacity, he designed and developed that company's two main products: A financial underwriting system, and an insurance claims management expert system. Before assuming that position, he was a research fellow at the Center for Multiethnic and Transnational Studies at the University of Southern California, where he participated in a geographic information systems (GIS) project mapping the ethnic populations of the world and the diverse demography of southern California. Frank's interests include parallel and distributed computing, data management, programming languages, cluster and grid computing, and the theoretic foundations of computation. He is a member of the ACM and IEEE, and the American Musicological Society.