Sponsored Link •
Relational databases offer a rich toolset to help ensure the integrity of data. Apart from normalized schema, however, few enterprise applications take advantage of those tools and, instead, tend to implement integrity checking inside application layers. If SQL database constraints are useful, why aren't more applications taking advantage of them?
An important trend in data management, especially on the Web, is to use relaxed SQL schemas—or to eschew the principles of relational data design altogether. A recent Artima discussion focused on the "NoSQL" movement, as it were—and the general conclusion seemed to be that relational constraints are very useful, indeed, and that developers should bypass them at their peril.
That's hardly a surprise, given that normalized relational data models came about as a way to help ensure data integrity. Users, in general, like to think of databases as authoritative sources of data: an underlying notion for popular data access methods, such as SQL queries, is that the return values represent a well-defined notion of "truth." Preserving that notion is much harder when relational constraints are relaxed, which puts most of the integrity maintenance burdens on application-level code.
While much of the discussion on NoSQL centers around database de-normalization—relaxing SQL normal forms that results in possible data redundancy—modern database management systems, as well as the SQL standard, offer many additional mechanisms to ensure data integrity. In perusing the latest edition of Joe Celko's SQL for Smarties—a highly recommended read—it became clear to me that I'm using but a small portion of SQL's power to ensure data integrity. For instance, I almost never take advantage of the SQL
check mechanism when declaring database tables, and I forgo other now standard SQL facilities, too, to, in effect, keep the database clean.
Although I believe in the value of pushing as much data management as possible into a database, I tend to implement a lot of data integrity-related code in the domain and controller layers. Of course, all that code needs to be tested, too. I recently surveyed a few dozen test suites I wrote using ScalaTest, encompassing several hundred test cases, and noticed that no less than 30% of our test cases had to do with enforcing data integrity constraints at the application layer. An example of that is making sure that API-level preconditions are enforced—checking for the proper handling of
null values, for instance. And if a complex object is passed as a parameter to a method, I tend to write tests to make sure the method can gracefully accommodate method parameter objects in unacceptable states.
We do all that, even though we also use Hibernate Validator in the entity layer, carefully annotating entity classes with required validation constraints. Finally, we also write a significant amount of integrity-checking code in the presentation layer—Flex, in our case—to ensure that users submit valid data to the server-side API. And that presentation-layer code has tests suites, too. In total, therefore, we encode data integrity the database, the entity layer, and in the presentation layer, with significant portions of our test cases devoted to covering integrity-checking code in the controller layer and in the presentation as well.
That's a lot of code that clearly violates the DRY principle. The closest to non-redundant integrity constraints I've seen was in ActiveRecord, Rails' O/R mapping and persistence layer. ActiveRecord annotates entity-layer classes with integrity constraints, and Rails' controller and presentation layers use that data to, for instance, display nice error messages to the user upon incorrect data entry. However, as with Hibernate Validator, ActiveRecord barely takes advantage of the underlying relational database's expressive power: Constraint-related SQL generated by both frameworks is limited mostly to foreign-key relations and non-null field declarations. The rest of the integrity checks are performed by a sort of validation layer (as part of the entity classes).
I think that in a more ideal situation the database should be the guardian of the data, including constraint enforcement. That can be achieved today by writing lots of hand-coded SQL, which may not be an effective solution because most SQL databases rely on SQL error codes to indicate constraint violations. For instance, setting a constraint so that the
salary field can't accept a value less than 5000 and greater than 1000000 will cause a database to return an error code when attempting insert a record with a
100—but that error code would then have to be translated into some usable value in other layers of the application. The effort involved in that translation pays few dividends, especially since it's not portable across relational database products.
Where do you think is the proper place to define data constraints? Do you agree that SQL should provide more standardized error reporting for integrity violations?
|Frank Sommers is a Senior Editor with Artima Developer. Prior to joining Artima, Frank wrote the Jiniology and Web services columns for JavaWorld. Frank also serves as chief editor of the Web zine ClusterComputing.org, the IEEE Technical Committee on Scalable Computing's newsletter. Prior to that, he edited the Newsletter of the IEEE Task Force on Cluster Computing. Frank is also founder and president of Autospaces, a company dedicated to bringing service-oriented computing to the automotive software market.
Prior to Autospaces, Frank was vice president of technology and chief software architect at a Los Angeles system integration firm. In that capacity, he designed and developed that company's two main products: A financial underwriting system, and an insurance claims management expert system. Before assuming that position, he was a research fellow at the Center for Multiethnic and Transnational Studies at the University of Southern California, where he participated in a geographic information systems (GIS) project mapping the ethnic populations of the world and the diverse demography of southern California. Frank's interests include parallel and distributed computing, data management, programming languages, cluster and grid computing, and the theoretic foundations of computation. He is a member of the ACM and IEEE, and the American Musicological Society.