Chapter 4: Database as a Fortress



If you ask a bunch of CEOs what their company’s greatest asset is, most will tell you it’s their people. That certainly sounds nice. It may even be true for companies that don’t deal with data. However, if a CEO who works for a company based around a website says people are her greatest asset, she is definitely lying. Great people got the company where it is today, but now that it’s a success, the most important asset is the data that has accumulated. If Edwards, your super-star coder, gets hit by a bus, it will take you six to eight weeks to train up Henderson or Stevens or Erikson to replace him. However, if you have a data meltdown—one that creeps in slowly, undetected at first, until all your precious data is turned to garbage—be prepared to start over from scratch. Not even your backups can help, because they’re all corrupt, too.

The most important asset of a web-based company is its data. The most obvious type of data to protect is operational data. If you sell goods online, your site is useless if the product descriptions don’t match the products. If you run a social networking site, who will come back if the network links get lost, crossed, or lead to user pages that no longer exist? What good is an online personal information management tool if your to-do list items disappear before you get to check them off yourself?

Historical data corruption is another common and insidious problem. Imagine if you could no longer report on how many units of a particular widget you sold month over month last year simply because you no longer sell that widget today. Was the item’s database record deleted when the item was taken off the shelf, and now the historical data referencing it points to an empty record? Or what if data you think is important actually isn’t? In your hosted blogs site, are you reporting statistics of total comments added site-wide, but half of the comments are for entries that have long since been deleted by the author? Operational data changes with the times, but historical data that references yesterday’s operational data needs to be accessible and accurate today and tomorrow, too.

Most web framework books teach you how to add data to your database, but they don’t teach you how to protect it. This book picks up where those books left off. The remainder of this chapter is intended to help you frame the way you think about databases. Databases are a major part of your entire architecture, not just a place to store application data. The next four chapters show you how to design a solid data model incrementally, and how to integrate it tightly with Rails.

Your Database Is an Application, Too

We tend to think of a web framework as the solution to all problems. Rails in particular tends to abstract other pieces of web architecture away so that Rails itself seems to be the only piece of the puzzle. This is especially true of how Rails abstracts away the database. If it supports Rapid Application Development (RAD), and it has AJAX, then you're all set, right? Your site looks really spiffy, and it only took ten minutes to write. Along the way, you need to store some data, so of course you'll stick it in a database. Add some tables, wave the RAD wand, and off you go. Don't forget to add some unit tests (for the cases you thought of), and everything is hunky-dory. Life couldn't be better than with your shiny new web framework. Rails now ships with SQLite as the default database, so you barely have to think about setting up a database at all. The task of writing DDL has been buried behind migrations. DML, the bread and butter of SQL queries, is abstracted away behind ActiveRecord. Finally, the task of maintaining data integrity is left to ActiveRecord validations.

The problem is that abstracting to this degree requires you to make a few assumptions that are unlikely to be true.

“One Framework to Rule Them All”

There are many frameworks out there besides Rails. There’s PHP/Cake, Drupal, Django, Struts, Perl/Mason, etc. The list does go on and on. If you’re lucky, you’re rewriting your legacy PHP or Java application in Rails right now. If so, one problem you now face while you’re busy implementing the latest JavaScript interface magic is remembering all of those special cases and boundary conditions that led to bugs in your legacy PHP system. It took the previous engineers years to stamp out each pesky software bug, and you have to replicate all of this intricate logic again while also rewriting the interface from scratch so that the new site is 10 times snappier than the old one. Maybe you are painstakingly meticulous and everything turns out all right. But what happens in the next iteration when you switch to the yet newer, more whiz-bang framework? Hopefully your next framework is the next version of Rails, but you get the idea. Software is constantly in flux, but the data you collect over the years is not. Wouldn’t it be nice if you could ensure the integrity of your data without concern for the current software stack sitting on top?

“No One in My Company Will Ever Write a Bug”

The plain and simple truth is that software has bugs. Your application code will change much more frequently than your database schema. When you add new columns to a database table, it’s very easy to forget to add all the appropriate ActiveRecord validations. It’s also easy to comment well-intentioned validations out, but then forget to uncomment them. Finally, there are lots of scenarios for which no ActiveRecord validation exists in the first place—referential integrity constraints being the prime example—so relying solely on ActiveRecord validations to maintain your data’s integrity is simply a recipe for disaster. On the other hand, built-in mechanisms of an RDBMS can make protecting your data easy and worry-free. Accept that your application will have bugs, and leave it up to the data layer to be the final gatekeeper of what is allowed to enter the database.
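
As a concrete sketch (the Comment model, the entries and comments tables, and the column names here are purely illustrative), the ActiveRecord validation below guards only the data that enters through this one application, while the database-level constraints enforce the same rule for every client that ever touches the table:

class Comment < ActiveRecord::Base
  belongs_to :entry
  # Application-level check: easy to forget, comment out, or bypass.
  validates_presence_of :entry_id
end

# Database-level backstop: enforced no matter which program issues the write.
ActiveRecord::Base.connection.execute(<<-SQL)
  ALTER TABLE comments
    ALTER COLUMN entry_id SET NOT NULL;
  ALTER TABLE comments
    ADD CONSTRAINT comments_entry_id_fkey
    FOREIGN KEY (entry_id) REFERENCES entries (id);
SQL

The validation produces a friendly error message in the browser; the constraints are what actually keep an orphaned comment out of the database.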

“This Rails Application Is the Only Application on Earth”

The next assumption is that the application you are writing is the only application that will ever access the data you are storing. Forget about wholesale framework switches here. As your application grows, you will add myriad scripts that run scheduled maintenance tasks to clean up or to summarize data. You will write quick-and-dirty tools that live outside of your website’s main code base. You will even (probably more frequently than you expect) access the database directly through a database client and manipulate your data with raw SQL queries.

In all of these scenarios, you are likely to be bypassing your ActiveRecord validations. Therefore, it's necessary to rethink the main function of these validations. Since the scope of the validations is only the application in which they reside, they cannot possibly be relied upon to protect your data from other rogue programs, or even from a well-intentioned developer sitting in front of a SQL prompt. What the validations do well is generate an interface that gives the user useful feedback before bad input is rejected. And that's the key: validations provide feedback; they do not safeguard data. They can be bypassed, turned off, or easily deleted. Only at the data layer itself can data truly be safeguarded.
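
For example (the model and column names are again hypothetical), neither of the following statements runs a single ActiveRecord validation; a NOT NULL constraint on the column in question would reject both, regardless of which client issued them:

# Bulk update issued through ActiveRecord; validations and callbacks are skipped.
User.update_all("email = NULL")

# Raw SQL from a maintenance script or a psql prompt; validations never enter the picture.
ActiveRecord::Base.connection.execute(
  "INSERT INTO users (first_name) VALUES ('orphan')"
)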

Sit Atop the Shoulders of Giants

The field of database technology is large and mature, and there is ongoing academic research on storing, searching, and making sense of data stored in a database. The commercial Oracle database was first introduced in 1979. The first version of PostgreSQL, the best choice at the time of this writing for an open source RDBMS, appeared in 1989. An unparalleled amount of research and development has gone into these products, and the best-of-breed choices in this area have not changed every few years the way they have for web scripting languages and frameworks.

If you treat the database simply as a place to dump your application’s data for later retrieval, you are shortchanging yourself and your application. When used correctly, not only will your database safeguard your data from the effects of errant code, but it will also afford you aggregation, computation, and retrieval speed that you could never hope to reproduce with even the cleanest or most elegant application code.

It behooves a web application developer to learn not only the ins and outs of the web framework, but also those of the RDBMS atop which that framework sits. Remember that your database, which contains your company's most precious asset—its data—is very likely to outlive the application you write on top of it.

Choosing the Right RDBMS

All examples in this book assume the use of PostgreSQL. For those using Oracle or another database that adheres closely to the SQL standard, the concepts are identical, although some of the syntax may vary slightly. Many features of the SQL standard are not implemented in MySQL, so unfortunately a number of the advanced topics are not possible to implement using MySQL as of version 5.0. For this reason, although MySQL is undeniably popular, it’s not recommended for a serious website.

Why not MySQL? The Rails core team uses MySQL, and it is undeniably more popular within the Rails community than PostgreSQL. This raises two important questions. First, if MySQL isn't as good as PostgreSQL, why is it so popular? And second, if PostgreSQL isn't as popular as MySQL, why is it used for the examples in this book?

MySQL gained popularity for two important reasons. First, although it is open source and free, a company called MySQL AB got behind it to offer support to enterprise customers. When the shift from closed to open-source software began, having this type of insurance was key to adoption of open-source products. For whatever reason, in the early days, PostgreSQL did not have the same level of corporate support offerings as did MySQL – although today there is plenty of support from a variety of vendors.

The second reason for MySQL's ascension is that it always had much simpler point-and-click installers on Windows, whereas PostgreSQL remained, for a long time, the domain of UNIX and Linux users. In the PostgreSQL community this made sense, because databases are hosted on *NIX servers – why would you need point-and-click Windows installers? Of course, many developers and decision makers worked on Windows, and their inability to easily give PostgreSQL a test run often pushed them to MySQL. The lesson is that any barriers you erect between users and your software will be to your detriment.

So MySQL won the popularity contest – why don't we accept that in this book? It's for the same reason that *NIX is the de facto choice for servers, even as Windows dominates the desktop market. PostgreSQL is simply better at doing the job of being an RDBMS. It implements much more of the SQL standard, and more faithfully, and it has a much better query planner than MySQL. MySQL, for its part, retains a variety of strange vestiges from its early days, such as treating the number zero as equal to the absence of a value, NULL. MySQL is also case insensitive by default. Idiosyncrasies like these seem small at the outset, but often come back to haunt you later.

Working with PostgreSQL instills a feeling of safety that MySQL does not. If you haven’t made the switch yet, it’s worth trying PostgreSQL out as you read this book.

A Note on Migrations

One of the strengths of the ActiveRecord Object-Relational Mapping (ORM) library is that it provides an abstraction layer between the application developer and the database for the Data Manipulation Language (DML) components of SQL. With some exceptions, this abstraction provides a convenient interface to most inserts, updates, and deletes that your application will need to perform, and you won't have to worry about syntax peculiarities specific to a particular database product. In many situations, the abstraction is fully adequate, and for the exceptional cases, you can always execute arbitrary SQL to get the job done.

For anyone who has written a database-backed website without an ORM, it’s probably not the absence of SQL, which is a supremely straightforward language, that makes ActiveRecord worthwhile. The real benefit of ActiveRecord is the automatic unmarshalling of results from SQL queries into Ruby objects, an otherwise tedious, manual task that can be painstaking and error-prone. The following code to load a user record, which you would never write in a Ruby on Rails application, is an approximation of what life is like without an ORM:

db_result = ActiveRecord::Base.connection.select_one("
  select first_name, last_name, birthdate, favorite_food
    from users
   where id = #{id.to_i}
")
user = User.new
user.id = id
user.first_name = db_result["first_name"]
user.last_name = db_result["last_name"]
user.birthdate = db_result["birthdate"]
user.favorite_food = db_result["favorite_food"]

With ActiveRecord, the same is accomplished with a single statement:

user = User.find(id)

Abstracting DML provides a fantastic reduction in the amount of rote code that must be written (and rote code that seldom needs special cases, at that). However, one less successful consequence of the desire to fully abstract the database layer has been an attempt to abstract the Data Definition Language (DDL).

DDL statements are those that define tables and sequences, create indexes on tables, and define stored procedures. While DML statements occur throughout an application with every insert, update, or delete statement, DDL statements generally do not appear within applications at all. DDL statements, because they define the structure of your data layer, only get executed once, usually when your application isn’t even running.
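
For illustration, here are a few representative DDL statements in PostgreSQL syntax (the users table and index names are made up for this example). They run once, typically from a psql prompt or a schema setup script, not from the application itself:

-- Define a table with a sequence-backed primary key.
CREATE TABLE users (
  id          serial PRIMARY KEY,
  email       varchar(128) NOT NULL,
  created_at  timestamp NOT NULL DEFAULT now()
);

-- Create an index to speed lookups by email.
CREATE UNIQUE INDEX index_users_on_email ON users (email);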

Therefore, the attempt to abstract DDL through migrations has not been the boon that abstracting DML has been. The first reason is the lack of benefit gained from using migrations rather than writing straight DDL. Whereas abstracting DML provides a huge savings in the amount of repetitive code that must be written to create objects, as shown above, migrations don’t provide any such benefit.

The second reason to be wary of migrations is that as of this writing, they support only a small subset of the DDL language. Just as with DML, you can always mix in some custom DDL with your Ruby migration code, but in this case, the consequence is that you’ll be writing much more code altogether, and the result will be much less succinct than if you had written it all with DDL statements.
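
A brief sketch of the problem: as of the Rails versions discussed here, migrations have no built-in support for foreign key or CHECK constraints, so a migration that needs them must fall back to raw DDL anyway (the table, column, and constraint names below are illustrative):

class AddOrderConstraints < ActiveRecord::Migration
  def self.up
    # The column itself is easy with the migration DSL...
    add_column :orders, :quantity, :integer

    # ...but the constraints still require hand-written DDL.
    execute(<<-SQL)
      ALTER TABLE orders
        ADD CONSTRAINT orders_quantity_positive CHECK (quantity > 0);
      ALTER TABLE orders
        ADD CONSTRAINT orders_customer_id_fkey
        FOREIGN KEY (customer_id) REFERENCES customers (id);
    SQL
  end

  def self.down
    execute("ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey")
    execute("ALTER TABLE orders DROP CONSTRAINT orders_quantity_positive")
    remove_column :orders, :quantity
  end
end

At that point you are writing the DDL regardless; the migration wrapper adds ceremony without adding abstraction.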

Because migrations have had a number of shortcomings, they're under continuous change. Rails 2.1 solves the problem caused by multiple developers trying to write migrations at once; before 2.1, migrations were named in a way that required lots of developer communication. While it's likely that migrations will continue to improve, it's not guaranteed they will ever be a good replacement for DDL, which, as it happens, was designed specifically for the purpose it serves: data definition.

Therefore, in this book, we forget about migrations and build and manipulate our schemas using plain old SQL DDL statements. Every developer should understand SQL DDL, and if your organization uses migrations, it shouldn't be hard to learn the state of the art in migrations and implement your DDL that way. The important thing is to understand what is supposed to be happening beneath the scenes, so you can still write DDL when you need to. Many operations that are easy with DDL are still – and some always will be – impossible with migrations. If you rely on migrations and skip the DDL that migrations don't support, you're shortchanging yourself.

Dispelling Myths

There is a camp of web developers that will tell you the topics in the following three chapters, all of which deal with different types of referential integrity, are unnecessary overhead. Those in this camp will tell you that application level checks are sufficient to protect your data, and that database-level constraints are sure to make your application slow. It turns out that whenever I meet someone from this camp, it just so happens that they are die-hard MySQL users. Not surprising, because until recently, MySQL did not support referential integrity. MySQL, before version 5.0, was not a Relational Database Management System, but rather simply a Database Management System. What these people are actually telling you is that RDBMSs are no better than DBMSs, and when the idea is framed that way, it becomes clear that this camp simply doesn’t fully understand why relational databases exist in the first place.

When MySQL came on the scene, it had wild success because it was well-packaged and easy to install not only on Linux, but on Windows, too. MySQL and PHP swept the Internet community because they were so easy to set up, and you could be up and running and writing web pages in just a few hours. As developers got used to their tools, they got used to the deficiencies in those tools, too. So the lack of a feature became a feature in and of itself. I have actually heard very highly paid consultants say, "MySQL doesn't support referential integrity because you don't need it!"

Since enterprise-level applications do need referential integrity, as well as many of the other features available in more mature RDBMSs (views, transactions, triggers, isolation levels, and so on), MySQL added support for these features in version 5.0. Therefore, the argument that you don't need referential integrity because MySQL doesn't have it no longer holds water. MySQL now has it because you need it.

Another myth worth dispelling is that referential integrity is just training wheels that you should take off when your application is in production. This type of thinking could not be more backwards. Referential integrity constraints certainly do help you find your application's bugs, but it's foolhardy to think you'll find them all before you decide it's time to throw users at the system. This warning holds doubly true once you start releasing updates to an application that is already live. Testing every conceivable use-case, including full regression testing for each release, is next to impossible. On the other hand, your users—including Internet bots, both neutral and malicious—will make your application run the gauntlet. They will find use cases you never imagined. Production is not the place where you want to find referential integrity bugs. But worse, production is also not the place where you want referential integrity bugs to go unnoticed. Your database constraints and referential integrity checks are the last line of defense protecting your data before things go awry. If you want to keep your data intact, you want your constraints to be as complete as they can be.

Operations and Reporting

When you launch your website, it will be the primary (probably the only) consumer of your database. All of the queries your database handles will be related to making your app go. If everything goes well, you’ll have lots of users and you’ll start collecting and generating lots of data about those users and your website’s operation in general.

Around this time, the business development team will start asking you questions. How many new users join your site each day? Each hour? Is there a geographic distribution to your user base? What features are heavily used and which go unused? How many repeat visitors did you have last week, and what was the revenue result of the costly marketing campaign?

The natural thing to do to answer these questions is to start building reports that answer them. You add a report that breaks down new visitors and repeat visitors by day and hour. You add a report that shows access by state—perhaps plotting hits on a U.S. and world map. You add a report that shows revenue events as they relate to different traffic sources—external links versus unreferred traffic versus links from your email campaign. You add more and more reports almost as quickly as your business users can request them. Your business users are delighted. They check the reports frequently to increase their pleasure at how well the website you wrote is functioning. That is, until suddenly, one day, performance plummets. All of the metrics in your reports take a nosedive. The business users, flustered, take to looking at the reports you’ve generated all day, hoping to make sense of what went wrong.

What went wrong is that your reports are killing your site. As your site's popularity and success increase, so does the amount of data your reports need to process. What seemed like a reasonably fast query—maybe 10 seconds to give user statistics for the last month by hour—now takes 30 seconds or maybe even a couple of minutes. And since your company became so successful, you hired more people who are looking at those reports. And since each report now takes minutes to generate, your business users fire off a bunch of simultaneous reports and then go get a cup of coffee. All of this has the effect of bogging down your site and locking out the very users to whom you are trying to serve web pages.

I call this the Heisenberg Uncertainty Principle of Website Reporting. If you try to report out of the same database in which you are collecting your data, the simple act of loading the reports creates abnormal load on your database. That in turn makes your web pages slow, which causes your users to leave your site frustrated, which causes you to load more reports to figure out what's going on, which frustrates yet more users, and so on. Whenever you look, you impact the system in a negative way.

The solution is simply to not run reporting queries on your production database. But that is easier said than done.

A common shortcut many people take around building a data warehouse is to create a slave copy of their database, and run heavy reporting queries there, out of the path of users. This is not a good idea. To understand why, it’s important to be familiar with the difference between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

OLTP comprises the set of queries that store, update, and retrieve data. Examples would be creating blog entries, then displaying them to website visitors. OLTP queries are engineered to execute quickly, as there is generally a user waiting for the result of the query. Speed is realized through a highly normalized schema. Each row contains a small amount of data, with pointers to related data in other tables. Virtually all websites are OLTP systems.

OLAP queries are geared toward garnering business intelligence out of large quantities of data. OLAP queries process millions upon millions of records generated by individual OLTP queries. A sample OLAP query might be one that answers the question, “How many customers who bought a sale item also bought a nonsale item, broken down by store location and week?” In addition to ad-hoc queries such as this one, nightly or quarterly generated reports are OLAP queries, too, and therefore real-time results are generally not a requirement of an OLAP system.
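
As a rough sketch of the contrast (the schema here is entirely hypothetical), an OLTP query touches a handful of rows by key, while the OLAP question posed above scans and aggregates millions of rows:

-- OLTP: fetch one blog entry for a waiting visitor; answers in milliseconds.
SELECT * FROM entries WHERE id = 42;

-- OLAP: customers who bought both a sale and a nonsale item, by store and week.
SELECT s.location,
       date_trunc('week', o.purchased_at) AS week,
       count(DISTINCT o.customer_id)      AS customers
  FROM orders o
  JOIN stores s ON s.id = o.store_id
 WHERE EXISTS (SELECT 1 FROM order_items i
                WHERE i.order_id = o.id AND i.on_sale)
   AND EXISTS (SELECT 1 FROM order_items i
                WHERE i.order_id = o.id AND NOT i.on_sale)
 GROUP BY s.location, date_trunc('week', o.purchased_at);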

Because OLTP and OLAP queries are so different, it’s not surprising that a database design that is well-suited for OLTP may not be well-suited for OLAP, and vice versa. In fact, as the amount of data in a highly normalized system increases, coupled with increasing complexity of reporting queries, it’s often the case that reporting queries start to take seemingly infinite time, or infinite memory, or both.

In OLAP, the goal is not to have quick inserts, updates, and deletes, but rather to filter, group, and aggregate huge amounts of data based on certain criteria. For this task, highly normalized schemas result in lots of costly joins on massive amounts of data. Denormalizing the schema to avoid most, or even all, of the joins can make OLAP queries complete in a reasonable amount of time.

So a highly normalized database is good for normal site operations, but a denormalized database is good for reporting. Can these two be reconciled? Unfortunately, no, they should not be reconciled within the same database. Denormalizing data in an operational database can quickly lead to bugs (so-called insert, update, and delete anomalies). And staying normalized causes reporting queries to be unreasonably slow, and downright dangerous if they are executed in the same database as your OLTP operational queries.

The proper place to run reporting queries is in a data warehouse. A data warehouse is, in rough terms, a place where all of your historical data resides, in a format that is optimized for complex reporting queries. OLAP systems rely on highly denormalized data, usually arranged in a star or snowflake schema, which speeds the processing of huge amounts of data by eliminating joins over many tables. Here, inconsistencies are not a concern, because your star schema data is generated from your domain-key normal form (DKNF) data, which you bend over backwards to keep accurate.
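
A minimal sketch of what such a star schema might look like for the sales example (all table and column names are hypothetical): one wide fact table keyed by small dimension tables, so reporting queries group and aggregate without chasing joins through a normalized operational schema:

-- Dimension tables: small, descriptive, and freely denormalized.
CREATE TABLE date_dim  (date_id  integer PRIMARY KEY, day date, week date, month date);
CREATE TABLE store_dim (store_id integer PRIMARY KEY, name text, city text, region text);

-- Fact table: one row per line item sold, with the measures stored inline.
CREATE TABLE sales_facts (
  date_id     integer REFERENCES date_dim,
  store_id    integer REFERENCES store_dim,
  customer_id integer,
  on_sale     boolean,
  quantity    integer,
  revenue     numeric(10,2)
);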

Getting data out of your production database and into a data warehouse is not an easy task, though, and it’s hard to convince anyone that you need to spend loads of time building a data warehouse before you have any meaningful data to report on. However, as soon as your website appears to be doing reasonably well with users, it’s time to invest some resources in building a data warehouse. And do it before you build lots of one-off reports that will surely cause the Heisenberg Uncertainty Principle of Website Reporting to set in.

Therefore, while chaining a slave off your database to run reporting queries (an extremely common practice in the MySQL world) seems like low-hanging fruit, it's not the fruit you want. It's more like rotten fruit. The fresh, delicious fruit you want for reporting is a data warehouse.

Unfortunately, building a data warehouse is beyond the scope of this book. There are many books on the topic; a good introduction is The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling by Ralph Kimball (Wiley, 2002). A topic that does fall squarely within the scope of this book, however, is ensuring that your website scales for users, and that is the subject of the next several chapters on schema design.
