Database Design: Do Not Fear NULLs

Last week, I found myself simultaneously confounded and fascinated by a conversation on the internal Microsoft SQL Server discussion list. If you know me at all, you know that I had to toss in my two cents, but I’m getting ahead of myself.

 

The thread started with a question from an application development consultant in the field: which is better to hold a blank value: NULL or an empty string? Before I had a chance to chime in, a couple of other folks responded. One wrote “avoid NULLs if you can”, and the other said, “IMHO, NULLs are way overused.”

 

Well, I had to take a contrarian view. NULLity is good. In fact, NULLity is way good. If one understands it and how it works in the database, I don’t think it can responsibly be “avoided,” and if one is following sound database design practices, I don’t think it can be “overused.”

 

Later conversation raised the legitimate point that there are databases out in the world that were designed by people who perhaps didn’t possess the requisite skills for such an undertaking. A lot of folks are dealing with poor designs that they’ve inherited and can’t change. In this sense, at least, I’m fortunate to work in an ivory tower: there’s nobody around me who’s designing databases who shouldn’t be running with that particular pair of scissors. To use another “sharp object” metaphor, NULLity is a tool, like a chainsaw is a tool. Use a chainsaw correctly, and you can heat your house for the winter. Use a chainsaw incorrectly, and you won’t be able to walk to the woods to cut down your trees.

 

So, in the interest of promoting the responsible use of chainsaws.. I mean, NULLs.. I humbly present a brief discourse on what NULLity means, how it works in the database, how to use it, and examples of its misuse. This post is a refactoring of what I wrote to the SQL Server discussion list on the topic.

 

NULL means “unknown”. When designing a table, one of the properties we set on each column is whether this is a legal response. If we MUST have an answer (for either business rule or data integrity reasons, for example), then we set the column to NOT NULL -- we’re saying “I don’t know” is not a workable response for this column.

 

When designing a table, I apply the following maxim when setting the NULLity characteristic on a column: if there’s a possibility that we’re not going to receive user input for a value that the user knows and we don’t, then we must allow NULLs (this falls under the general category of “respecting our ignorance”). It’s when NOT NULL is erroneously applied to columns that “workarounds” are applied to the data (“corruptions” might be a more accurate description), most likely to harmful effect.

 

Let’s run through a couple of examples.

 

Consider an nvarchar(50) column, CityOfBirth. The intent of the column is to hold the name of a person’s birth city (“Torrance”, “Toronto”, “Bradenton”, etc.). There are times, though, when we might not know this information. What do we put in the column then? My answer would be NULL, of course, because we don’t know. But let’s imagine for a minute that CityOfBirth has been defined as NOT NULL. We MUST supply an answer. This is where the blank string comes in, and those unfamiliar with the tenets of sound database design might think it’s no big deal.

 

Well, I think it’s a big deal.

 

The great thing about NULLs is, since they symbolize the absence of data, they fall out of JOINs -- if a value is unknown to us, then it can’t be matched to any other value, can it? If we select two random people and we don’t know where they were born, does that mean that they were born in the same place? The blank string in the database says “yes”; the NULL says “we don’t know because we don’t have data”.

 

So, a blank string is data -- singularly unenlightening data, yes, but still data nonetheless. More significantly, since blank strings are data just like “Torrance” and “Toronto” and “Bradenton” are data, they won’t fall out of JOINs. As far as the database is concerned, two blank strings match; two NULLs do not. Never mind “good” or “way good” -- that’s why NULLs are beautiful.

 

“But blank strings are still no big deal,” you say. Okay, let’s try another column, a datetime column DateOfBirth, defined as NOT NULL. If the customer doesn’t wish to share this information (my mother was well known for this), what’s a reasonable “unknown” value now? May 11, 1992? June 17, 1927? March 16, 1957? February 29, 2032? Whatever we put in this column, we’re corrupting our data just as we did with the blank string for CityOfBirth -- it’s maybe just a bit more obvious now. If we allow NULLs into the column, though, the whole issue goes away, because we have a reasonable representation for “we don’t know”.

 

NOT NULL columns and default values are great and useful things, as long as the default value is a reasonable member of the range of values for the column (a default value says, “if the user doesn’t nominate a value, we’re going to assume she wants the one we’re going to provide.”). There are places where this is very sound practice (GETUTCDATE() is a fine default value for a NOT NULL column called DateCreated, for example), and there are places where it’s not (“Torrance”, “Toronto”, “Bradenton”, a blank string, or anything else, for CityOfBirth; any date for DateOfBirth).

 

The only reason I can imagine for having a blank string as a default on a character column like CityOfBirth is that we’ve misapplied the NULLity property. If we’re in any way “hacking” or “working around” the NULLity property on a column in this fashion -- if the default value is not a reasonable member of the range of data for the column, and fairly likely to be the proper and correct entry -- chances are very good that NULLity is not set correctly.

 

For each column in each table we’re designing, there will be exactly one unique correct answer to the question, “should we allow NULLs in this column?” If we’re designing databases, one of our jobs is to find all of those answers. Check your requirements carefully, and remember my maxim: if there’s a possibility that we’re not going to receive user input for a value that the user knows and we don’t, then we must allow NULLs. When in doubt, use the Cliff Notes version: we’re here to store the user’s data, not make it up.

 

Got a different opinion? Please leave me a comment!

 

     -wp