Slight Ado About Nothing: More On NULLs

(updated 08 February 2006 to correct glaring technical inaccuracies; thanks to Adam Machanic)

Today saw a major step in the evolution of my humble blog: a question! It comes from Eric:

I have seen others use the NOT NULL DEFAULT "" and insist it is better than NULLs. One of the main reasons they give is that a NULL value cannot be indexed, and therefore slows down searches. Any thoughts on this?

I've been using SQL Server since v4.1, and I used Sybase for a couple of years before that back in the days when the code bases were the same. I've also used DB2, Informix, Oracle, and a host of single-user and long-dormant mini-computer databases. I've seen lots of people use the NOT NULL DEFAULT "" approach. I believe it has its origins in implementations of DBMS software where the concept of NULLity was ambiguously defined and/or poorly implemented (I recall one DBMS in the early 1990s where NULLs matched in equality tests! Ugh!). In such situations, database developers, always a resourceful lot, worked their way around the issue.

As a workaround to a sub-par implementation of NULLity, NOT NULL DEFAULT "" is tough to beat. I have worked in more than one implementation that was built that way. They all struck me as reliable, and majorly slow. Trading speed for reliability is something database developers do all the time. It's getting speed back without sacrificing reliability that's precious to a good developer.

In my experience, NULLity in SQL Server 2000 and SQL Server 2005 "works". That is, it has a clear design which is both reasonable and well implemented: NULL is the absence of information. There's no longer a limitation left for the workaround to work around. I would further argue that judiciously using NULLs will make your system perform better. Here's why.
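To make "NULL is the absence of information" concrete, here's a quick sketch of the three-valued logic behind it. Nothing here depends on any particular table, and it assumes the default ANSI_NULLS ON behavior:

```sql
-- With ANSI_NULLS ON (the default), a NULL never matches an equality
-- test -- not even against another NULL -- because the comparison
-- evaluates to UNKNOWN rather than TRUE.
SELECT CASE WHEN NULL = NULL THEN 'matched' ELSE 'no match' END;
-- returns 'no match'

-- IS NULL is the test that actually asks "is this value absent?"
SELECT CASE WHEN NULL IS NULL THEN 'absent' ELSE 'present' END;
-- returns 'absent'
```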

There's a misconception in your question: NULLs are indexed, but we need to account for them in our query syntax or else we'll miss those records. Records with NULLs fall out of JOINs, not out of indexes. We must write code that accounts for this possibility.
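Here's a minimal sketch of what "accounting for them in our query syntax" looks like. The Employees and Departments tables are hypothetical, with a nullable Employees.DepartmentID:

```sql
-- An inner join on a nullable column silently drops the rows where
-- that column is NULL -- they fall out of the JOIN, not the index.
SELECT e.EmployeeID, d.DepartmentName
FROM Employees AS e
JOIN Departments AS d ON d.DepartmentID = e.DepartmentID;

-- To keep those rows in the result, account for the NULLs explicitly:
SELECT e.EmployeeID, d.DepartmentName
FROM Employees AS e
LEFT JOIN Departments AS d ON d.DepartmentID = e.DepartmentID;

-- And to find the employees with no department at all:
SELECT e.EmployeeID
FROM Employees AS e
WHERE e.DepartmentID IS NULL;
```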

However, I find that trait enhances the "beauty" to which I referred in the original post. Indexes are primarily used to speed record selection or joins among tables; in those scenarios we're interested in data that's present rather than data that's absent. Will we ever want to find the rows holding the default value? Will we ever want to include them in a join to other tables that also hold empty strings as their defaults?

I don't think so, because I can't imagine a scenario where "" is a useful piece of information in a column. That's why you usually see a WHERE clause designed to exclude it. Is joining them and then discarding them more efficient than ignoring them in the first place? I don't see how.

My contention is that having NULLs fall out of JOINs is a good thing -- why clutter a return of data with pointers to a lack of data? As a practical matter, once a query returns more than a certain amount of the data in a table (I had reported this originally as 20% of the data in a table; in SQL Server 2005 the figure is much smaller), the optimizer will select a table scan and the whole issue of what's in the index will become moot. Not only that, but we don't have to worry about excluding the blank strings with a WHERE clause. Since they're NULLs now, SQL Server knows we don't know what the data is, so those records fall out of the join as we intended.
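To put the two designs side by side, here's a sketch using the same hypothetical Employees and Departments tables, this time imagining DepartmentID as a character key:

```sql
-- With NOT NULL DEFAULT '', the "no department" rows carry a real
-- (if meaningless) value, so we have to remember to filter them out:
SELECT e.EmployeeID, d.DepartmentName
FROM Employees AS e
JOIN Departments AS d ON d.DepartmentID = e.DepartmentID
WHERE e.DepartmentID <> '';

-- With NULLs, no extra predicate is needed: the unknown keys can
-- never satisfy the join condition, so they fall out on their own.
SELECT e.EmployeeID, d.DepartmentName
FROM Employees AS e
JOIN Departments AS d ON d.DepartmentID = e.DepartmentID;
```

Forget the WHERE clause in the first version and the empty-string rows join right in, quietly polluting the result set; the NULL version can't make that mistake.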

So, I don't think NULLs slow down searches. Using NULLity in such cases has the added bonus of reflecting reality in that we're not fabricating an arbitrary default to satisfy a NOT NULL clause when we don't know the data. Seems pretty sweet to me all the way around.

Thanks very much for your question, Eric. I'd be very interested in your thoughts.

-wp