Full text search has been around in SQL Server since the 2000 edition and has gradually evolved in each subsequent release and it is included in SQL Server express so you don’t even have to buy it! In CTP6 (February) the latest revision appeared, The basic difference is that there is no longer a separate service and all of the indexes etc. are included in the database. This has two important advantages:
- Queries that combine a full text clause with a normal SQL where clause will be much quicker as the optimizer can work out a unified plan for the whole statement.
- Backups will include the all of the full text stuff which gets rid of integrity issues ands makes recovery faster.
Indexing a big lump of unicode text in a column on a table requires little effort, but the real power of this feature is its ability to index various types of document stored in the database in varbinary(max) columns. For this to work the engine has to understand the type of document that’s in there and to match this to a set of filters. There are in built filters for office documents PDF’s etc. However When SQL Server 2005 came out it came with a base set of these filters for the products around at the time such as Office 2003. Since then Office 2007 has shipped and we have the new file types such as docx for word and xlsx for excel etc. The good news is that you can add to the set of filters and the new office 2007 filters are here. The not so good news is that if you’re working on SQL Server 2008 you’ll need to do the same thing.
How does all this work? Let’s have a look at the Adventure Works Production.document table as installed (note I am using SQL Server 2008):
In this table the document itself is in the [document] column and its type is in the aptly named [FileExtension] column. Because the table has not been setup for full text searching all of the options for it are greyed out in the table designer. To fix this right click on the table and start the wizard:
and select design full text index. The wizard will take you though what you need to do. For this demo I:
- Set the primary key to the primary key,
- Selected only the [Document] column to be indexed , set the language to English, and set the type column to point to the [FileExtension column].
- Didn’t use a stop word list. A stop word list is a list of words that are so common that they aren’t worth indexing as the resultant index will be huge. Examples of stop words in English are ‘it’, ‘and’, ‘the’ and especially ‘so’ if you work at Microsoft!
- Created a new full text catalog ‘demo’.
- Set the change tracking to automatic i.e. any change to the [document] will automatically update the full text index. which maybe something you don’t want to do in production as this can slow down the server for large indexes.
So now the index is ready to use and in my next post I’ll show you how to use it. In the meantime there are lots of resources on this including: