Microsoft SQL Server 2005 fuzzy logic in matching and grouping - how does it work in SQL Server Integration Services?

Following on from a presentation 'First Look at SQL Server Integeration Services' to some MVPs here at TVP Reading, I am responding to the a question I was asked, 'how does SQL Server's fuzzy logic work'.  I was meant to explain how the confidence and similarity values are calculated - I had no idea, so here's the results of some research I've done.

Thanks to Donald Farmer I can point you to:
Technical paper detailing the matching technology used by Fuzzy Lookup
Robust and Efficient Fuzzy Match for Online Data Cleaning. ACM SIGMOD 2003, San Diego. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani.

ftp://ftp.research.microsoft.com/users/DataCleaning/sig03_FM.pdf

 

Thanks to Grant Dickinson, he adds:
I believe this is the algorithm used:

https://www.vldb.org/conf/2002/S17P01.pdf

 

Here's some info on the researcher:

https://research.microsoft.com/%7Esurajitc/

 

A lot of this looks incomprehensible, so to those without a degree in Maths I’ve also included a piece from Books On Line

 

Controlling Fuzzy Matching Behavior

The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared against each other.

 

The Fuzzy Lookup transformation includes a property that specifies the maximum number of matches that the transformation can return. The transformation returns zero or more matches up to the number of matches specified. Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to return to a value greater than 1, the output of the transformation may include more than one row per lookup and the some of the rows may be duplicates.

 

Each match includes a similarity score and a confidence score. The similarity score is a mathematical measure of the difference between the input record and the record that the reference table returns. The confidence score is a measure of how likely it is that a given value is the best match among the matches found in the reference table. The confidence score assigned to a record depends on the other matching records returned. For example, matching St. and Saint returns a low similarity score regardless of other matches. If Saint is the only match returned, the confidence score is high. If both Saint and St. appear in the reference table, the confidence in St. is high and the confidence in Saint is low. However, high similarity may not mean high confidence. For example, if you are looking up the value Chapter 4, the returned results Chapter 1, Chapter 2, and Chapter 3 have a high similarity score but a low confidence score because it is unclear which of the results is the best match.

 

The similarity score is represented by a decimal value between 0 and 1, where a similarity score of 1 means an exact match between the value in the input and reference table columns. The confidence score, also a decimal value between 0 and 1, indicates the confidence in the match.. If no suitable match is found, similarity and confidence scores of 0 are assigned to the row and the output columns copied from the reference table will contain null values.

 

In some cases, Fuzzy Lookup may not locate appropriate matches in the reference table. This can occur if the input value that is used in a lookup is a single, short word. For example, the input value "helo" is not matched with the value "hello" in a reference table when no other tokens are present in that column or any other column in the row.

 

The transformation output columns contain the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:

 

_Similarity, a column that describes the similarity between values in the input and reference columns.

 

_Confidence, a column that describes the quality of the match.

 

The transformation uses the connection to the SQL Server 2005 database to create the temporary tables that the fuzzy matching algorithm uses.