SQL Server for Data Mining

SQL Server 2005/2008 has a number of data mining algorithms built in to discover hidden relationships in your data or to use a set of attributes to predict the value of other attributes:

  • Microsoft Association Algorithm. Used to work out which items are most likely to appear together in a transaction, so this is the classic basket analysis you might see on Amazon (customers who bought this all bought..).
  • Microsoft Clustering Algorithm.  Groups similar data items together. For example in banking there are lemons (poor customers) and cherries (good ones) and this algorithm will do that segmentation.
  • Microsoft Decision Trees Algorithm. used to predict the value of a discrete (as opposed to a continuous) item e.g. whether a customer will buy a new car might or not.
  • Microsoft Naive Bayes Algorithm Also predicts the value of an item but this works for continuous values such as the probability of a customer making a purchase. The naive indicates that there is no interdependency between the attributes used to predict he outcome e.g. gender is not related to age.
  • Microsoft Neural Network Algorithm (SSAS) . This is supposed to mimic the way the brain works and is the scariest algorithm because there is no way to see how the answer was arrived at, and is used complex situations such as stock market prediction.
  • Microsoft Sequence Clustering Algorithm This finds similar instance of events where they have similar sequences of steps. A good example is the analysis of navigation through a web site (click-stream analysis)
  • Microsoft Time Series Algorithm. Used to predict the future e.g. future sales, profit etc.
  • Microsoft Linear Regression Algorithm Plots an ideal line to best represent a series of points to establish the relationship between two continuous variables e.g. age and salary
  • Microsoft Logistic Regression Algorithm Works in a similar fashion to linear regression but uses a curved line where the predicted state is one of a discrete set of possibilities.

There is a great section on MSDN on when to use which one here.

One of the best ways to get your head around this is to come to an event like this joint Microsoft / Hitachi TechNet session by Rafal Lukawiecki of Project Botichelli.

 

Technorati Tags: Data Mining,SQL Server