Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

Think Data Science – Approach

I go to a lot of computer events and I feel for the organisers when 80 have registered and only about 30 turn up.  I am not sure why the no shows feel it’s OK not to spend a few seconds on Meetup or EventBrite to cancel, as a matter of common courtesy.  Anyway, rant over, but it occurred to me that I could take a quote from Mark Watney in The Martian and “Science the **** out of this”.  So could I build a model to predict the no shows?

Disclaimer: This post is one of a series about data science from someone who started out in BI and now wishes to get into that exciting field. If that’s you, read on; if not, click away now.

Firstly, how does data science apply to this problem?  If we had access to some data about events, including attendance details such as who registered but didn’t show up, then we could possibly make inferences about what might affect attendance and use these characteristics to make predictions about drop out rates at future events.

In the world of statistics and surveys we are using a sample (the historical data) to infer the behaviour of a wider population (people who come to IT community events).  The important thing here is that I have good data, and in this case good might mean that there aren’t invisible (confounding) variables affecting attendance, e.g. road accidents and transport strikes might mean some people who wanted to come found it too hard to travel and went home.  Also, some of my friends in the SQL community reckon that if the weather is good then drop outs will be higher, so if I don’t take sunshine into account then that may also affect the accuracy of any model.  However, while modelling these is fine, we would have trouble predicting travel disruptions, and predicting sunshine is hard as well.

Assuming we have a viable model, it’s then up to us as humans how we use it:

  • We could overbook the event by the amount of predicted dropouts.
  • While it would be hard to ban no shows from future events we could target potential no shows with e-mails to encourage them to come or at least let us know if they can’t make it.  We can then see if this works by only doing this e-mail targeting for some events.
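
The overbooking idea can be sketched in a few lines. This is only an illustration: it assumes the model hands us a single predicted no show rate for the whole event, and the function name is made up for this post.

```python
import math


def registrations_to_accept(capacity: int, predicted_no_show_rate: float) -> int:
    """Number of registrations whose expected attendance just fills the venue.

    Assumes one overall no show rate for the event (a simplification).
    """
    if not 0 <= predicted_no_show_rate < 1:
        raise ValueError("no show rate must be at least 0 and below 1")
    # Expected attendance = registrations * (1 - no show rate),
    # so divide the capacity by the expected show-up rate and round down.
    return math.floor(capacity / (1 - predicted_no_show_rate))


# The 80-registered, 30-attended case from the intro is a no show rate of 0.625.
print(registrations_to_accept(capacity=30, predicted_no_show_rate=0.625))
```

Rounding down keeps the expected attendance at or below capacity; an organiser who would rather risk a full room than empty seats could round up instead.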

So that’s the basic idea. To get started we need data, but which data? That is, what factors might affect non attendance, and is that data available?

Let’s look at the factors for the event itself:

  • Time of day – is it an evening or day event?
  • Day of the week – At Microsoft we reckon that you, our audience, won’t be as likely to come to events on Mondays or Fridays, and then there are weekend events.
  • Location? Is that a factor or are there factors about the venue such as parking and being near a train station?
  • What technology is being covered?
  • Does the headline speaker matter? For example at AzureCraft we had ScottGu over and certainly this led to more registrations, but was the dropout rate lower?
  • And, as I said earlier, was it sunny?

Then what about the delegate information? 

  • I don’t think someone’s name or address is important, however their proximity to the venue might be. So perhaps we could still get some location insights but anonymise the address by working out the distance to the event, or better yet the journey time to the event, and plug that in.
  • The vertical they work in, as marketeers call it, e.g. finance, public sector (local, central, NHS etc.), retail and so on, might be interesting to look at.
  • What about the role of the delegate, e.g. developer, DBA, infrastructure specialist, architect etc.?
  • Are they regular followers?
  • Whether they attended or not
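
As a sketch of the anonymisation idea in the first bullet, the address can be swapped for a straight-line distance to the venue using the haversine formula. The coordinates below are illustrative only, and real journey time would need a routing service, which is not shown here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance in km between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical delegate in central London, hypothetical venue in Reading.
delegate = (51.5074, -0.1278)
venue = (51.4543, -0.9781)
print(round(distance_km(*delegate, *venue), 1), "km")
```

Only the resulting distance would be stored as a feature, so the delegate's address never needs to go into the modelling data set.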

To put this last statement into data science speak, this is supervised learning – the model being built is based on example data where the answer is in the data we are using, which in this case is the attendance of a registered delegate.  There are only two values in the answers, “attended” or “did not attend”, so this is known as a two class or binary classification problem.  There are several completely different techniques for modelling this, and choosing the right one is partly about understanding what success is and partly about knowing certain characteristics of the data.
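
To make binary classification concrete, here is a minimal sketch using logistic regression, which is just one of those techniques and not necessarily the one I would end up choosing. The data is entirely made up (journey time in hours and a Monday/Friday flag), and it is written in plain Python so nothing beyond the standard library is assumed.

```python
import math

# Made-up training data: [journey_hours, is_monday_or_friday]
ROWS = [[0.2, 0], [0.3, 0], [0.5, 1], [0.4, 0],
        [1.5, 1], [1.2, 1], [1.8, 0], [2.0, 1]]
# Labels: 1 = attended, 0 = did not attend
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def predict_proba(weights, features):
    """Probability of attending; weights[0] is the bias term."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit the weights with plain batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict_proba(weights, features) - label
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


weights = train_logistic(ROWS, LABELS)
for delegate in ([0.2, 0], [2.0, 1]):
    label = "attended" if predict_proba(weights, delegate) >= 0.5 else "did not attend"
    print(delegate, "->", label)
```

The model outputs a probability, and the two class answer comes from applying a cut-off (0.5 here) to it.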

Success in machine learning might mean going for accuracy at the expense of time to compute, or the other way around.  Some techniques/algorithms like neural networks take time to train and are resource intensive, but are generally more accurate.  It’s also important to understand what data we are dealing with, and there are three things to be aware of:

Features & Labels

You may have heard the term feature engineering, and that is the business of having the right attributes/columns/variables to use to predict an outcome.  For this example the features are the list above of things we know about the event and delegate.  In supervised learning we then use these features to predict a label, which in our case is the column/variable recording whether the delegate turned up or not.

Data Types

I have already mentioned that the label in this case is one of two categories (attended or did not attend) and so the label holds categorical data.  There are other features described above that are also categorical, like the role the delegate has, the location of the event and anything else that is a string.  However numbers can also be categorical, for example my staff ID at Microsoft or my mobile phone number, as these are just lookup codes to identify me or some object. In other words we can’t apply maths to them, e.g. average, min, max etc.  On the other hand quantitative data are numbers where we can do this, like my age or the journey time for a delegate to get to an event.  Finally there is also ordinal data like education level, where BSc, MSc and PhD are levels where one is more advanced than the others.  Sometimes we can make quantitative data ordinal by creating buckets and putting values in them, like the age ranges you often see in surveys (under 18, 18-25, 25-40, 40+ etc.)
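
Here is a small sketch of how the three kinds of data might be prepared for a model. The role list and the bucket edges are assumptions for illustration, and I have made the age ranges non-overlapping so that every age lands in exactly one bucket.

```python
from bisect import bisect_right

# Hypothetical categories and bucket edges, for illustration only.
ROLES = ["developer", "dba", "infrastructure", "architect"]
AGE_EDGES = [18, 25, 40]
AGE_LABELS = ["under 18", "18-24", "25-39", "40+"]


def one_hot(value, categories):
    """Categorical -> one 0/1 column per category, with no order implied."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def age_bucket(age):
    """Quantitative -> ordinal: return (rank, label) for the bucket holding age."""
    rank = bisect_right(AGE_EDGES, age)
    return rank, AGE_LABELS[rank]


print(one_hot("dba", ROLES))   # categorical
print(age_bucket(31))          # quantitative turned into ordinal
```

The rank keeps the ordering (a higher rank means an older bucket), which is exactly what a one-hot encoding throws away.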

Data Understanding & Visualisation

Then we need to understand how the data is distributed compared to the thing I am looking to predict and across the other features.  Data scientists often want to plot data to do this, like this box plot showing how inferences can be made about the speed of light from the famous Michelson-Morley experiment.

Here the box shows the middle two quartiles of the data, split by the median; the “T” and upside down “T” show the minimum and maximum, and the dots in columns 1 and 3 show outliers excluded from this analysis.
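
The numbers behind a box plot can be worked out without drawing anything. This sketch assumes the common 1.5 × IQR rule for deciding what counts as an outlier (the plot described above may have used a different rule), and the journey times are made up.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends and outliers, using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inliers),    # the upside down "T"
        "whisker_high": max(inliers),   # the "T"
        "outliers": outliers,           # the dots
    }


# Made-up journey times in minutes, with one delegate travelling four hours.
journey_minutes = [15, 20, 25, 30, 35, 40, 45, 50, 55, 240]
print(box_plot_summary(journey_minutes))
```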

Plots like this and building up layers of charts are one of the great strengths of R or Pandas in Python as well as the rich ecosystem of statistical algorithms they have.  They inform the feature selection and engineering as well as evaluation of how well the model is performing. 

Data Wrangling

The whole process of understanding the data and feature engineering is known as data wrangling.  If, like me, you are from a BI background, you may think that Excel and technologies like Integration Services in SQL Server are the tools for this, but data wrangling goes beyond this as it’s as much about understanding the data as transforming it.  So I have seen people using Azure Machine Learning just as a data wrangling tool, as it puts a public web front end at scale over their Python & R scripts and lets them share the notebooks containing their work.  My point is that not only do we need to manipulate the data, we also need to use statistical techniques to analyse it, from simple plotting of logs to look for outliers to common techniques like Chi Squared, Pearson, Spearman, Kendall Tau etc., depending on the problem space.
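
As one example of those techniques, a Chi Squared test can check whether a categorical feature such as the day of the week appears related to attendance. The counts below are invented, and this sketch only computes the test statistic (without a continuity correction), not a p-value.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if the two variables were independent.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts: rows are Monday/Friday vs midweek events,
# columns are attended vs did not attend.
observed = [[20, 30],
            [30, 20]]
print(chi_squared_statistic(observed))
```

For a 2 by 2 table there is one degree of freedom, so a statistic above roughly 3.84 would suggest a relationship at the usual 5% significance level.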

Hopefully that’s given you some ideas on how all this works in a simple scenario.  However, that simple scenario with different features could be used to look at patients not turning up for appointments with their GP, truancy in children and that holy grail of marketeers, customer churn.  If you want to know more then there is a degree course for this, and currently my favourite part of that is module 4 on statistics from Columbia University.  The courses are free to evaluate and $49 per course if you need a certificate.