Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

So you want to be a Data Scientist

In the dark days of the last millennium, data scientists were serious statisticians who used exotic hardware and expensive software, and who were largely isolated from the rest of the organisations they worked for.  Today the cloud and open source languages like R and Python have made this technology available to anyone curious enough to be interested in it.

So what do you need to get started? – My Top Six would be:

Curiosity

Are you the sort of person who is always challenging why things are the way they are in your organisation instead of taking matters on trust?  Are you keen on experimenting with new technologies? Are you interested in the art of the possible? If the answer is yes then data science might be for you, if you are also...

Evidence Based

After a presentation I gave, I got a great e-mail from a lady called Charisma from Ghana: “In God we trust, everyone else bring data”. Data scientists need to be scientists, using their creativity and curiosity to come up with hypotheses and then prove them.

Statistics

Yes you do need to understand the basics of statistics, but with the amazing tooling out there and the great online resources for learning like EdX and Coursera you can acquire the principles and then quickly apply these to your problem domain.

Data Skills

Hopefully you are reading this because you are some sort of data professional e.g. from a BI or data background.  You know about joins, you get data quality and worry about missing data, cartesian products, and how annoying working with dates can be.

Business knowledge

Hopefully you understand the domain you are working in and what you do is closely aligned to that.  This is your sanity check: when you find a cause and effect from a piece of analysis, it does actually make sense and is not merely a couple of random statistics that have the same curve, e.g. per capita cheese consumption is not really related to deaths by getting entangled in the bed sheets, as per Spurious Correlations by Tyler Vigen:

image
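To see how easily two unrelated series can line up, here is a minimal Python sketch. The figures below are invented purely for illustration (they are not Tyler Vigen's data), and the helper name pearson is my own. Any two series that both drift upwards over time will produce a high correlation coefficient, whether or not one has anything to do with the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented yearly figures, for illustration only: both simply trend upwards.
cheese_kg_per_head = [13.0, 13.4, 13.9, 14.1, 14.6, 15.0]
bedsheet_incidents = [320, 345, 360, 390, 410, 440]

r = pearson(cheese_kg_per_head, bedsheet_incidents)
print(f"Pearson r = {r:.3f}")
```

The coefficient comes out very close to 1, and that is exactly the point: the number on its own says nothing about cause, which is where your business knowledge has to step in.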

Ethics

Just because we can doesn’t mean we should.  In the work I do I am trying to make a difference, so my data science dojo is a hack with a charity trying to improve a given situation, be that environmental issues, saving lives or at least improving the quality of life.  I also mention this because computers don’t have ethics per se, any more than a newborn baby does; they are imbued with behaviour patterns by humans.  My litmus test is: would my reputation improve if my use of analytics and big data was made public?

Is data science for you?

I mention all this because demand for data scientists is outstripping supply, and while many people like me are working with academia to drive and promote the next generation of data scientists, we have a huge hole today.  Now I am not going to suggest you can just mug up on some of this stuff and put data scientist on your CV. Rather I am suggesting there are places in the data science world for people like me who have come from a data/BI background, as many of our skills are transferable.  For example our knowledge of the business and our data skills are still very valuable.

However we are likely to be light on statistics, and even if we studied it to some extent we have forgotten a lot of it, and what matters now is applying those statistical theories to our data.  For example, which algorithm should I use to select which attributes of a patient and their doctor's appointment are most likely to influence whether they attend that appointment?  In Azure Machine Learning (MAML), if you use the filter based feature selection module to do this, you are presented with seven options such as Chi Squared, Spearman, Pearson, Fisher and Kendall, but which to use?  The answer is a post in itself, and the good thing about data science is that there are lots of resources online.  The bad thing about the online content is that while much of it is written in English, it is English-Stats not English-GB, so it can be very hard to decode as so much prior knowledge of stats is needed.  One thing I use a lot is this article on which algorithm to use for what in MAML, and even if you are in R or Python or some other technology this is still useful.
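To make the idea of filter based feature selection a bit more concrete, here is a rough Python sketch: score each attribute against the outcome on its own, then rank the attributes by that score. This is not how the MAML module is implemented; the appointment records, column names and helper functions are all made up for illustration. It compares two of the scoring options mentioned above, Pearson (linear relationship) and Spearman (Pearson applied to ranks, so it picks up any monotonic relationship).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical appointment records, invented for this sketch.
appointments = {
    "lead_time_days": [1, 2, 3, 5, 8, 13, 21, 34],
    "age": [25, 61, 34, 47, 52, 29, 70, 38],
    "sms_reminder": [1, 1, 0, 1, 0, 1, 0, 0],
}
attended = [1, 1, 1, 1, 0, 1, 0, 0]

# Score every attribute against the outcome, ignoring the sign.
scores = {
    name: (abs(pearson(col, attended)), abs(spearman(col, attended)))
    for name, col in appointments.items()
}

# Rank attributes by their Spearman score, strongest first.
ranked = sorted(scores, key=lambda name: scores[name][1], reverse=True)
for name in ranked:
    p, s = scores[name]
    print(f"{name:15s} pearson={p:.3f} spearman={s:.3f}")
```

Even on toy data like this the two measures give different scores for the same attribute, which is why the choice of method matters. A filter like this also looks at one attribute at a time, so it will not spot attributes that only matter in combination.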

So a question – Do you want to get into data science?

To test this hypothesis for yourself, I would encourage you to look at the Microsoft data science degree course on EdX.  You can do this for free to see if you like this stuff, and if you like what you are seeing then you can actually get a degree.  To do that you have to pay for a verified certificate for each of the ten modules you must do, and there are options at various stages.

image

The last module is an actual project where you apply what you have learnt, and on a side note my good friend Amy Nicholson (@amykatenicho) has just made the module on Developing Intelligent Applications so watch out for tough questions on this!

So let’s do some science: see if the course helps, and feed back any comments you have as you learn.

Note: none of these degree courses is going to get deep on the stats I have been talking about, but there are courses out there for that too.  My own journey has been greatly helped by Prof Andy Field and his book on Learning statistics using R.  He’s also all over YouTube.

image

You have been warned!