Many of us have flown, and many of us have been held up by a late flight, which is annoying for us, expensive for the airline, and a potential safety issue for air traffic controllers juggling landing slots for aircraft.
So what causes flights to be late? To answer that we might take a look at stats on flights that are late compared to ones that arrive on time. As IT guys it might not be our place to answer that; rather, we would gather the data for a business analyst to explore what’s going on using tools like analytic cubes and reports. Over coffee, our business analyst reckons that we need historical flight data and, if possible, historical weather data.
The American Federal Aviation Administration (FAA) do make the necessary historical flight delay data available for free. Historical weather data is not free, although the FAA do have a free service to get the weather data for any airport worldwide (more on this in an old post of mine). So what we can do is join the weather data into the flight data twice: once for the weather at the originating airport at the time the flight departed, and once for the weather at the destination airport at the time it arrived. My business analyst can now play away to her heart’s content and produce things like this (here I have used the new Power BI)..
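As an aside for those of us doing the data preparation, the double weather join described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flight rows, the weather readings, the column names and the helper functions (weather_at, join_weather) are all made up for the example and are not the FAA's actual schema, and it assumes one weather reading per airport per hour with flight and weather times in the same time zone.

```python
from datetime import datetime

# Hypothetical sample rows; the column names are illustrative, not the FAA's.
flights = [
    {"flight": "AA100", "origin": "ORD", "dest": "JFK",
     "departed": datetime(2012, 1, 5, 18, 10),
     "arrived": datetime(2012, 1, 5, 21, 25), "delay_mins": 42},
    {"flight": "UA200", "origin": "JFK", "dest": "ORD",
     "departed": datetime(2012, 1, 5, 9, 40),
     "arrived": datetime(2012, 1, 5, 11, 5), "delay_mins": 0},
]

# Hypothetical weather readings keyed by (airport, hour of the reading).
weather = {
    ("ORD", datetime(2012, 1, 5, 18)): {"wind_kts": 28, "visibility_mi": 2.0},
    ("JFK", datetime(2012, 1, 5, 21)): {"wind_kts": 12, "visibility_mi": 10.0},
    ("JFK", datetime(2012, 1, 5, 9)): {"wind_kts": 8, "visibility_mi": 10.0},
}


def weather_at(airport, when):
    """Return the reading for the hour containing `when`, or {} if none."""
    hour = when.replace(minute=0, second=0, microsecond=0)
    return weather.get((airport, hour), {})


def join_weather(flight):
    """Copy the flight row and add origin_* and dest_* weather columns."""
    row = dict(flight)
    for prefix, airport, when in (
        ("origin", flight["origin"], flight["departed"]),  # first join
        ("dest", flight["dest"], flight["arrived"]),       # second join
    ):
        for name, value in weather_at(airport, when).items():
            row[f"{prefix}_{name}"] = value
    return row


# Behaves like a left join: flights with no matching reading are kept,
# they simply lack the corresponding weather columns.
joined = [join_weather(f) for f in flights]

for row in joined:
    print(row)
```

The same weather table is looked up twice per flight with different keys, which is why the weather columns need a prefix to keep the origin and destination readings apart. At the volumes discussed later in this post you would do this in a database or a data-frame library rather than in plain Python lists.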
From here we can see which airlines have the most delays, where delays build up (Chicago is the worst, possibly because it’s the biggest and near capacity), and what time of day most delays occur (18:00). If I select Chicago I can see the impact on the other factors like this..
And on a separate page there is some weather-related analysis..
But these don’t really help at all, as there’s no obvious correlation, and our poor analyst could spend hours creating reports and cutting the data to try and work out why flights are late. The problem is that a tool like Power BI isn’t really designed for this sort of problem, where there are potentially lots of factors affecting an outcome. Even if Power BI could do this, what we really want to do is predict that flights will be late before they have landed.
A data scientist would probably laugh at all of this and fire up a tool like SAS, SPSS or MATLAB, or use R in Revolution or RStudio, to look at the statistical relationships between the variables and work out an algorithm that would make the prediction. That’s fine if you have a statistics background, but what about the rest of us?
We could use the little-known data mining add-ins in Excel, but I doubt many of you even know about them, or that you need a SQL Server Analysis Services instance running somewhere to do the work. In any case that’s only good for a million rows, and in this example I have 2.9 million rows of flights just for one quarter of 2012. The other problem we might have is that any deep analysis of this data may require some serious computing power and/or some expensive software. To compound this, we don’t really have enough information to work out the benefits of this work, so justifying resources for it is going to be difficult.
It’s for this reason that Azure Machine Learning (ML) is going to be really important. The first thing about this new technology is that it runs on Azure, which means it’s pay per use, there’s no software to install, and like any cloud service it scales to meet demand. Just as importantly, Azure ML is very agile: it has a simple UI based on connecting up various modules to do the prediction using industry-leading algorithms. The final bit of secret sauce in Azure ML is the way it allows you to publish a model (an experiment in ML speak) as a pair of APIs (one for batch, one for transactions) that can be consumed by other services.
However, even if we do all of this work (I have a more detailed post on this and a lab guide around this scenario here), our prediction might be far from perfect for the simple reason that we have the wrong weather. Aircraft fly at about 35,000-40,000 feet, but our weather readings are taken at ground level at the airports. So what we really want is the weather at altitude, and this might be achieved by getting headwind readings from aircraft already in flight and using them to make predictions for upcoming flights.
My point here is that what can seem like really hard problems can be solved by expanding our thinking and by being aware of disruptive technologies like Azure ML and the Internet of Things (or aircraft). That’s why I am passionate about continuing professional development (CPD), which is mandatory in other professions like law and accountancy and needs to be for us too.
Note: For your latest data-orientated CPD you could do worse than come to a Data Culture Series event.