This blog post was created collaboratively by the winning team in the Data Visualization category of the Big Data Hackathon (Vancouver, April 18-19, 2015).
As upper-level students finishing our Computer Science degrees, we felt the Big Data Hackathon was an opportunity to put our learning into practice. Each of us came from a different mix of coursework, and only some of us had taken a Data Mining class. With that preface, we had no expectation of winning; we simply aimed to do our best and quickly apply our collective knowledge to build a product within the limited time frame.
On the day of the competition, we merged two teams to form a team of six, an unprecedented size. Given our numbers, we split into pairs, each responsible for a cohesive part of the topic we had settled on: visualizing crime data in Vancouver. We chose that topic because of the recent shootings around Metro Vancouver; we wanted to see whether we could identify and predict which neighbourhoods were safe, and whether property values were related to crime.
Our data came directly from the City of Vancouver's open data. We originally planned to analyze more cities, but data availability was an issue: not every municipality releases data, and those that do may not publish comparable datasets that we could contrast against one another.
Manipulating the data
Because the property values are reported per household and the crimes are reported by address block, we had to find a way to join the two data sets. We formatted the data so the join could happen on the street name. This was an excellent use of Azure ML for processing large data sets, as the estimated join exceeded a million rows. Because this inner join was so crude, we needed to eliminate the incorrect "joins" that were not within proximity of the house address or block. We therefore wrote a custom R script to remove the extraneous rows, and fed it into Azure ML to process the million or so rows efficiently.
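To make the idea concrete, here is a minimal sketch of the join-and-filter step using pandas rather than Azure ML and R. The column names (`street`, `block`, `house_number`, `assessed_value`) and the hundred-block proximity rule are illustrative assumptions, not the actual schemas of the City of Vancouver datasets or the logic of our R script.

```python
# Hypothetical sketch of the street-name join and proximity filter.
# Column names and the hundred-block rule are illustrative only.
import pandas as pd

crimes = pd.DataFrame({
    "street": ["MAIN ST", "MAIN ST", "OAK ST"],
    "block": [100, 4500, 1200],  # hundred-block where the incident occurred
    "crime_type": ["Theft", "Mischief", "Break and Enter"],
})
properties = pd.DataFrame({
    "street": ["MAIN ST", "MAIN ST", "OAK ST"],
    "house_number": [128, 4388, 1225],
    "assessed_value": [850_000, 1_200_000, 990_000],
})

# Crude inner join on street name alone: every crime on a street pairs
# with every property on that street, which is why the result blows up.
joined = crimes.merge(properties, on="street")

# Keep only pairs where the property sits on the same hundred-block,
# mimicking the proximity filter our R script performed.
same_block = joined[joined["house_number"] // 100 * 100 == joined["block"]]
print(same_block[["street", "block", "crime_type", "assessed_value"]])
```

The key point is the two-stage shape of the pipeline: a cheap, over-inclusive join first, then a filter that discards the false matches.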
Once we had this final CSV file, another pair used it to build a crime heat map in Power BI.
Displaying the data
Since the dataset we had chosen contained about a million rows, we needed a good way to represent the data in an easily understandable form. The Microsoft products we were introduced to, such as Excel, Power Query, Power Pivot, Power View, and the Power BI Dashboard, let us quickly visualize our large amount of data in several different formats. We used Power Query to clean up mismatched formats in our data, then used Power Pivot to graph several different representations of it. One of these was an animated scatterplot showing the change and growth in different types of crime over a given range of years. In addition, the Power BI tools let us visualize crime rates in a given area with a heat map, making the data easy to present.
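The data shaping behind that animated scatterplot is essentially a group-and-count. Here is a rough sketch in pandas of the aggregation we did in Power Pivot; the column names and values are illustrative, not the real dataset's schema.

```python
# Hypothetical sketch of the aggregation feeding the crime-over-time
# scatterplot: count incidents per crime type per year.
import pandas as pd

crimes = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2014, 2015],
    "crime_type": ["Theft", "Theft", "Theft",
                   "Mischief", "Mischief", "Theft"],
    "neighbourhood": ["Downtown", "Kitsilano", "Downtown",
                      "Downtown", "Kitsilano", "Downtown"],
})

# One row per (year, crime_type) pair -- one point per frame of the
# animated scatterplot.
by_year = (crimes.groupby(["year", "crime_type"])
                 .size()
                 .reset_index(name="incidents"))
print(by_year)
```

Grouping by `neighbourhood` instead of `crime_type` would give the per-area counts behind the heat map.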
We honestly did not expect to win the competition, as most of us had very little experience with machine learning and big data, not to mention the tools Microsoft has created. We learned a lot about the opportunities available in the industry and the tools it uses, and we had a chance to build a fun widget over the weekend!
Watch the data visualization video: