This guest post is by the faculty of our upcoming Data Science MOOC, Dr. Stephen Elston, Managing Director at Quantia Analytics & Professor Cynthia Rudin from M.I.T.
Aspiring data scientists can improve their skills with the upcoming edX course where we will delve deeply into the tools you need for effective data munging, along with other essential skills. Don’t forget to register!
Data fuels data science, and clean, complete data is essential for successful machine learning (ML). Preparing your data for ML takes the right skills and tools, and the payoff for cleaning, integrating, and transforming your data is better ML model performance. In short, if you are an aspiring data scientist you must master data munging.
The data preparation process, or data munging, is iterative and sometimes time-consuming; you can easily spend 80% of the time on a data science project on just this aspect. As you proceed through data exploration, model construction and model evaluation, you keep improving the data. At each step you evaluate the quality and suitability of the data and determine what additional improvements are needed to achieve the desired results. For example, you might find that certain data points bias the results of an ML model. You will then need to create a filter to deal with this problem.
Aspiring data scientists must have a good data munging toolkit available. R and Python tools are widely employed today. The R dplyr and tidyr packages and the Python pandas package are ideal for many data munging tasks. Additionally, Microsoft Azure ML Studio provides a drag-and-drop environment with powerful data munging modules.
The skills to perform data munging are even more important than having an excellent toolkit. As you develop your data science skills you must develop an understanding of at least the following processes:
- Treating missing values.
- Ensuring consistent coding of features.
- Cleaning outliers and errors.
- Joining data from multiple sources and tables.
- Scaling features.
- Binning or coding of categorical features.
A data scientist must know how and when to apply these processes. The need for these processes is based on a deep understanding of the relationships in the data and a careful evaluation of model performance.
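As a rough illustration of several of these processes, here is a minimal pandas sketch. The two toy tables and every column name in them (`building_id`, `orientation`, `wall_area`, `heating_load`) are invented for this example and are not the data set discussed below. The median fill, the 1,000 cutoff, and the bin edges are arbitrary choices that you would revisit for real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; names and values are invented for illustration.
features = pd.DataFrame({
    "building_id": [1, 2, 3, 4, 5],
    "orientation": ["North", " north", "SOUTH", "east", "East "],
    "wall_area": [294.0, np.nan, 318.5, 9999.0, 343.0],
})
labels = pd.DataFrame({
    "building_id": [1, 2, 3, 4, 5],
    "heating_load": [15.5, 20.8, 21.4, 19.9, 28.3],
})

# Consistent coding: trim whitespace and unify case.
features["orientation"] = features["orientation"].str.strip().str.lower()

# Outliers and errors: drop implausible wall areas but keep missing values
# for now. The 1,000 cutoff is an arbitrary threshold for this toy data.
plausible = features["wall_area"].isna() | (features["wall_area"] < 1000.0)
cleaned = features[plausible].copy()

# Missing values: fill with the column median (one option among several).
cleaned["wall_area"] = cleaned["wall_area"].fillna(cleaned["wall_area"].median())

# Joining: bring in the label from a second table on a shared key.
merged = cleaned.merge(labels, on="building_id", how="inner")

# Binning: turn a numeric column into a two-level categorical one.
merged["load_bin"] = pd.cut(merged["heating_load"], bins=[0, 20, 40],
                            labels=["low", "high"])
print(merged)
```

The order matters here: the outlier is removed before the median is computed, so the bad value does not leak into the imputed one.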
To illustrate these points, let’s have a look at a data munging example. A workflow in an experiment in Microsoft Azure ML Studio is shown in the figure below.
These data contain eight physical characteristics of 768 simulated buildings. These characteristics, or features, are used to predict the buildings’ heating load and cooling load, measures of energy efficiency. The ability to predict a building’s energy efficiency is valuable in a number of circumstances. For example, architects may need to compare the energy efficiency of several building designs before selecting a final approach.
You can find these data as a sample data set in the Azure ML Studio, or download them from the UCI Machine Learning Repository. These data are discussed in the paper by A. Tsanas, A. Xifara: “Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools”, Energy and Buildings, Vol. 49, pp. 560-567, 2012.
Before we can perform meaningful data visualization or ML, these data must be prepared. This experiment contains multiple data munging steps. The need for these steps is determined through careful examination of the data and evaluation of ML model performance. In this experiment, most data preparation is performed with the easy-to-use built-in Azure ML modules. Some more specific transformations are performed using the R language. We could just as well use the Python language.
First, we remove the cooling load column, as this variable is highly collinear with the label column, heating load.
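Outside Azure ML Studio, a check like this might be sketched in pandas as follows. The data here are synthetic, the column names `Heating Load` and `Cooling Load` are assumed for illustration, and the 0.9 cutoff is an arbitrary choice rather than a rule from the experiment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two load columns; values are invented.
rng = np.random.default_rng(0)
heating = rng.uniform(10.0, 40.0, size=50)
cooling = heating + rng.normal(0.0, 1.0, size=50)  # nearly a copy of heating
frame = pd.DataFrame({"Heating Load": heating, "Cooling Load": cooling})

# Pearson correlation between the label and the candidate column.
corr = frame["Heating Load"].corr(frame["Cooling Load"])
print(f"correlation: {corr:.3f}")

# Drop the column if it is almost a linear copy of the label.
# The 0.9 cutoff is an arbitrary choice for this sketch.
if abs(corr) > 0.9:
    frame = frame.drop(columns=["Cooling Load"])
```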
Next, we use a Metadata Editor module to convert some of the features to categorical. Even though these features have numerical values, the numbers are not meaningful as quantities. For example, building orientation refers to how the building is oriented. The numerical value is only a code and tells us nothing about an orientation direction.
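In pandas, a comparable conversion could look like the sketch below. The small sample frame and its column names are hypothetical. The indicator-column encoding at the end is one common way to hand categorical features to a model, not necessarily what the Azure ML modules do internally.

```python
import pandas as pd

# Hypothetical sample; the orientation codes are labels, not quantities.
frame = pd.DataFrame({
    "Orientation": [2, 3, 4, 5, 2, 3],
    "Wall Area": [294.0, 318.5, 343.0, 294.0, 318.5, 343.0],
})

# Mark the column as categorical so its codes are not treated as magnitudes.
frame["Orientation"] = frame["Orientation"].astype("category")

# One indicator column per category, a common encoding for ML models.
dummies = pd.get_dummies(frame["Orientation"], prefix="Orientation")
print(dummies.columns.tolist())
```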
We use another Metadata Editor module to change some column names. Specifically, we remove any spaces, which can cause problems with R.
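The same renaming can be sketched in pandas with a single `rename` call. The sample rows below are made up; only the idea of stripping spaces from the column names comes from the step above.

```python
import pandas as pd

# Made-up sample rows with spaces in the column names.
frame = pd.DataFrame({
    "Relative Compactness": [0.98, 0.90],
    "Surface Area": [514.5, 563.5],
    "Wall Area": [294.0, 318.5],
})

# Remove every space from every column name; the original frame is unchanged.
renamed = frame.rename(columns=lambda name: name.replace(" ", ""))
print(renamed.columns.tolist())
```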
Finally, we scale and zero-center the numerical columns. Depending on the units, some features can have numerical values in the tens, while others can be in the thousands or millions. Scaling and centering, or normalization, keeps the features with the largest numerical ranges from dominating the training of an ML model with numeric features.
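One common way to scale and center is the z-score, sketched here in pandas. The two columns and their values are invented for the example, and the Azure ML module offers other normalization methods as well, so treat this as one possible choice.

```python
import pandas as pd

# Invented numeric columns on very different scales.
frame = pd.DataFrame({
    "SurfaceArea": [514.5, 563.5, 612.5, 661.5],
    "OverallHeight": [7.0, 7.0, 3.5, 3.5],
})

# z-score: subtract the column mean, divide by the column standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
scaled = (frame - frame.mean()) / frame.std()
print(scaled)
```

After this step every column has mean zero and unit standard deviation, so no column dominates simply because of its units.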
Let’s have a more detailed look at one of these data munging steps. Running in the Execute R Script module, the code listed below adds some new features to the data set. Namely, this code computes new features (columns) containing the square of the original feature values. We can test if these columns improve the performance of the model.
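library(dplyr)  # provides mutate()
## Read the data frame from the first input port of the module.
eeframe <- maml.mapInputPort(1)
## Add some polynomial columns to the data frame.
eeframe <- mutate(eeframe,
                  RelativeCompactnessSqred = RelativeCompactness^2,
                  SurfaceAreaSqred = SurfaceArea^2,
                  WallAreaSqred = WallArea^2)
## Send the result to the output port of the module.
maml.mapOutputPort("eeframe")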
Alternatively, we can create the same new features using the following Python code in an Execute Python Script module:
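## Entry point that the Execute Python Script module calls.
def azureml_main(frame1):
    ## Add some polynomial columns to the data frame.
    sqrList = ["Relative Compactness", "Surface Area", "Wall Area"]
    sqredList = ["Relative Compactness Sqred", "Surface Area Sqred",
                 "Wall Area Sqred"]
    frame1[sqredList] = frame1[sqrList]**2
    ## The module expects the data frame wrapped in a sequence.
    return [frame1]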
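Whether squared features like these actually help is an empirical question. The sketch below shows one simple way to check, using NumPy least squares on synthetic data in which the target really does depend on the square of the feature. The data, the alternating train/test split, and the helper names `design` and `holdout_rmse` are all assumptions made for this illustration; they are not the model or the evaluation used in the experiment.

```python
import numpy as np

# Synthetic data: the target depends on the square of the feature.
x = np.linspace(0.6, 1.0, 40)
y = 50.0 * x**2 - 30.0 * x + 0.1 * np.sin(37.0 * x)  # small deterministic wiggle

# Alternate rows between the training set and the held-out set.
train, held_out = np.arange(0, 40, 2), np.arange(1, 40, 2)

def design(values, with_square):
    """Build a design matrix with an intercept and an optional squared column."""
    columns = [np.ones_like(values), values]
    if with_square:
        columns.append(values**2)
    return np.column_stack(columns)

def holdout_rmse(with_square):
    """Fit on the training rows and return the RMSE on the held-out rows."""
    coef, *_ = np.linalg.lstsq(design(x[train], with_square), y[train], rcond=None)
    residual = y[held_out] - design(x[held_out], with_square) @ coef
    return float(np.sqrt(np.mean(residual**2)))

rmse_linear = holdout_rmse(with_square=False)
rmse_squared = holdout_rmse(with_square=True)
print(f"RMSE without squared column: {rmse_linear:.3f}")
print(f"RMSE with squared column:    {rmse_squared:.3f}")
```

On real data the improvement is not guaranteed, which is exactly why the comparison is done on rows the model did not see during fitting.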
Data munging is an essential data science skill. Successful ML requires properly prepared data. Data munging is a multi-faceted, systematic process. Enjoy your exploration of data munging.
Stephen & Cynthia