Mahout for Dummies (2) – Step-by-Step: Mahout and HDInsight Interactive Style

04/14/2014

In the blog series on Mahout for Dummies, we now get our hands dirty. Let’s see Mahout in action on an HDInsight cluster.

1 What is Mahout?
2 Step-by-Step: Mahout with HDInsight Interactive Style
3 Step-by-Step: Mahout with HDInsight PowerShell Style

Step-by-Step: Mahout with HDInsight Interactive Style

But before heading right into Mahout, we need to create the HDInsight cluster. Please note that as of now, Mahout is NOT supported by Microsoft. Mahout is not installed by default on any HDInsight cluster, but it can be installed in various ways, such as connecting to the head node via RDP or using PowerShell (see the next article).

Update: Mahout is contained in the HDInsight Version 3.1 by default; more information can be found in the documentation What’s new in the Hadoop cluster Versions provided by HDInsight? Thus the second step Install Mahout on HDInsight can be skipped.

This article is a step-by-step guide on how to install and use Mahout on an HDInsight Cluster with a specific scenario involving Random Forests.

1. Create HDInsight Cluster

Prerequisite: You have already created a storage account.

Start off by going to the Microsoft Azure portal and clicking on New.

Pick a Name for your HDInsight cluster. Please note that we are using HDInsight Version 2.1. Set the datacenter region to the same region that your storage account is located in, in this case North Europe.

Next, configure the credentials for your HDInsight cluster:

As mentioned above, the prerequisite is that you have already created a storage account. To have a clean slate, I create a default container on which the HDInsight Cluster is based.

The process of creating an HDInsight Cluster includes a few Milestones such as configuring the provisioned virtual machines. The beauty of HDInsight is that you do not need to provision so-and-so-many virtual machines and install Hadoop on them – HDInsight provides this as a Service and does it automatically.

Once created, let us enable a remote desktop connection:

And let’s connect!

2. Install Mahout

Mahout is provided by HDInsight Version 3.1 by default. As you have connected remotely to it, open the file explorer and browse to C:\apps\dist. There you can see a list of Hadoop components supported by HDInsight 3.1:

Hence, you can ignore the rest of this paragraph and skip right to 3. Scenario: Random Forest.

In case you do deal with an earlier HDInsight Version (earlier than 3.1), then follow the steps described in this paragraph.

You can find the latest release version of Mahout on https://mahout.apache.org/ that you can download locally on your computer.

In the head node of your HDInsight cluster (you have connected to it via RDP at the end of 1. Create HDInsight Cluster), open the File Explorer and create a new folder under C:\, such as C:\MyFiles.

Copy the Mahout Distribution zip file into C:\MyFiles in the head node. In this case, the latest release is version 0.9.

Extract the zip-file into C:\apps\dist.

Rename the extracted folder to mahout-x.x, where x.x is the version number.

And that’s it – Mahout is installed on your HDInsight cluster!

3. Scenario: Random Forest

This step-by-step guide is based on the one documented in Mahout – Classifying with random forests, only tailored to HDInsight 2.1.

1. Get Data

Before building a forest model, we need data to learn from as well as data to test our model on. The data sets used here can be downloaded from https://nsl.cs.unb.ca/NSL-KDD/. I am using KDDTrain+.ARFF and KDDTest+.ARFF.

Check that unnecessary lines have been removed from the downloaded files. More specifically, in KDDTrain+.arff remove the first 44 lines (i.e. all lines starting with @attribute). Otherwise, we will not be able to generate a descriptor file later on in step 3.2.
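If you would rather not delete the header lines by hand, the clean-up can be scripted locally before copying the files to the head node. The following is only a minimal sketch, assuming that the header consists of the lines beginning with @ (such as @relation, @attribute and @data) and that comment lines begin with %. The function name and the sample rows are made up for illustration and are not real NSL-KDD data.

```python
def strip_arff_header(lines):
    """Return only the data rows of an ARFF file.

    Assumption: header declarations start with '@' and comments start
    with '%'. Blank lines are dropped as well.
    """
    data_rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("@", "%")):
            continue  # skip header, comment and empty lines
        data_rows.append(stripped)
    return data_rows


# Tiny made-up sample in ARFF layout (not real NSL-KDD rows).
sample = [
    "@relation 'KDDTrain'",
    "@attribute 'duration' real",
    "@attribute 'class' {'normal', 'anomaly'}",
    "@data",
    "0,normal",
    "2,anomaly",
]
cleaned = strip_arff_header(sample)
print(cleaned)
```

For a real file, you would read its lines with open(...), pass them to the function and write the returned rows back out.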

Copy these two data files into C:\MyFiles in the head node (just like with the Mahout Distribution earlier).

Ok, we have our training and test data in our HDInsight cluster, but for Mahout to do its magic on the precious data, the data needs to be in the HDFS (Hadoop Distributed File System). Yes, you are right – in HDInsight, we do not use HDFS; all data is stored in the Azure Blob Storage. However, the HDFS API is still used in HDInsight. So, all we need to do is copy local data into the Blob Storage.

There are many ways of transferring local data into the blob storage. Here, we use the Hadoop shell commands. First we create a directory called testdata. Then we copy the two data files into it.

hdfs dfs -mkdir testdata
hdfs dfs -copyFromLocal C:/MyFiles/KDDTrain+.arff testdata/KDDTrain+.arff
hdfs dfs -copyFromLocal C:/MyFiles/KDDTest+.arff testdata/KDDTest+.arff

Many use the shell command

hadoop dfs

but this command is deprecated in favour of hdfs dfs.

Here’s a tip to avoid typing in the whole path: Copy path in the file explorer.

To see what is stored in all my storage accounts, I often use Cerebrata’s Azure Explorer, but there are many other storage explorers that we recommend, all listed here.

To double check, we see that the copied data files are now located in user/olivia/testdata/. Note that olivia is the user I configured as the remote user, i.e. the user who can remotely connect to the head node of my HDInsight cluster.

The way Mahout is compiled, the data needs to be in user/hdp/ though. In the Hadoop command line we’ll type in:

hdfs dfs -cp wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/olivia/testdata/KDDTrain+.arff ^
    wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/hdp/testdata/KDDTrain+.arff
hdfs dfs -cp wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/olivia/testdata/KDDTest+.arff ^
    wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/hdp/testdata/KDDTest+.arff

More generally, just replace the variables in <> with the names you have chosen accordingly (i.e. container, storage account and remote user):

hdfs dfs -cp wasb://<container>@<storage-account>.blob.core.windows.net/user/<remote-user>/testdata/KDDTrain+.arff ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/hdp/testdata/KDDTrain+.arff
hdfs dfs -cp wasb://<container>@<storage-account>.blob.core.windows.net/user/<remote-user>/testdata/KDDTest+.arff ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/hdp/testdata/KDDTest+.arff
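Since the same three names recur in every path, it can help to generate the source and target URIs instead of typing them. The sketch below only assembles strings following the wasb pattern shown above; the helper names wasb_uri and copy_pairs are hypothetical and not part of any Hadoop or Azure tooling.

```python
def wasb_uri(container, storage_account, path):
    """Build a wasb:// URI following the pattern used in the commands above."""
    return "wasb://{}@{}.blob.core.windows.net/{}".format(
        container, storage_account, path.lstrip("/")
    )


def copy_pairs(container, storage_account, remote_user, relative_paths):
    """Pair each file under user/<remote_user>/ with its target under user/hdp/."""
    pairs = []
    for rel in relative_paths:
        src = wasb_uri(container, storage_account,
                       "user/{}/{}".format(remote_user, rel))
        dst = wasb_uri(container, storage_account, "user/hdp/{}".format(rel))
        pairs.append((src, dst))
    return pairs


# Names taken from the example in this article.
files = ["testdata/KDDTrain+.arff", "testdata/KDDTest+.arff"]
pairs = copy_pairs("oliviakrf", "oliviakstor", "olivia", files)
for src, dst in pairs:
    # Print the command text only; nothing is executed here.
    print("hdfs dfs -cp", src, dst)
```

The printed lines can then be pasted into the Hadoop command line one at a time.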

Checking in Azure Explorer, the two data files can be found under user/hdp/testdata as desired.

2. Generate descriptor file

Before building a random forest model based on the training data in KDDTrain+.arff, a descriptor file is essential. Why? When building the model, every attribute in the training data needs to be labelled so that the algorithm knows which one is numerical, which one is categorical and which one is the label.

The command is as follows:

hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar ^
    org.apache.mahout.classifier.df.tools.Describe ^
    -p wasb:///user/hdp/testdata/KDDTrain+.arff ^
    -f testdata/KDDTrain+.info ^
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Here, the main class of org.apache.mahout.classifier.df.tools.Describe is invoked; for more information on the source code, check out Mahout’s GitHub site on Describe.java. It takes three mandatory arguments, namely p (short for path), f (short for file) and d (short for descriptor). Optional arguments are h (help) and r (regression). The p argument specifies the path where the data to be described is located, f defines the location of the generated descriptor file, and d provides information on all attributes of the given data, where N=numerical, C=categorical and L=label. A number in front of a letter repeats that attribute type. More specifically, “N 3 C 2 N C 4 N C 8 N 2 C 19 N L” means that the given data set starts off with a numerical attribute (N), followed by 3 categorical attributes (3 C), then 2 numerical attributes (2 N), etc., and ends with a label (L).
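To double check a descriptor before running Describe, you can expand the shorthand into one letter per attribute and count the columns. The snippet below is a small illustration of the notation as explained above, not Mahout’s own parser; the function name expand_descriptor is made up.

```python
def expand_descriptor(descriptor):
    """Expand shorthand such as 'N 3 C L' into ['N', 'C', 'C', 'C', 'L'].

    A number applies to the letter that follows it; a letter without a
    preceding number counts once.
    """
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)  # remember the multiplier for the next letter
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


columns = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(columns), "attributes in total")
print(columns.count("N"), "numerical,", columns.count("C"), "categorical,",
      columns.count("L"), "label")
```

The total should match the number of comma-separated values in each row of the data file; if it does not, the descriptor and the data are out of step.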

Update: Note that if you use an HDInsight cluster based on version 3.1, change the path specifying the mahout jar file to
C:\apps\dist\mahout-0.9.0.2.1.5.0-2057\core\target\mahout-core-0.9.0.2.1.5.0-2057-job.jar

Since the descriptor file also needs to be in the directory user/hdp/, you can either copy the generated descriptor file into user/hdp/ afterwards, or set parameter f to wasb:///user/hdp/testdata/KDDTrain+.info right away. To copy the file afterwards:

hdfs dfs -cp ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/<remote-user>/testdata/KDDTrain+.info ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/hdp/testdata/KDDTrain+.info

Or generating the descriptor file in user/hdp/ straight away:

hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar ^
    org.apache.mahout.classifier.df.tools.Describe ^
    -p wasb:///user/hdp/testdata/KDDTrain+.arff ^
    -f wasb:///user/hdp/testdata/KDDTrain+.info ^
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Update: If you use an HDInsight cluster of version 3.1, use the following path specifying the Mahout jar:

C:\apps\dist\mahout-0.9.0.2.1.5.0-2057\core\target\mahout-core-0.9.0.2.1.5.0-2057-job.jar

Checking in the Azure Explorer, we see KDDTrain+.info in the Directory user/hdp/testdata:

3. Build forest

Now we can finally build the random forest using the following command in the Hadoop command line:

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar ^
    org.apache.mahout.classifier.df.mapreduce.BuildForest ^
    -Dmapred.max.split.size=1874231 ^
    -d wasb:///user/hdp/testdata/KDDTrain+.arff ^
    -ds wasb:///user/hdp/testdata/KDDTrain+.info ^
    -sl 5 -p -t 100 -o nsl-forest

You can look into the source code of the class used here (BuildForest) on GitHub. The mandatory arguments of the main class in BuildForest are d (data), ds (dataset), t (nbtrees) and o (output). A more comprehensive list of arguments is the following:

Short name	Long name	Mandatory / Optional	Description
d	data	Mandatory	Data path
ds	dataset	Mandatory	Dataset path
sl	selection	Optional	#Variables to select randomly at each tree node. Classification: Default = square root of #explanatory vars. Regression: Default = 1/3 of #explanatory vars.
nc	no-complete	Optional	Tree is not complete.
ms	minsplit	Optional	Tree node is not divided, if branching data size < given value. Default: 2.
mp	minprop	Optional	Tree node is not divided, if Proportion of the variance of branching data < given value. Used for Regression. Default: 0.001.
sd	seed	Optional	Seed value used to initialise the random number generator.
p	partial	Optional	Use partial data implementation
t	nbtrees	Mandatory	#Trees to grow
o	output	Mandatory	Output path that will contain the decision forest
h	help	Optional	Help

In other words, a random forest model with 100 trees (-t) is computed from the data in KDDTrain+.arff, using the additional description information in KDDTrain+.info, and saved in the directory nsl-forest/ (-o). The computation uses the partial implementation (-p) and randomly selects 5 attributes at each tree node when deciding how to split (-sl). The setting -Dmapred.max.split.size tells Hadoop the maximum size of each partition of the input data, here 1,874,231 bytes. This is 1/10 of the size of the dataset; thus, 10 partitions are being used.
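The relation between the split size and the number of partitions is plain arithmetic, sketched below. The dataset size of 18,742,310 bytes is an assumed figure chosen so that the numbers match the command; check the actual size of your own file. Both function names are made up for this illustration.

```python
def max_split_size(dataset_bytes, partitions):
    """Smallest split size (in bytes) that yields at most `partitions` splits."""
    return -(-dataset_bytes // partitions)  # integer ceiling division


def partition_count(dataset_bytes, split_bytes):
    """Number of splits needed to cover the dataset at the given split size."""
    return -(-dataset_bytes // split_bytes)


# Assumed dataset size, for illustration only.
dataset_bytes = 18_742_310

split = max_split_size(dataset_bytes, 10)
print("-Dmapred.max.split.size={}".format(split))
print("partitions:", partition_count(dataset_bytes, split))
```

A larger split size gives fewer partitions, and the other way around.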

Update: As mentioned in 3.2 Generate Descriptor File , if you use an HDInsight cluster of version 3.1, use the following path for the mahout example jar:

C:\apps\dist\mahout-0.9.0.2.1.5.0-2057\examples\target\mahout-examples-0.9.0.2.1.5.0-2057-job.jar

The result in the Hadoop command line will look like this:

At the end of the output, we can see how long it took to build the forest and also obtain further information on the forest, such as the number of nodes or the mean maximum depth of the trees.

To use the generated forest for classifying unknown test data, we copy it via hdfs commands into user/hdp/ as follows:

hdfs dfs -cp ^
    wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/olivia/nsl-forest ^
    wasb://oliviakrf@oliviakstor.blob.core.windows.net/user/hdp/nsl-forest

or more generally speaking:

hdfs dfs -cp ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/<remote-user>/nsl-forest ^
    wasb://<container>@<storage-account>.blob.core.windows.net/user/hdp/nsl-forest

In Azure Explorer you can see that the generated forest model (forest.seq) is stored in both user/olivia/ and user/hdp/.

4. Classify new data

We have generated a forest model in the step before in order to automatically classify new incoming data, i.e. KDDTest+.arff. The command we use here is

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar ^
    org.apache.mahout.classifier.df.mapreduce.TestForest ^
    -i wasb:///user/hdp/testdata/KDDTest+.arff ^
    -ds wasb:///user/hdp/testdata/KDDTrain+.info ^
    -m wasb:///user/hdp/nsl-forest ^
    -a -mr -o predictions

Update: If you use a cluster of HDInsight 3.1, use the following path name for the mahout examples jar file:

C:\apps\dist\mahout-0.9.0.2.1.5.0-2057\examples\target\mahout-examples-0.9.0.2.1.5.0-2057-job.jar

As usual more information can be found on Mahout’s GitHub site, more concretely in TestForest.java. What do the arguments mean? The mandatory arguments are input (-i) for the test data location, dataset (-ds) for the descriptor file location, model (-m) for the forest model location and output (-o) for the output location; optional boolean arguments are analyze (-a) for analysing the classification results, i.e. computing the confusion matrix, and mapreduce (-mr) to use Hadoop to distribute classification.

In this case, predictions are computed for the new test data located in wasb:///user/hdp/testdata/KDDTest+.arff with its associated descriptor file in wasb:///user/hdp/testdata/KDDTrain+.info using the previously built random forest in wasb:///user/hdp/nsl-forest; the output predictions are then stored in a text file in the directory predictions/. Additionally, a confusion matrix is computed (as you can see below in the Hadoop command line) and classification is being distributed using Hadoop.

The predictions are stored in user/olivia/predictions:

5. Woah, what’s happening?

Ok, so what just happened? Let’s first have a look at the summary in the Hadoop command line and another closer look at the output file containing the predictions.

The test data in KDDTest+.arff contained 22,544 instances that were classified in 3.4 Classify new data, which you can also see in the first section, Summary, under Total Classified Instances. Of these, 17,783 instances (78.9%) were correctly classified, whereas 4,761 (21.1%) were incorrectly classified.

More details are provided in the confusion matrix, in nicer view:

In other words, 9,458 normal instances were correctly classified but 253 normal instances were incorrectly classified as anomaly, adding up to 9,711 actual normal instances. There are 17,783 correctly classified instances (= 9,458 + 8,325, i.e. normal-normal + anomaly-anomaly) compared to 4,761 (= 4,508 + 253) incorrectly classified instances.
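The figures above can be re-derived from the four cells of the confusion matrix. The sketch below does this in plain Python; the kappa value follows the textbook definition of Cohen’s kappa, which is an assumption on my part and may not be identical to the number Mahout prints.

```python
def summarize(normal_as_normal, normal_as_anomaly,
              anomaly_as_anomaly, anomaly_as_normal):
    """Derive summary figures from a 2x2 confusion matrix."""
    total = (normal_as_normal + normal_as_anomaly
             + anomaly_as_anomaly + anomaly_as_normal)
    correct = normal_as_normal + anomaly_as_anomaly
    accuracy = correct / total

    # Agreement expected by chance, from the row and column totals.
    actual_normal = normal_as_normal + normal_as_anomaly
    actual_anomaly = anomaly_as_anomaly + anomaly_as_normal
    predicted_normal = normal_as_normal + anomaly_as_normal
    predicted_anomaly = normal_as_anomaly + anomaly_as_anomaly
    expected = (actual_normal * predicted_normal
                + actual_anomaly * predicted_anomaly) / total ** 2

    return {
        "total": total,
        "correct": correct,
        "incorrect": total - correct,
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Cell values as quoted in this article.
stats = summarize(9458, 253, 8325, 4508)
print("correct: {correct} of {total}, incorrect: {incorrect}".format(**stats))
print("accuracy: {:.1%}".format(stats["accuracy"]))
```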

The remaining statistical measures (Kappa and reliability) indicate the degree of overall consistency of the classification, more specifically the agreement between raters (here, between the predicted and the actual categories).

And finally, what about the predictions that have been saved as an output of classifying the test data?

After some converting, you obtain a list of numbers of type double. What each double number indicates is the predicted category of each data instance, where 1.0 denotes the category anomaly.
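Once converted, the predictions can be tallied with a few lines of Python. This is only a sketch under the assumption that the converted output holds one double per line; the sample values are made up, and the only meaning taken from the article is that 1.0 denotes anomaly.

```python
from collections import Counter

ANOMALY = 1.0  # per the article, 1.0 denotes the category anomaly


def tally_predictions(lines):
    """Count anomaly predictions versus all other predictions.

    Assumption: each non-empty line holds exactly one double value.
    """
    values = [float(line) for line in lines if line.strip()]
    counts = Counter(values)
    return counts[ANOMALY], len(values) - counts[ANOMALY]


# Made-up sample, not actual output from the cluster.
sample = ["1.0", "0.0", "1.0", "", "0.0", "1.0"]
anomalies, others = tally_predictions(sample)
print("anomaly:", anomalies, "other:", others)
```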

4. Wrapping up…

We have created an HDInsight cluster such that the Mahout library (in this case version 0.9) could subsequently be installed. The Mahout library is very extensive and can be explored in its full glory on its GitHub site. Here, we went through a scenario using one of many Machine Learning religions, namely the Random Forest, based on the random forest tutorial on the Mahout site but tailored to HDInsight.

Update: There is an extensive guide on how to use Mahout on HDInsight to generate movie recommendations, found here in the Azure documentation – highly recommended!

In the next Mahout article, we will explore the use of Mahout through the awesomeness of PowerShell.