Mahout for Dummies (3) – Step-by-Step: Mahout with HDInsight PowerShell Style

The blog series Mahout for Dummies explores the options for using Mahout on HDInsight.

Contents

1 What is Mahout?
2 Step-by-Step: Mahout with HDInsight Interactive Style
3 Step-by-Step: Mahout with HDInsight PowerShell Style

 

Step-by-Step: Mahout with HDInsight PowerShell Style

In this episode of the series Mahout for Dummies, we run Mahout on HDInsight from PowerShell. In particular, we go through the Random Forest scenario detailed in the previous post.

  1. Upload Data
  2. Create HDInsight Cluster
  3. Mahout: general PowerShell command
  4. Scenario: Random Forest
    1. Build forest
    2. Classify test data
  5. Clean up
  6. Scenario: Recommender Job
  7. Wrapping up…

1. Upload Data

Here, we upload to Azure blob storage all the data needed to build a random forest model and then test it, i.e. the training data and the test data. Note that the storage account details (e.g. container name and storage context) must already be known.
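If the storage context is not at hand yet, it can be set up along the lines of the recommender script in section 6 of this post. This is a minimal sketch, assuming you are already logged in with Add-AzureAccount and have selected a subscription; the angle-bracket values are placeholders.

```powershell
# Placeholders, as in the rest of this post
$storageAccount = "<StorageAccountName>"
$containerName  = "<StorageContainerName>"

# Look up the primary key and build the storage context used by the upload cmdlets
$storageKey     = Get-AzureStorageKey $storageAccount | %{ $_.Primary }
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey $storageKey
```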

    ## 1. File Paths
    # Data stored locally
    $localTrain = "C:\<TrainingDataPath>\KDDTrain+.arff"
    $localTest = "C:\<TestDataPath>\KDDTest+.arff"

    # Data to be stored in Azure Blob Storage
    $blobTrain = "testdata/KDDTrain+.arff"
    $blobTest = "testdata/KDDTest+.arff"

    ## 2. Upload file from local to Azure Blob Storage
    Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
        -Blob $blobTrain -Context $storageContext
    Set-AzureStorageBlobContent -File $localTest -Container $containerName `
        -Blob $blobTest -Context $storageContext


Since Mahout is not installed on any HDInsight cluster by default (and hence not supported by Microsoft), the Mahout jar files also have to be uploaded to blob storage.

    ## 1. File Paths
    # Mahout jar files stored locally
    $localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"
    $localMahoutEx = "C:\<PathToMahoutDistribution>\mahout-examples-0.9-job.jar"

    # Mahout jar files to be stored in Azure Blob Storage
    $blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
    $blobMahoutEx = "mahout/mahout-examples-0.9-job.jar"

    ## 2. Upload file from local to Azure Blob Storage
    Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName `
        -Blob $blobMahoutJar -Context $storageContext
    Set-AzureStorageBlobContent -File $localMahoutEx -Container $containerName `
        -Blob $blobMahoutEx -Context $storageContext

 


2. Create HDInsight Cluster

We create a simple HDInsight cluster, as in the Azure PowerShell Series: Simple HDInsight. Alternatively, you could create one with additional functionality; see Azure PowerShell Series: Custom Create HDInsight.

    # Input
    $clusterName = "<HDInsightClusterName>"
    $clusterCreds = Get-Credential
    $numNodes = 4

    # Simple create
    New-AzureHDInsightCluster -Name $clusterName -Subscription $subID `
        -Location $location -DefaultStorageAccountName $storageAccount `
        -DefaultStorageAccountKey $storageKey `
        -DefaultStorageContainerName $containerName -Credential $clusterCreds `
        -ClusterSizeInNodes $numNodes -Version 2.1

In the Azure Explorer, you observe some libraries being uploaded, such as mapred, hive, etc.

[Screenshot: Azure Explorer showing the uploaded libraries]

Just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style, both the training and the test data need to be located in the directory user/hdp/:

    $blobHDPtrain = "user/hdp/testdata/KDDTrain+.arff"
    $blobHDPtest = "user/hdp/testdata/KDDTest+.arff"
    Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
        -Blob $blobHDPtrain -Context $storageContext
    Set-AzureStorageBlobContent -File $localTest -Container $containerName `
        -Blob $blobHDPtest -Context $storageContext

 

3. Mahout: General PowerShell Command

The typical command for invoking Mahout from the Hadoop Command Line via RDP connection looks as follows:

    hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p wasb:///user/hdp/testdata/KDDTrain+.arff ...

Thus, it is an ordinary command running the program contained in the specified JAR file. org.apache.mahout.classifier.df.tools.Describe is the class being invoked, followed by mandatory and optional arguments. Translated into PowerShell:

    $mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "<PathToMahoutJAR>/mahout-core-0.9-job.jar" `
        -ClassName "<ClassName>" `
        -Arguments "-p wasb:///user/hdp/testdata/KDDTrain+.arff …"

In the case above, this translates into the following PowerShell command:

    $mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
        -ClassName "org.apache.mahout.classifier.df.tools.Describe" `
        -Arguments "-p wasb:///user/hdp/$blobTrain -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L"

or a little more elaborate:

    $mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
        -ClassName "org.apache.mahout.classifier.df.tools.Describe"

    # path to training data
    $mahoutJob.Arguments.Add("-p")
    $mahoutJob.Arguments.Add("wasb:///user/hdp/$blobTrain")

    # path to generated descriptor file
    $mahoutJob.Arguments.Add("-f")
    $mahoutJob.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")

    # attributes of given training data
    $mahoutJob.Arguments.Add("-d")
    $mahoutJob.Arguments.Add("N")
    $mahoutJob.Arguments.Add("3")
    $mahoutJob.Arguments.Add("C")
    $mahoutJob.Arguments.Add("2")
    $mahoutJob.Arguments.Add("N")
    $mahoutJob.Arguments.Add("C")
    $mahoutJob.Arguments.Add("4")
    $mahoutJob.Arguments.Add("N")
    $mahoutJob.Arguments.Add("C")
    $mahoutJob.Arguments.Add("8")
    $mahoutJob.Arguments.Add("N")
    $mahoutJob.Arguments.Add("2")
    $mahoutJob.Arguments.Add("C")
    $mahoutJob.Arguments.Add("19")
    $mahoutJob.Arguments.Add("N")
    $mahoutJob.Arguments.Add("L")
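The long run of Add calls for -d can also be written as a loop over the descriptor string. The descriptor itself is a run-length shorthand: N is a numerical attribute, C a categorical one, L the label, and a number repeats the token that follows it. The sketch below illustrates both points. It uses a stand-in object instead of a real job definition so that it runs without an Azure session, and Expand-MahoutDescriptor is a hypothetical helper name, not part of Mahout or the Azure cmdlets.

```powershell
# Stand-in for the job definition so the sketch runs without an Azure session;
# with the real cmdlet you would use the $mahoutJob object instead.
$job = [pscustomobject]@{
    Arguments = New-Object 'System.Collections.Generic.List[string]'
}

$descriptor = "N 3 C 2 N C 4 N C 8 N 2 C 19 N L"

# Add "-d" and then every descriptor token as a separate argument
$job.Arguments.Add("-d")
foreach ($token in $descriptor.Split(" ")) {
    $job.Arguments.Add($token)
}

# Hypothetical helper: expand the shorthand, e.g. "3 C" becomes C C C
function Expand-MahoutDescriptor {
    param([string]$Descriptor)
    $expanded = New-Object 'System.Collections.Generic.List[string]'
    $repeat = 1
    foreach ($token in ($Descriptor -split '\s+' | Where-Object { $_ })) {
        if ($token -match '^\d+$') {
            # a number sets the repeat count for the next token
            $repeat = [int]$token
        } else {
            for ($i = 0; $i -lt $repeat; $i++) { $expanded.Add($token) }
            $repeat = 1
        }
    }
    return $expanded.ToArray()
}

$attributes = Expand-MahoutDescriptor $descriptor
"{0} job arguments, {1} data columns (including the label)" -f $job.Arguments.Count, $attributes.Count
```

Expanding the descriptor is a quick sanity check that it covers every column of the training file before the job is submitted.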

Note that the PowerShell cmdlets so far have only defined the job, not triggered it. The Hadoop job is started by the following command:

    $mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
        -JobDefinition $mahoutJob -Credential $clusterCreds

To wait automatically for the HDInsight job to finish, you can insert the following:

    Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600

This gives the HDInsight job an hour (i.e. 3600 seconds) to complete. You can print any error output as follows:

    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
        -JobId $mahoutJobProcessing.JobId -StandardError

 

4. Scenario: Random Forest

In the previous section, we elaborated on how to construct a Mahout Job as a PowerShell command. Here, we go through an example using the Random Forest, just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style – Scenario Random Forest.

4.1. Build Forest

As a reminder, the command we used to build a forest in Interactive Style is the following:

    hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d wasb:///user/hdp/testdata/KDDTrain+.arff -ds wasb:///user/hdp/testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest

Thus, the “translated” PowerShell command is

    ## build forest
    $mahoutForest = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
        -ClassName "org.apache.mahout.classifier.df.mapreduce.BuildForest"

    # maximum size of each data partition (input split), in bytes
    $mahoutForest.Arguments.Add("-Dmapred.max.split.size=1874231")

    # data path
    $mahoutForest.Arguments.Add("-d")
    $mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.arff")

    # dataset path
    $mahoutForest.Arguments.Add("-ds")
    $mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")

    # number of variables being randomly selected at each node
    $mahoutForest.Arguments.Add("-sl")
    $mahoutForest.Arguments.Add("5")

    # flag for partial implementation
    $mahoutForest.Arguments.Add("-p")

    # number of trees
    $mahoutForest.Arguments.Add("-t")
    $mahoutForest.Arguments.Add("100")

    # output path for generated forest
    $mahoutForest.Arguments.Add("-o")
    $mahoutForest.Arguments.Add("nsl-forest")

    # start job
    $mahoutForestProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
        -JobDefinition $mahoutForest

    # wait for job
    Wait-AzureHDInsightJob -Subscription $subID -Job $mahoutForestProcessing `
        -WaitTimeoutInSeconds 3600

    # print out error if any
    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
        -JobId $mahoutForestProcessing.JobId -StandardError

The output in PowerShell should look like this:

[Screenshot: PowerShell output of the build-forest job]
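A note on -Dmapred.max.split.size: the value caps the size in bytes of each data partition handed to a mapper, so it determines how many partitions the partial implementation works on. As far as I can tell, the value 1874231 corresponds to roughly one tenth of the training file, i.e. ten partitions. The sketch below shows the arithmetic for a different file or partition count. Get-MaxSplitSize is a hypothetical helper name, and the file size is an assumed figure chosen for illustration, not a measured one.

```powershell
# Hypothetical helper: split size (in bytes) that yields the requested number of partitions
function Get-MaxSplitSize {
    param([long]$FileBytes, [int]$Partitions)
    return [long][math]::Ceiling($FileBytes / $Partitions)
}

# Assumed size for illustration; for a real file use (Get-Item $localTrain).Length
$fileBytes = 18742306
$splitSize = Get-MaxSplitSize -FileBytes $fileBytes -Partitions 10

# Number of partitions that result from this split size
$partitions = [math]::Ceiling($fileBytes / $splitSize)

# Value to pass as the first job argument
"-Dmapred.max.split.size=$splitSize"
```

Fewer, larger partitions give each tree more data to learn from; more, smaller partitions spread the work across more mappers.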

 

4.2. Classify Test Data

The “converted” PowerShell command of the classifying command proposed in Interactive Style is as follows:

    $mahoutClassify = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
        -ClassName "org.apache.mahout.classifier.df.mapreduce.TestForest"

    $mahoutClassify.Arguments.Add("-i")
    $mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTest+.arff")
    $mahoutClassify.Arguments.Add("-ds")
    $mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")
    $mahoutClassify.Arguments.Add("-m")
    $mahoutClassify.Arguments.Add("wasb:///user/hdp/nsl-forest")
    $mahoutClassify.Arguments.Add("-a")
    $mahoutClassify.Arguments.Add("-mr")
    $mahoutClassify.Arguments.Add("-o")
    $mahoutClassify.Arguments.Add("predictions")

    $mahoutClassifyJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutClassify
    Wait-AzureHDInsightJob -Job $mahoutClassifyJob -WaitTimeoutInSeconds 3600
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutClassifyJob.JobId -StandardError

[Screenshots: PowerShell output of the classification job]

Note that the output shown above has the same format as in the previous post and yields very similar results to the run in interactive style.

5. Clean Up

Cleaning up involves removing the HDInsight cluster as well as temporary directories. While the PowerShell command for deleting a single file is pretty straightforward, i.e.

    Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $file

deleting a folder structure comprises a loop in which every single file with specified file path prefix is removed.

    ## a. Remove temp directory
    $blobPrefix = "user/hdp/temp"
    $tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix $blobPrefix
    Write-Host "Removing temp directory"
    foreach ($item in $tempFiles)
    {
        $tmpFile = $item.Name
        Write-Host "Deleting $tmpFile"
        Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
    }

    ## b. Delete HDInsight cluster
    Remove-AzureHDInsightCluster -Name $clusterName

6. Scenario: Recommender Job

As we saw in the first part of our Mahout for Dummies series, there are many algorithms included in the Mahout library other than the random forest.

In the blog by the Big Data Support team at Microsoft, there is a good post demonstrating the use of the RecommenderJob class on an HDInsight Cluster using PowerShell that you can read here. The source code of the RecommenderJob class can be looked up here on GitHub.

In this scenario, we are given two data files: one containing user IDs and the other containing the users' degrees of preference for given items:

[Screenshot: the two input files, users.txt and ItemID.txt]

In ItemID.txt, the first column indicates the user ID, the second the item ID and the third the degree of preference. Thus, ItemID.txt can be expressed in the more intuitive format of a matrix, where the rows indicate the users and the columns the item IDs. The values inside the matrix are the degrees of preference, as given in the third column of ItemID.txt.

[Figure: the preferences from ItemID.txt arranged as a user-by-item matrix]
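The conversion from the three-column file into such a matrix can be sketched in a few lines of PowerShell. The rows below are made-up sample values, not the contents of the actual ItemID.txt, and the comma-separated layout is an assumption about the file format.

```powershell
# Made-up sample rows in the assumed "userID,itemID,preference" layout
$rows = @(
    "1,101,5.0",
    "1,102,3.0",
    "2,101,2.0",
    "2,103,4.5"
)

# Nested hashtable: $matrix[user][item] = preference
$matrix = @{}
foreach ($row in $rows) {
    $user, $item, $pref = $row -split ','
    if (-not $matrix.ContainsKey($user)) { $matrix[$user] = @{} }
    $matrix[$user][$item] = [double]$pref
}

# All item IDs that occur, used as the matrix columns
$items = $rows | ForEach-Object { ($_ -split ',')[1] } | Sort-Object -Unique

# Print one line per user; "-" marks items without a known preference
foreach ($user in ($matrix.Keys | Sort-Object)) {
    $cells = foreach ($item in $items) {
        if ($matrix[$user].ContainsKey($item)) { $matrix[$user][$item] } else { "-" }
    }
    "{0}: {1}" -f $user, ($cells -join " ")
}
```

The empty cells are exactly the user-item pairs for which the recommender can later suggest a score.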

Here is the complete PowerShell script for running RecommenderJob as in the Big Data Support blog.

    ##########################################################################################
    # Mahout with HDInsight: RecommenderJob (Collaborative Filtering)
    #
    # Check out Microsoft's Big Data Support blog
    # https://blogs.msdn.com/b/bigdatasupport/archive/2014/02/19/mahout-with-hdinsight.aspx
    #
    # Source code in GitHub:
    # https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
    ##########################################################################################

    ## 0. Azure Account Details
    Add-AzureAccount
    $subName = "<AzureSubscriptionName>"
    Select-AzureSubscription $subName

    # Azure account details automatically set
    $subID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }

    ##########################################################################################
    ## 1. Input information

    ## a. storage account
    $storageAccount = "<StorageAccountName>"
    $containerName = "<StorageContainerName>"
    $location = "<DatacenterLocation>"    # e.g. North Europe

    # if storage account not created yet
    #New-AzureStorageAccount -StorageAccountName $storageAccount -Location $location
    #Set-AzureStorageAccount -StorageAccountName $storageAccount -GeoReplicationEnabled $false

    # Variables automatically set for you
    $storageKey = Get-AzureStorageKey $storageAccount | %{ $_.Primary }
    $storageContext = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
    $fullStorage = "${storageAccount}.blob.core.windows.net"

    # if container not created yet
    New-AzureStorageContainer -Name $containerName -Context $storageContext

    ## b. HDInsight Cluster
    $clusterName = "<HDInsightClusterName>"
    $clusterCreds = Get-Credential -Message "New admin account to be created for your HDInsight cluster"    # best: user name = admin
    $numNodes = 4

    ## c. Data
    # Data stored locally
    $localFolder = "C:\<localFilesPath>"
    $localItems = "$localFolder\ItemID.txt"
    $localUsers = "$localFolder\users.txt"
    $localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"

    # Data to be stored in Azure Blob Storage
    $blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
    $blobFolder = "testdata"
    $blobItems = "$blobFolder/ItemID.txt"
    $blobUsers = "$blobFolder/users.txt"

    ##########################################################################################
    ## 2. Upload file from local to Azure Blob Storage

    # Mahout jar
    Write-Host "Copying Mahout JAR into Blob Storage" -BackgroundColor Green
    Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName -Blob $blobMahoutJar -Context $storageContext

    # data for RecommenderJob
    Write-Host "Copying necessary data into Blob Storage" -BackgroundColor Green
    Set-AzureStorageBlobContent -File $localItems -Container $containerName -Blob $blobItems -Context $storageContext
    Set-AzureStorageBlobContent -File $localUsers -Container $containerName -Blob $blobUsers -Context $storageContext

    ##########################################################################################
    ## 3. Create HDInsight Cluster

    # Simple create
    New-AzureHDInsightCluster -Name $clusterName -Subscription $subID -Location $location `
        -DefaultStorageAccountName $storageAccount -DefaultStorageAccountKey $storageKey `
        -DefaultStorageContainerName $containerName -Credential $clusterCreds -ClusterSizeInNodes $numNodes `
        -Version 2.1

    ##########################################################################################
    ## 4. Mahout

    # Mahout Job defining the appropriate JAR file and the class name
    $mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
        -JarFile "wasb:///$blobMahoutJar" `
        -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"

    # Similarity class name.
    # Alternative similarity classes: loglikelihood, tanimoto coeff,
    # city block, cosine, pearson correlation, euclidean distance
    $mahoutJob.Arguments.Add("-s")
    $mahoutJob.Arguments.Add("SIMILARITY_COOCCURRENCE")

    # Input path to file with preference data
    $mahoutJob.Arguments.Add("-i")
    $mahoutJob.Arguments.Add("wasb:///$blobItems")

    # path to file containing user IDs for which recommendations will be computed
    $mahoutJob.Arguments.Add("--usersFile")
    $mahoutJob.Arguments.Add("wasb:///$blobUsers")

    # path for recommender output
    $mahoutJob.Arguments.Add("--output")
    $mahoutJob.Arguments.Add("wasb:///$blobFolder/output")

    # Starting job
    $mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutJob -Debug

    # Waiting for job completion
    Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600 -Debug

    # Getting error if any
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutJobProcessing.JobId -StandardError

    ##########################################################################################
    ## 5. Clean up, i.e. remove temp directory

    ## a. Remove temp directory
    $blobPrefix = "user/hdp/temp"
    $tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix $blobPrefix
    Write-Host "Removing temp directory"
    foreach ($item in $tempFiles)
    {
        $tmpFile = $item.Name
        Write-Host "Deleting $tmpFile"
        Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
    }

    ## b. Delete HDInsight cluster
    Remove-AzureHDInsightCluster -Name $clusterName

 

The output files can be seen in the Azure Blob Storage Explorer as usual:

[Screenshot: output files in the Azure Blob Storage Explorer]

The output file itself gives information on how likely which items could be of interest to which users, i.e. <user_id> [<item_id>: <degree-of-preference/interest>,…].

[Screenshot: contents of the output file]
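Once the output file has been downloaded, a line in the format described above can be split into one record per recommendation. This is a minimal sketch: ConvertFrom-RecommenderLine is a hypothetical helper name, the sample line is made up, and the whitespace between the user ID and the bracketed list is assumed to be a tab or spaces.

```powershell
# Hypothetical helper: turn "<user_id> [<item_id>:<score>,...]" into objects
function ConvertFrom-RecommenderLine {
    param([string]$Line)
    if ($Line -match '^\s*(\S+)\s+\[(.*)\]\s*$') {
        $user = $Matches[1]
        $body = $Matches[2]
        foreach ($pair in ($body -split ',' | Where-Object { $_.Trim() })) {
            $item, $score = $pair -split ':'
            [pscustomobject]@{
                User  = $user
                Item  = $item.Trim()
                Score = [double]$score.Trim()
            }
        }
    } else {
        throw "Unexpected line format: $Line"
    }
}

# Made-up sample line for illustration
$line = "3`t[103:4.5,102:3.2]"
$recs = @(ConvertFrom-RecommenderLine $line)
$recs | Format-Table User, Item, Score
```

Each resulting record corresponds to one cell that can be filled into the preference matrix.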

In such a way, we can insert the recommendations with their scores in the matrix from above:

[Figure: the preference matrix with the recommended items and their scores filled in]

 

7. Wrapping Up…

In this blog post, we went through two scenarios applying Mahout on HDInsight in PowerShell style: random forest and recommender. These scenarios are nicely wrapped around the usual suspects: uploading data, creating the HDInsight cluster and cleaning up afterwards.

Many thanks go to Alexei Khalyako and Bill Carroll for their support on Mahouting on HDInsight!