Mahout for Dummies (1)

Contents

1 What is Mahout?
2 Step-by-Step: Mahout with HDInsight Interactive Style
3 Step-by-Step: Mahout with HDInsight PowerShell Style

What is Mahout?

[Figure: Apache Mahout logo. Source: Apache Mahout, 15.04.2014]

Mahout is one of many Hadoop-related projects at Apache. Its mission is to build a scalable machine learning and data mining library. In other words, Mahout provides data science tools for detecting meaningful patterns in data sets stored in HDFS (Hadoop Distributed File System). It is implemented on top of Hadoop and, as of version 0.9, is based on the well-known MapReduce paradigm.

Why the word Mahout? A mahout is an elephant rider, and the word has its origins in Hindi. Traditionally, a mahout starts early, as a boy, when he is assigned an elephant. The name fits because the mascot of Hadoop, on which Mahout runs, is an elephant.

Well, back to the machine learning library, which contains numerous algorithms. Mahout is built around three “C-pillars” of machine learning:

  • Collaborative filtering (aka recommendation),
  • Clustering, and
  • Classification.

 

Collaborative Filtering (aka Recommendation)

[Figure: “Customers Who Bought This Item Also Bought”. Source: PaulsHealthBlog.com, 11.04.2014]

You are looking at a product on Amazon, and there it is: a list of items recommended to you based on what other users also considered buying when looking at “your” product. Such recommender engines (also found in Netflix, Spotify, etc.) comprise all kinds of collaborative filtering algorithms. User behaviour is mined for patterns, which are then used to make recommendations to other users with similar likes and dislikes.
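To make the “also bought” idea concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation; it only counts how often items appear together in the same purchase history. The data set and the function name `also_bought` are made up for illustration.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of items bought.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"book", "lamp"},
    "dave": {"pen", "mug"},
}


def also_bought(purchases, item, top_n=3):
    """Rank other items by how often they were bought together with `item`."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            # Every other item in this basket co-occurs once with `item`.
            counts.update(basket - {item})
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


recommendations = also_bought(purchases, "book")
```

A real recommender would normalise these counts (popular items co-occur with everything) and distribute the counting across a cluster, which is the kind of work Mahout takes on.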

Clustering

[Figure: clustering. Source: Microsoft Clustering Algorithm, 11.04.2014]

This family of machine learning algorithms groups data items into natural clusters of items that share similar characteristics. For instance, you might cluster customers into groups according to demographic information without labelling these groups yet. In the same way, we naturally group most food into sweet or salty things.
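As a rough illustration of clustering, the following sketch runs a naive k-means on a handful of hypothetical customers, each described by (age, orders per year). The data, the function name `kmeans`, and the choice of the first k points as starting centroids are assumptions made for this example; this is not how Mahout implements it.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=10):
    """Naive k-means: assign points to the nearest centroid, then re-average."""
    centroids = list(points[:k])  # simplistic, deterministic initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (age, orders per year). No labels are given.
customers = [(22, 1), (25, 2), (24, 1), (61, 9), (65, 8), (63, 10)]
centroids, clusters = kmeans(customers, k=2)
```

Note that the algorithm only returns two unnamed groups. Deciding that one of them means “young, occasional buyers” is a labelling step that comes afterwards.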

Classification

Given a labelled dataset that we can learn from to build a model, we can then classify new, unknown data items. For instance, eye colour is influenced by more than one gene. By learning which gene combinations result in blue eyes, we can predict the eye colour of other people based on their genetic information.
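The eye colour example can be sketched as a tiny k-nearest-neighbours classifier. Everything here is hypothetical: the three binary “gene markers”, the training rows, and the function names are invented for illustration and have nothing to do with real genetics or with Mahout's own classifiers.

```python
from collections import Counter

# Hypothetical labelled data: three binary gene markers -> known eye colour.
training = [
    ((1, 1, 1), "blue"),
    ((1, 1, 0), "blue"),
    ((1, 0, 1), "blue"),
    ((0, 0, 0), "not blue"),
    ((0, 0, 1), "not blue"),
    ((0, 1, 0), "not blue"),
]


def hamming(a, b):
    """Number of positions at which two marker tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(training, sample, k=3):
    """Predict a label by majority vote among the k most similar training rows."""
    nearest = sorted(training, key=lambda row: hamming(row[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two people whose marker combinations do not appear in the training data.
prediction_a = classify(training, (1, 0, 0))
prediction_b = classify(training, (0, 1, 1))
```

The categories (“blue” and “not blue”) are fixed before any learning happens; the classifier only decides which of them a new item belongs to.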

[Figure: classification]

What is the difference between clustering and classification? In classification, you are given the categories in advance and assign your data to them, whereas clustering groups items purely by their natural similarity. In other words, in the eye colour example we know from the beginning what we are looking for: blue eyes or no blue eyes. In clustering, the groups still have to be labelled after they have been found.

How does Mahout work?

Mahout provides implementations of various machine learning algorithms; the full list can be found in the list of algorithms on the Mahout site. Each of them can be invoked from the command line. How this is done with HDInsight and PowerShell will be shown in the upcoming blog entries: Step-by-Step: Mahout with HDInsight Interactive Style and Step-by-Step: Mahout with HDInsight PowerShell Style.