In this blog series, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use HDP 2.1 for Linux, a distribution by Hortonworks, which also provides HDP for the Windows platform. We install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing, and monitoring a Hadoop cluster.
While HDInsight is the Platform as a Service (PaaS) option for building and running a Hadoop cluster in Microsoft Azure, this article covers its Infrastructure as a Service (IaaS) counterpart. The IaaS option gives you more flexibility in the choice of Hadoop distribution, Hadoop components, and platform (e.g. Linux), among other things.
This blog series walks through the installation of Hortonworks’ Hadoop distribution, HDP 2.1 for Linux. Other commercial Hadoop distributions for Linux include Cloudera (CDH) and MapR. We use CentOS as the Linux platform. In the end, we will have a four-node Hadoop cluster: one master node (running the NameNode) and three worker nodes (running the DataNodes).
We base our step-by-step guide heavily on Benjamin’s great article “How to install Hadoop on Windows Azure Linux virtual machines” and Hortonworks’ documentation “Hortonworks Data Platform – Automated Install with Ambari”.
Before installing a Hadoop distribution, though, the required environment needs to be prepared. The next article therefore walks through the infrastructure setup for such a cluster on Microsoft Azure.