TNWiki Article Spotlight: Apache Hadoop on Windows

08/28/2012

Check out the full version of the article here:

Hadoop-based Services For Windows (en-US)

This blog post is a preview of the content in that article (you'll find three to five times as much information in the TNWiki article). The article, like many others about Hadoop, was written by Wesley McSwain, a SQL Server technical writer.

Hadoop Overview

Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or thousands of nodes.
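
To give a feel for that "simple programming model," here is a minimal in-memory sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop code: the function names (`run_mapreduce`, `word_count_map`, and so on) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, which a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Collapse each key's list of values into a single result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver that chains the three phases."""
    return reduce_phase(shuffle_phase(map_phase(records, map_fn)), reduce_fn)


def word_count_map(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    """Add up the 1s emitted for a word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["Hadoop stores data", "Hadoop processes data"]
    print(run_mapreduce(lines, word_count_map, word_count_reduce))
```

On a real cluster the map and reduce functions run on many nodes at once; the sketch only shows how the data flows between the phases.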

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster to enable reliable and extremely rapid computations.
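
The split-and-replicate idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how HDFS is implemented: the helper names are hypothetical, the block size is tiny so the output is readable, and replicas are placed round-robin, whereas real HDFS uses a more sophisticated, rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simplified round-robin placement)."""
    if not 0 < replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    print(blocks)
    # Hypothetical node names; a replication factor of 3 mirrors the usual HDFS default.
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3))
```

Because every block lives on several nodes, losing one node does not lose data, and different nodes can process different blocks at the same time.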

Getting Started with Hadoop-based Services for Windows

The links in this section provide information on deploying Apache Hadoop to Microsoft Windows platforms. All of these articles are on TechNet Wiki:

Link	Description
Getting Started with Hadoop-based Services for Windows	An overview of the Getting Started guides currently available.
Getting Started a Hadoop cluster on the Elastic Map Reduce Portal.	A walkthrough for provisioning and using a temporary Hadoop cluster on the Elastic Map Reduce (EMR) Portal.

Using Hadoop with other BI Technologies

This section contains information on using Hadoop with other BI technologies. All these articles are on TechNet Wiki:

Link	Description
How to Connect Excel to Hadoop on Azure via HiveODBC	Explains how to use Excel 2010 to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.
How to Connect Excel PowerPivot to Hive on Azure via HiveODBC	Explains how to use PowerPivot to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.

How To

This section contains a list of Hadoop-related how-to articles. All these articles are on TechNet Wiki:

Link	Description
Hadoop-based Services on Windows Azure How-Tos and FAQs	A collection of common How To topics along with FAQs.
How to Run a Job on a Provisioned Hadoop on Windows Azure Cluster	Information about creating MapReduce jobs on a cluster that has been provisioned on the Elastic Map Reduce (EMR) portal.
How To FTP Data to Hadoop on Windows Azure	A walkthrough for using FTPS to send file data to the cluster.
How to create a mapper and reducer in C# (Hadoop Streaming)	A walkthrough for creating a mapper and reducer in C# using Hadoop Streaming.
Use SQL Azure database as a Hive metastore	Information about using SQL Azure database as a Hive metastore.
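
If you're curious what a Hadoop Streaming mapper and reducer look like before you open the C# walkthrough, here is a rough Python sketch of the same idea. Hadoop Streaming works with any executable that reads lines from standard input and writes tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The `region,amount` input format and the function names are assumptions made up for this example; they do not come from the TNWiki article.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn 'region,amount' input lines into tab-separated 'region', 'amount' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        region, amount = line.split(",", 1)
        yield f"{region.strip()}\t{amount.strip()}"


def reducer(sorted_lines):
    """Sum the amounts per region; the input must already be sorted by key."""
    parsed = (line.strip().split("\t", 1) for line in sorted_lines if line.strip())
    for region, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(amount) for _, amount in group)
        yield f"{region}\t{total}"


def main(role):
    """Run as a streaming step: role is 'map' or 'reduce' (hypothetical argument names)."""
    step = mapper if role == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)


# Only act as a streaming script when a role is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

You can try the pipeline locally without a cluster by sorting the mapper output yourself, for example `list(reducer(sorted(mapper(["east,10", "west,5", "east,7"]))))`.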

Check out the article and add to it here (it's a lot bigger than the sections I featured in this blog post):

Hadoop-based Services For Windows (en-US)

Jump on in. The Wiki is warm!

- Ninja Ed
