Connecting R to HDinsight through HIVE

With the powerful big data platform that Microsoft provides through Azure HDinsight, and with the wide range of data scientists and statisticians utilizes R, this Post is who to bring the best of both and connect R to HDinsight through Hive connector. so that you can analyze hive tables in R where they resides on the Azure HDinsight cluster so let’s see the steps

    1. Download Microsoft Hive ODBC driver from here
    2. Install the Microsoft Hive ODBC driver use either the x86 or the x64 ( take care of the version to use the same with R)
    3. Configure your DSN in the ODBC Data Sources
    1. go to Control Panel > Administrative Tools > ODBC Data Sources (64-bit)
    2. open System DSN click add
    3. choose Microsoft Hive ODBC Driver and click Finish
    4. enter the fields
    1. Data Source Name: the data source name we’ll use in R so name it anything i’ll call it now HiveOnAzure
    2. Description: write your desription
    3. Host: get it from your Azure Manage site [yourclustername].azurehdinsight.net
    4. port: leave it 443
    5. Database: leave it “default”
    6. Hive server type: use Hive Server 2
    7. Mechanism: Windows Azure HDinsight (it automatically configures the port and Database above)
    8. HTTP Path: leave it blank
    9. username: your username that you entered while creating the cluster
    10. password: your password that you entered while creating the cluster
    11. then test the connectivity you should receive a connection successful established message
    12.  
  1. Now after establishing the ODBC driver connectivity to Azure we’ll shift to R

  • Open RStudio (make sure to use the same x64 or x86 version as you’ve configured in the ODBC drivers)
    1. install the RODBC package
  > install.packages("RODBC")
  > library(RODBC)
      1. create the ODBC connection in R
  > myconn <- odbcConnect("HiveOnAzure",uid="[YOUR_USERNAME_HERE]",pwd="[YOU_PASSWORD_HERE]" )
      1. run your HiveQL Query and return the data into a data frame
  > alldata <- sqlFetch(myconn,"Select * from hivesampletable")
      1. inspect the retrieved data
  > head(alldata,10)

Now you’ve successfully connected your R to the Hive on HDisnight on Azure to pass your HiveQL Queries and start doing the analysis you want to create.