Connecting R Studio to Hadoop via Hive

This is a step-by-step guide to connecting an R Studio client session running on a PC to a remote Hive server running on Hadoop.

Although Hive is getting a bit long in the tooth and is falling out of fashion, this is a very easy way to publish data from a Hadoop cluster to end-user analysts / data-scientists.

In this example I connected my PC client to a Virtual Machine running Hadoop and queried 2.2 million rows of data into an R data-frame in about 1 minute – so Hive might be slow compared to other approaches, but it is simple, accessible and “fast enough” for distributing data that has been pre-summarised in Hadoop.

Preparing the Hive Data

This is not covered in detail, but in order to access the data from Hive, it is assumed that there is a running Hadoop cluster with some data in it, a running Hive service and a Hive table defined that can be queried.

The process described in this blog entry was tested against a CDH 5.7 distribution of Hadoop with Hive 1.1 running (the versions shipped with the Oracle Big Data Lite VM used for testing).

Either create a Hive table directly and insert data (see the Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual), or create a Hive table over raw HDFS text data in a directory to make that HDFS data accessible from Hive, e.g.:

CREATE EXTERNAL TABLE ukcrimesum (
  date STRING
  , lsoa STRING
  , lsoa_name STRING
  , bicycle_theft INT
  , burglary INT
  , damage_or_arson INT
  , drugs INT
  , missing_data INT
  , other_crime INT
  , other_theft INT
  , public_order INT
  , robbery INT
  , shoplifting INT
  , social INT
  , theft_person INT
  , unclassified INT
  , vehicle_crime INT
  , violence_sex INT
  , weapons INT
  , total_classified INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/oracle/crimesumdir';

This table was created for an example working with the UK national crime data located here:  https://data.police.uk/data/

Verify that the Hive table can be queried locally on the Hadoop cluster before moving on to setting up remote connectivity from R, e.g.:

$> hive
hive> select count(*) from ukcrimesum;
...
...
Total MapReduce CPU Time Spent: 7 seconds 160 msec
OK
2264407
Time taken: 36.868 seconds, Fetched: 1 row(s)
hive>
...
...
hive> exit;

Connect the R Client to a Remote HiveServer2

There are three key preparation steps before we can make a remote connection to the Hive server from our R client:

  • Identify the TCP Port that Hive is running on
  • Identify the remote Host Name / IP Address to connect to the Hive Server
  • Copy Java JAR Files into place on the local PC client – and make sure these are available in the R session

Hive TCP Port:  By default HiveServer2 listens on port 10000 – configuring Hive is not my specialty, so if your server is not on port 10000, check the hive.server2.thrift.port setting or consult the manuals!

Remote Host-Name / IP Address:  This can be obtained from your Sys Admin – or, if that is you, check the output of

ifconfig -a

on the server for the IP address.
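Before going any further, it is worth a quick sanity-check that the host and port are actually reachable from the PC. This can be done from R itself with a base-R socket call – a minimal sketch, assuming the server is the “bigdatalite” host used later in this post:

# Quick reachability check of the HiveServer2 host/port from the client PC
# ("bigdatalite" and port 10000 are the example values used in this post)
sock <- try(socketConnection(host = "bigdatalite", port = 10000,
                             blocking = TRUE, open = "r+b", timeout = 5),
            silent = TRUE)
if (inherits(sock, "try-error")) {
  message("Cannot reach host:port - check hostname, firewall and Hive service")
} else {
  message("Port is reachable")
  close(sock)
}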


Copying Java JAR Files for R Client Connectivity

These need to be copied from the Hadoop Server to the PC Client.

This can be quite fiddly to get right.  The Hive JDBC driver and its dependent JAR files required by R are:

/usr/lib/hive/lib/hive-jdbc.jar
/usr/lib/hive/lib/libthrift-0.9.2.jar
/usr/lib/hive/lib/hive-service.jar
/usr/lib/hive/lib/httpclient-4.2.5.jar
/usr/lib/hive/lib/httpcore-4.2.5.jar
/usr/lib/hive/lib/hive-jdbc-standalone.jar
/usr/lib/hadoop/client/hadoop-common.jar

The paths specified are the locations where the files can be found on the CDH 5.7 Hadoop distribution that I used – the same distribution that ships with the Oracle Big Data Lite VM I tested this against.

Use Linux tar to bundle the files into tar archives on the server so they can be copied to your PC client.  NOTE:

  1. It is easier to install on a PC if you strip the relative paths from the tar archive, so the shell script below cds into each library directory before tar-ing up the files.
  2. Some of the files listed are symbolic links; you need to archive the real file at the end of each link, so use the -h option on tar to do this.

Below is a shell script to create the tar archives on the Hadoop server with all the necessary JAR files.  For later versions of Hadoop it may be necessary to update the version numbers appended to the filenames:

Preparation – create two files, hive_jars.lst and hadoop_jars.lst, with contents as follows

hive_jars.lst

hive-jdbc.jar
libthrift-0.9.2.jar
hive-service.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
hive-jdbc-standalone.jar

hadoop_jars.lst

hadoop-common.jar

Bash Shell Script to Collect the JAR Files

# Remember the starting directory so the archives are created there
CWD=$(pwd)

# Archive the Hive client JARs (-h follows symbolic links to the real files)
cd /usr/lib/hive/lib
tar -cvhf $CWD/hive_jars.tar -T $CWD/hive_jars.lst

# Archive the Hadoop client JARs
cd /usr/lib/hadoop/client/
tar -cvhf $CWD/hadoop_jars.tar -T $CWD/hadoop_jars.lst

cd $CWD

Save the shell script above and run it on the Hadoop server – it should create two tar archives: hadoop_jars.tar and hive_jars.tar

Use scp to copy these tar files to your PC and then unpack them (using 7-zip or equivalent, or from within R as sketched below) in a suitable directory.  In this example we are using a Windows PC and the files are unpacked into the directory D:\R\java.
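If you would rather avoid a separate unpacking tool, the archives can also be unpacked from within R using the base utils package – a minimal sketch, assuming the tar files have been copied into D:/R/java:

# Unpack both archives into the JAR directory using base R (no 7-zip needed)
setwd("D:/R/java")
untar("hive_jars.tar", exdir = ".")
untar("hadoop_jars.tar", exdir = ".")
list.files(pattern = "\\.jar$")    # confirm all seven JAR files are present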

Once the JAR files are in place, we can start RStudio (or the regular R client) and connect to the remote Hive data source.

Query a Remote Hive Table in R

The following R packages need to be installed and loaded:

library("DBI")
library("rJava")  
library("RJDBC")

NOTE: It may be necessary to add the directory location of the jvm.dll to the Windows PATH environment variable setting for the rJava library to operate correctly.
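One way to do this for the current R session only is to prepend the JVM directory to the PATH before rJava is loaded – a minimal sketch, where the jvm.dll location shown is an assumption that will vary with your Java installation:

# Hypothetical jvm.dll location - check your own Java installation
jvm_dir <- "C:/Program Files/Java/jre1.8.0_101/bin/server"
Sys.setenv(PATH = paste(jvm_dir, Sys.getenv("PATH"), sep = ";"))
library(rJava)    # should now be able to locate jvm.dll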

Once these packages have been installed and loaded, set the classpath in the R session to point to the Hive JAR files installed previously:

 cp = c("D:/R/java/hive-jdbc.jar"
, "D:/R/java/hadoop-common.jar"
, "D:/R/java/libthrift-0.9.2.jar"
, "D:/R/java/hive-service.jar"
, "D:/R/java/httpclient-4.2.5.jar"
, "D:/R/java/httpcore-4.2.5.jar"
, "D:/R/java/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
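Before attempting a connection, it can be worth confirming that the driver class is actually visible on the classpath just set – a quick check using rJava:

# Should return a Java class reference; a ClassNotFoundException here
# means one of the JAR files is missing from the classpath
.jfindClass("org.apache.hive.jdbc.HiveDriver")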

If there are Class-Path errors, use the following rJava call to debug:

.jclassLoader()$setDebug(1L)

Next, connect to the remote Hive service from the R session:

drv = JDBC("org.apache.hive.jdbc.HiveDriver"
    , "hive-jdbc.jar" )
conn = dbConnect(drv
    , "jdbc:hive2://bigdatalite:10000/default"
    , "oracle", "")

In the example above we are connecting to a remote host called “bigdatalite” – this can be replaced by any appropriate hostname or IP address.  The example assumes the Hive service is listening on the default port 10000, and the Hive user is specified as “oracle”.
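If you prefer to build the connection URL up from its parts, the same call can be parameterised – a small sketch, where the host, port and database values are placeholders to be replaced with your own:

# Build the HiveServer2 JDBC URL from its component parts
host <- "bigdatalite"    # or an IP address such as "192.168.1.50"
port <- 10000L           # default HiveServer2 port
db   <- "default"
url  <- sprintf("jdbc:hive2://%s:%d/%s", host, port, db)
conn <- dbConnect(drv, url, "oracle", "")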

Finally, we can access table structures and data in Hive in our local R session:

List available Hive Databases

> databases <- dbGetQuery(conn, "show databases")
> databases
    database_name
1         default
2       moviedemo
3       moviework

List Tables

> tables <- dbGetQuery(conn, "show tables")
> tables
                tab_name
1             crime_demo
2                   cust
3    media_demo_customer
4    media_demo_movielog
5                  movie
6           movie_rating
7          movie_updates
8             movie_view
9      movieapp_log_avro
10     movieapp_log_json
11 movieapp_log_odistage
12    movieapp_log_stage
13              movielog
14         session_stats
15            ukcrimesum
16            user_movie
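The same list should also be obtainable through the standard DBI generic, which RJDBC implements via the JDBC metadata calls:

# DBI generic equivalent of "show tables"
dbListTables(conn)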

Describe a Table Structure

> ukcnames <- dbGetQuery(conn, "describe ukcrimesum")
> ukcnames
   col_name         data_type comment
1  date             string
2  lsoa             string
3  lsoa_name        string
4  bicycle_theft    int
5  burglary         int
6  damage_or_arson  int
7  drugs            int
8  missing_data     int
9  other_crime      int
10 other_theft      int
11 public_order     int
12 robbery          int
13 shoplifting      int
14 social           int
15 theft_person     int
16 unclassified     int
17 vehicle_crime    int
18 violence_sex     int
19 weapons          int
20 total_classified int

Query Hive Data into an R Data-Frame

> Sys.time()
[1] "2016-10-13 15:04:47 BST"
> ukcrimesum <- dbGetQuery(conn, "select * from ukcrimesum")
> Sys.time()
[1] "2016-10-13 15:05:35 BST"
> dim(ukcrimesum)
[1] 2264407      20

So, here we can see 2.2 million rows of data being queried into an R data-frame in under a minute.  The query is fast because no Hive aggregation is being performed – it is just a straight pull of all the data.  From here we can use a where-clause to pull out just the subset of data we want to work with, as sketched below.
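A minimal sketch of that, plus a clean disconnect when finished – the lsoa_name filter value below is hypothetical, so substitute a value present in your own data:

# Pull a subset of rows with a where-clause instead of the whole table
# (the filter value is hypothetical - use a value from your own data)
subset_df <- dbGetQuery(conn,
    "select * from ukcrimesum where lsoa_name = 'Westminster 001A'")
dim(subset_df)

# Close the connection when finished
dbDisconnect(conn)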

16 thoughts on “Connecting R Studio to Hadoop via Hive”

  1. Thank you so much for this helpful post. For some reason, it did not work for me when I used
    drv <- JDBC("org.apache.hive.jdbc.HiveDriver", but when I used
    drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver" (with hadoop) it did work. However, I am still unable to execute the "conn" line of code, and am faced with an error that the URL is incorrect, even though I am certain it is correct. Do you have any idea why this might be the case? Thanks in advance.


  2. Hi – you are very welcome! It looks like different versions of Hive have a different hierarchy in the package naming convention. There is a thread on this topic here: https://stackoverflow.com/questions/25157273/connect-from-java-to-hive-using-jdbc
    – I wasn’t aware of this issue, so thanks for pointing it out.
    I have
    # hadoop version
    Hadoop 2.6.0-cdh5.7.0
    # hive --version
    Hive 1.1.0-cdh5.7.0
    and
    drv = JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc.jar")
    works for me. What version of Hadoop and Hive are you on? If you let me know I will update the blog post with the details then.


  3. also, have you done the basics and tried ping-ing the host name from your client? Also, if you do
    netstat -tulpn | grep 10000
    on the hadoop node you are pointing at, can you see the Hive process listening on your target port?


  4. hi site owner – can you please give step-by-step instructions for connecting Hive with KNIME? I'm not able to find a proper document for that. I'm using hive-0.13.0, Apache hadoop-2.6.0 and the latest KNIME Analytics application.

    It would be a great help if you can do this.


  5. Please clear this issue immediately. My error is:
    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
    java.lang.NoClassDefFoundError: org/apache/hive/service/cli/thrift/TCLIService$Iface


    1. Looks like you have a missing JAR file or an incorrect class-path setting. Check you have all the correct files copied into place with the right permissions and that your path is referencing them. Then use the debug option mentioned in my post to diagnose further. HTH.


      1. Please tell me the correct path where I have to copy my jar files – in Java or rJava? And are the class-path settings in the environment variables or in RStudio? I am not sure of these things. Please clarify me… Thanks in advance…


  6. JAR file, can get 'org/apache/hive/service/cli/thrift/TCLIService$Iface'? NO
    >> ClassNotFoundException

    I can’t understand this error. I am facing this issue for the past two days. please clarify me.


  7. I got a connection to hive from r, but I can't view the table because it gives the following error. Please tell me about the issue as soon as possible.

    Error in .jcall(rp, "I", "fetch", stride, block) :
    org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem


  8. When I try to write a data frame to a hive table, I get the following error:

    Error in .local(conn, statement, ...) :
    execute JDBC update query failed in dbSendUpdate (Error while compiling statement: FAILED: ParseException line 1:53 cannot recognize input near '.' 'year_month' 'VARCHAR' in column type)

    R script
    > data dbWriteTable(jdbc_con, "year_month_master_tmp", data, overwrite=TRUE)

    any idea?


    1. Sounds like you need to check the setup of your Java class-path OR the necessary Java jar files aren’t in place on your PC. Take a look at the section in the article that discusses this and maybe try the ".jclassLoader()$setDebug(1L)" option to debug further if necessary.


  9. Hi I am getting this error while trying to connect –

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
    java.lang.IllegalArgumentException: java.lang.ClassCastException: java.util.Vector cannot be cast to java.util.Hashtable

