Connecting R Studio to Hadoop via Hive

This is a step-by-step guide to connecting an R Studio client session running on a PC to a remote Hive server running on Hadoop.

Although Hive is getting a bit long in the tooth and is falling out of fashion, this is a very easy way to publish data from a Hadoop cluster to end-user analysts / data-scientists.

In this example I connected my PC client to a Virtual Machine running Hadoop and queried 2.2 million rows of data into an R data-frame in about 1 minute – so Hive might be slow compared to other approaches, but it is simple, accessible and “fast enough” for distributing data that has been pre-summarised in Hadoop.

Preparing the Hive Data

This is not covered in detail, but in order to access the data from Hive, it is assumed that there is a running Hadoop cluster with some data in it, a running Hive service and a Hive table defined that can be queried.

The process described in this blog entry was tested against a CDH 5.7 distribution of Hadoop with Hive 1.1 running (the versions shipped with the Oracle Big Data Lite VM used for testing).

Either create a Hive table directly and insert data (see the Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual), or create a Hive table over raw HDFS text data in a directory to make that HDFS data accessible from Hive, e.g.:

CREATE EXTERNAL TABLE ukcrimesum (
  date STRING
  , lsoa STRING
  , lsoa_name STRING
  , bicycle_theft INT
  , burglary INT
  , damage_or_arson INT
  , drugs INT
  , missing_data INT
  , other_crime INT
  , other_theft INT
  , public_order INT
  , robbery INT
  , shoplifting INT
  , social INT
  , theft_person INT
  , unclassified INT
  , vehicle_crime INT
  , violence_sex INT
  , weapons INT
  , total_classified INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/oracle/crimesumdir';

This table was created for an example working with the UK national crime data located here:  https://data.police.uk/data/

Verify that the Hive table can be queried locally on the Hadoop cluster before moving on to setting up remote connectivity from R, e.g.:

$> hive
hive> select count(*) from ukcrimesum;
...
...
Total MapReduce CPU Time Spent: 7 seconds 160 msec
OK
2264407
Time taken: 36.868 seconds, Fetched: 1 row(s)
hive>
...
...
hive> exit;

Connect the R Client to a Remote HiveServer2

There are three key preparation steps before we can make a remote connection to the Hive server from our R client:

  • Identify the TCP Port that Hive is running on
  • Identify the remote Host Name / IP Address to connect to the Hive Server
  • Copy Java JAR Files into place on the local PC client – and make sure these are available in the R session

Hive TCP Port:  By default HiveServer2 listens on port 10000 – configuring Hive is not my specialty, so if your server is not on port 10000, check the hive.server2.thrift.port setting or consult the manuals!

Remote Host-Name / IP Address:  This can be obtained from your Sys Admin – or, if that is you, check the output of

ifconfig -a

on the server for the IP address.
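Before going any further, it is worth a quick sanity-check that the host and port are actually reachable from the PC. This can be done from R itself with a base-R socket call – a minimal sketch, assuming the server is the “bigdatalite” host used later in this post:

# Quick reachability check of the HiveServer2 host/port from the client PC
# ("bigdatalite" and port 10000 are the example values used in this post)
sock <- try(socketConnection(host = "bigdatalite", port = 10000,
                             blocking = TRUE, open = "r+b", timeout = 5),
            silent = TRUE)
if (inherits(sock, "try-error")) {
  message("Cannot reach host:port - check hostname, firewall and Hive service")
} else {
  message("Port is reachable")
  close(sock)
}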


Copying Java JAR Files for R Client Connectivity

These need to be copied from the Hadoop Server to the PC Client.

This can be quite fiddly to get right.  The Hive JDBC driver and its dependent JAR files required by R are:

/usr/lib/hive/lib/hive-jdbc.jar
/usr/lib/hive/lib/libthrift-0.9.2.jar
/usr/lib/hive/lib/hive-service.jar
/usr/lib/hive/lib/httpclient-4.2.5.jar
/usr/lib/hive/lib/httpcore-4.2.5.jar
/usr/lib/hive/lib/hive-jdbc-standalone.jar
/usr/lib/hadoop/client/hadoop-common.jar

The paths specified are the locations where the files can be found on the CDH 5.7 Hadoop distribution that I used – the same distribution that ships with the Oracle Big Data Lite VM I tested this against.

Use Linux tar to bundle the files into tar archives on the server so they can be copied to your PC client.  NOTE:

  1. It is easier to install on a PC if you strip the relative paths from the tar archive, so the shell script below cds into each library directory before tar-ing up the files.
  2. Some of the files listed are symbolic links; you need to archive the real file at the end of each link, so use the -h option on tar to do this.

Below is a shell script to create the tar archives on the Hadoop server with all the necessary JAR files.  For later versions of Hadoop it may be necessary to update the version numbers appended to the filenames:

Preparation – create two files, hive_jars.lst and hadoop_jars.lst, with contents as follows

hive_jars.lst

hive-jdbc.jar
libthrift-0.9.2.jar
hive-service.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
hive-jdbc-standalone.jar

hadoop_jars.lst

hadoop-common.jar

Bash Shell Script to Collect the JAR Files

# Remember the starting directory so the archives are created there
CWD=$(pwd)

# Archive the Hive client JARs (-h follows symbolic links to the real files)
cd /usr/lib/hive/lib
tar -cvhf $CWD/hive_jars.tar -T $CWD/hive_jars.lst

# Archive the Hadoop client JARs
cd /usr/lib/hadoop/client/
tar -cvhf $CWD/hadoop_jars.tar -T $CWD/hadoop_jars.lst

cd $CWD

Save the shell script above and run it on the Hadoop server – it should create two tar archives: hadoop_jars.tar and hive_jars.tar

Use scp to copy these tar files to your PC and then unpack them (using 7-zip or equivalent, or from within R as sketched below) in a suitable directory.  In this example we are using a Windows PC and the files are unpacked into the directory D:\R\java.
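If you would rather avoid a separate unpacking tool, the archives can also be unpacked from within R using the base utils package – a minimal sketch, assuming the tar files have been copied into D:/R/java:

# Unpack both archives into the JAR directory using base R (no 7-zip needed)
setwd("D:/R/java")
untar("hive_jars.tar", exdir = ".")
untar("hadoop_jars.tar", exdir = ".")
list.files(pattern = "\\.jar$")    # confirm all seven JAR files are present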

Once the JAR files are in place, we can start RStudio (or the regular R client) and connect to the remote Hive data source.

Query a Remote Hive Table in R

The following R packages need to be installed and loaded:

library("DBI")
library("rJava")  
library("RJDBC")

NOTE: It may be necessary to add the directory location of the jvm.dll to the Windows PATH environment variable setting for the rJava library to operate correctly.
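One way to do this for the current R session only is to prepend the JVM directory to the PATH before rJava is loaded – a minimal sketch, where the jvm.dll location shown is an assumption that will vary with your Java installation:

# Hypothetical jvm.dll location - check your own Java installation
jvm_dir <- "C:/Program Files/Java/jre1.8.0_101/bin/server"
Sys.setenv(PATH = paste(jvm_dir, Sys.getenv("PATH"), sep = ";"))
library(rJava)    # should now be able to locate jvm.dll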

Once these packages have been installed and loaded, set the classpath in the R session to point to the Hive JAR files installed previously:

 cp = c("D:/R/java/hive-jdbc.jar"
, "D:/R/java/hadoop-common.jar"
, "D:/R/java/libthrift-0.9.2.jar"
, "D:/R/java/hive-service.jar"
, "D:/R/java/httpclient-4.2.5.jar"
, "D:/R/java/httpcore-4.2.5.jar"
, "D:/R/java/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
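Before attempting a connection, it can be worth confirming that the driver class is actually visible on the classpath just set – a quick check using rJava:

# Should return a Java class reference; a ClassNotFoundException here
# means one of the JAR files is missing from the classpath
.jfindClass("org.apache.hive.jdbc.HiveDriver")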

If there are Class-Path errors, use the following rJava call to debug:

.jclassLoader()$setDebug(1L)

Next, connect to the remote Hive service from the R session:

drv = JDBC("org.apache.hive.jdbc.HiveDriver"
    , "hive-jdbc.jar" )
conn = dbConnect(drv
    , "jdbc:hive2://bigdatalite:10000/default"
    , "oracle", "")

In the example above we are connecting to a remote host called “bigdatalite” – this can be replaced by any appropriate hostname or IP address.  The example assumes the Hive service is listening on the default port 10000, and the Hive user is specified as “oracle”.
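If you prefer to build the connection URL up from its parts, the same call can be parameterised – a small sketch, where the host, port and database values are placeholders to be replaced with your own:

# Build the HiveServer2 JDBC URL from its component parts
host <- "bigdatalite"    # or an IP address such as "192.168.1.50"
port <- 10000L           # default HiveServer2 port
db   <- "default"
url  <- sprintf("jdbc:hive2://%s:%d/%s", host, port, db)
conn <- dbConnect(drv, url, "oracle", "")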

Finally, we can access table structures and data in Hive in our local R session:

List available Hive Databases

> databases <- dbGetQuery(conn, "show databases")
> databases
    database_name
1         default
2       moviedemo
3       moviework

List Tables

> tables <- dbGetQuery(conn, "show tables")
> tables
                tab_name
1             crime_demo
2                   cust
3    media_demo_customer
4    media_demo_movielog
5                  movie
6           movie_rating
7          movie_updates
8             movie_view
9      movieapp_log_avro
10     movieapp_log_json
11 movieapp_log_odistage
12    movieapp_log_stage
13              movielog
14         session_stats
15            ukcrimesum
16            user_movie
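The same list should also be obtainable through the standard DBI generic, which RJDBC implements via the JDBC metadata calls:

# DBI generic equivalent of "show tables"
dbListTables(conn)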

Describe a Table Structure

> ukcnames <- dbGetQuery(conn, "describe ukcrimesum")
> ukcnames
   col_name         data_type comment
1  date             string
2  lsoa             string
3  lsoa_name        string
4  bicycle_theft    int
5  burglary         int
6  damage_or_arson  int
7  drugs            int
8  missing_data     int
9  other_crime      int
10 other_theft      int
11 public_order     int
12 robbery          int
13 shoplifting      int
14 social           int
15 theft_person     int
16 unclassified     int
17 vehicle_crime    int
18 violence_sex     int
19 weapons          int
20 total_classified int

Query Hive Data into an R Data-Frame

> Sys.time()
[1] "2016-10-13 15:04:47 BST"
> ukcrimesum <- dbGetQuery(conn, "select * from ukcrimesum")
> Sys.time()
[1] "2016-10-13 15:05:35 BST"
> dim(ukcrimesum)
[1] 2264407      20

So, here we can see 2.2 million rows of data being queried into an R data-frame in under a minute.  The query is fast because no Hive aggregation is being performed – it is just a straight pull of all the data.  From here we can use a where-clause to pull out just the subset of data we want to work with, as sketched below.
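A minimal sketch of that, plus a clean disconnect when finished – the lsoa_name filter value below is hypothetical, so substitute a value present in your own data:

# Pull a subset of rows with a where-clause instead of the whole table
# (the filter value is hypothetical - use a value from your own data)
subset_df <- dbGetQuery(conn,
    "select * from ukcrimesum where lsoa_name = 'Westminster 001A'")
dim(subset_df)

# Close the connection when finished
dbDisconnect(conn)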

16 thoughts on “Connecting R Studio to Hadoop via Hive”

  1. Thank you so much for this helpful post. For some reason, it did not work for me when I used
    drv <- JDBC("org.apache.hive.jdbc.HiveDriver", but when I used
    drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver" (with hadoop) it did work. However, I am still unable to execute the "conn" line of code, and am faced with an error that the URL is incorrect, even though I am certain it is correct. Do you have any idea why this might be the case? Thanks in advance.


  2. Hi – you are very welcome! It looks like different versions of Hive have a different hierarchy in the package naming convention. There is a thread on this topic here: https://stackoverflow.com/questions/25157273/connect-from-java-to-hive-using-jdbc
    – I wasn’t aware of this issue, so thanks for pointing it out.
    I have
    # hadoop version
    Hadoop 2.6.0-cdh5.7.0
    # hive --version
    Hive 1.1.0-cdh5.7.0
    and
    drv = JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc.jar")
    works for me. What version of Hadoop and Hive are you on? If you let me know I will update the blog post with the details then.


  3. also, have you done the basics and tried ping-ing the host name from your client? Also, if you do
    netstat -tulpn | grep 10000
    on the hadoop node you are pointing at, can you see the Hive process listening on your target port?


  4. hi site owner – can you please give step-by-step instructions for connecting Hive with KNIME? I'm not able to find a proper document for that. I'm using hive-0.13.0, Apache hadoop-2.6.0 and the latest KNIME Analytics application.

    It would be a great help if you can do this.


  5. Please clear this issue immediately. My error is:
    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
    java.lang.NoClassDefFoundError: org/apache/hive/service/cli/thrift/TCLIService$Iface


    1. Looks like you have a missing JAR file or an incorrect class-path setting. Check you have all the correct files copied into place with the right permissions and that your path is referencing them. Then use the debug option mentioned in my post to diagnose further. HTH.


      1. Please tell me the correct path where I have to copy my jar files – in Java or rJava? And are the class-path settings in the environment variables or in RStudio? I am not sure of these things. Please clarify me… Thanks in advance…


  6. JAR file, can get 'org/apache/hive/service/cli/thrift/TCLIService$Iface'? NO
    >> ClassNotFoundException

    I can’t understand this error. I am facing this issue for the past two days. please clarify me.


  7. I got a connection to hive from r, but I can't view the table because it gives the following error. Please tell me about the issue as soon as possible.

    Error in .jcall(rp, "I", "fetch", stride, block) :
    org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem


  8. When I try to write a data frame to a hive table, I get the following error:

    Error in .local(conn, statement, ...) :
    execute JDBC update query failed in dbSendUpdate (Error while compiling statement: FAILED: ParseException line 1:53 cannot recognize input near '.' 'year_month' 'VARCHAR' in column type)

    R script
    > data dbWriteTable(jdbc_con, "year_month_master_tmp", data, overwrite=TRUE)

    any idea?


    1. Sounds like you need to check the setup of your Java class-path OR the necessary Java jar files aren’t in place on your PC. Take a look at the section in the article that discusses this and maybe try the ".jclassLoader()$setDebug(1L)" option to debug further if necessary.


  9. Hi I am getting this error while trying to connect –

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
    java.lang.IllegalArgumentException: java.lang.ClassCastException: java.util.Vector cannot be cast to java.util.Hashtable

