This is a step-by-step guide to connecting an RStudio client session running on a PC to a remote Hive server running on Hadoop.
Although Hive is getting a bit long in the tooth and is falling out of fashion, this is a very easy way to publish data from a Hadoop cluster to end-user analysts / data-scientists.
In this example I connected my PC client to a Virtual Machine running Hadoop and queried 2.2 million rows of data into an R data-frame in about 1 minute – so Hive might be slow compared to other approaches, but it is simple, accessible and “fast enough” for distributing data that has been pre-summarised in Hadoop.
Preparing the Hive Data
This is not covered in detail, but in order to access the data from Hive, it is assumed that there is a running Hadoop cluster with some data in it, a running Hive service and a Hive table defined that can be queried.
The process described in this blog entry was tested against a CDH 4.5 distribution of Hadoop with Hive 1.1 running.
Either create a Hive table directly and insert data (see the manuals: https://cwiki.apache.org/confluence/display/Hive/LanguageManual) or create a Hive table over raw HDFS text data in a directory to make the HDFS data accessible from Hive – EG:
CREATE EXTERNAL TABLE ukcrimesum (
  date             STRING,
  lsoa             STRING,
  lsoa_name        STRING,
  bicycle_theft    INT,
  burglary         INT,
  damage_or_arson  INT,
  drugs            INT,
  missing_data     INT,
  other_crime      INT,
  other_theft      INT,
  public_order     INT,
  robbery          INT,
  shoplifting      INT,
  social           INT,
  theft_person     INT,
  unclassified     INT,
  vehicle_crime    INT,
  violence_sex     INT,
  weapons          INT,
  total_classified INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/oracle/crimesumdir';
This table was created for an example working with the UK national crime data located here: https://data.police.uk/data/
Verify that the Hive table can be queried locally on the Hadoop cluster before moving on to getting remote connectivity via R working.
EG:
$> hive
hive> select count(*) from ukcrimesum;
...
Total MapReduce CPU Time Spent: 7 seconds 160 msec
OK
2264407
Time taken: 36.868 seconds, Fetched: 1 row(s)
hive> exit;
Connect R Client to Remote Hive 2 Server
There are three key preparation steps before we can make a remote connection to the Hive server from our R client:
- Identify the TCP Port that Hive is running on
- Identify the remote Host Name / IP Address to connect to the Hive Server
- Copy Java JAR Files into place on the local PC client – and make sure these are available in the R session
Hive TCP Port: By default this is port 10000. Configuring Hive is beyond the scope of this post – if your installation is not listening on 10000, consult the manuals!
Remote Host-Name / IP Address: Can be obtained from your Sys Admin, or if that is you – check the output of
ifconfig -a
on the server for the IP address
Copying Java Class Files for R Client Connectivity
These need to be copied from the Hadoop Server to the PC Client.
This can be quite fiddly to get right. The Hive JDBC drivers required by R are:
/usr/lib/hive/lib/hive-jdbc.jar
/usr/lib/hive/lib/libthrift-0.9.2.jar
/usr/lib/hive/lib/hive-service.jar
/usr/lib/hive/lib/httpclient-4.2.5.jar
/usr/lib/hive/lib/httpcore-4.2.5.jar
/usr/lib/hive/lib/hive-jdbc-standalone.jar
/usr/lib/hadoop/client/hadoop-common.jar
The paths specified are where these files can be found on the CDH 4.5 Hadoop distribution that I used – the same distribution as the Oracle Big Data Lite VM this was tested against.
Use Linux tar to copy the files into a tar archive on the server so they can be copied to your PC client. NOTE:
- it is easier to install on a PC if you strip the relative paths from the tar archive, so the script below cds into the source directory before tar-ing up the files
- some of the files listed are symbolic links; you need to archive the real file at the end of the link, so use tar's -h option to dereference them
Below is a shell script to create a tar archive on the Hadoop server with all the necessary jar files. For later versions of Hadoop it may be necessary to update the version numbers appended to the file-names:
Preparation – create two files, hive_jars.lst and hadoop_jars.lst, with the following contents:
hive_jars.lst
hive-jdbc.jar
libthrift-0.9.2.jar
hive-service.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
hive-jdbc-standalone.jar
hadoop_jars.lst
hadoop-common.jar
BASH Shell Script to Collect JAR files
CWD=$(pwd)
cd /usr/lib/hive/lib
tar -cvhf $CWD/hive_jars.tar -T $CWD/hive_jars.lst
cd /usr/lib/hadoop/client/
tar -cvhf $CWD/hadoop_jars.tar -T $CWD/hadoop_jars.lst
cd $CWD
Save the shell script above and run it on the Hadoop server – it should create two tar archives: hadoop_jars.tar and hive_jars.tar
Use scp to copy these tar files to your PC and then unpack the tar files (using 7-zip or equivalent) in a suitable directory. In this example we are using a Windows PC and they are copied to the directory D:\R\java
Once the JAR files are in place, we can start RStudio (or the regular R client) and connect to the remote Hive data-source.
Query a Remote Hive Table in R
The following R packages need to be installed and loaded:
library("DBI")
library("rJava")
library("RJDBC")
NOTE: It may be necessary to add the directory location of the jvm.dll to the Windows PATH environment variable setting for the rJava library to operate correctly.
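This PATH check can also be handled from inside R before rJava is loaded – a minimal sketch, where the jvm.dll directory shown is just an example location (substitute the path to your own Java installation):

```r
# Prepend the JVM directory to the PATH before rJava is loaded, so that
# jvm.dll can be found. The directory below is an example location --
# substitute the path to your own Java installation.
jvm_dir <- "C:/Program Files/Java/jre8/bin/server"
if (!grepl(jvm_dir, Sys.getenv("PATH"), fixed = TRUE)) {
  Sys.setenv(PATH = paste(jvm_dir, Sys.getenv("PATH"), sep = ";"))
}
# library("rJava")  # load rJava only after the PATH has been updated
```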
Once these packages have been installed and loaded, set the classpath in the R session to point to the Hive JAR files installed previously:
cp = c("D:/R/java/hive-jdbc.jar",
       "D:/R/java/hadoop-common.jar",
       "D:/R/java/libthrift-0.9.2.jar",
       "D:/R/java/hive-service.jar",
       "D:/R/java/httpclient-4.2.5.jar",
       "D:/R/java/httpcore-4.2.5.jar",
       "D:/R/java/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
If there are Class-Path errors, use the following rJava call to debug:
.jclassLoader()$setDebug(1L)
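Another quick sanity check is to print the classpath the embedded JVM actually ended up with – a minimal sketch, assuming .jinit() has already been called as above; any jar missing from this list will later surface as a ClassNotFoundException:

```r
# List the jars on the running JVM's classpath -- each entry passed to
# .jinit() should appear here; anything missing needs to be re-copied
# into place and the R session restarted.
print(.jclassPath())
```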
Next, connect to the remote Hive service from the R session:
drv = JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc.jar")
conn = dbConnect(drv, "jdbc:hive2://bigdatalite:10000/default", "oracle", "")
In the example above we are connecting to a remote host called “bigdatalite” – this can be replaced by any appropriate hostname or an IP address. The example assumes the Hive service is listening on default port 10000. The Hive user is specified as “oracle”.
Finally, we can access table structures and data in Hive in our local R session:
List available Hive Databases
> databases <- dbGetQuery(conn, "show databases")
> databases
  database_name
1       default
2     moviedemo
3     moviework
List Tables
> tables <- dbGetQuery(conn, "show tables")
> tables
                tab_name
1             crime_demo
2                   cust
3    media_demo_customer
4    media_demo_movielog
5                  movie
6           movie_rating
7          movie_updates
8             movie_view
9      movieapp_log_avro
10     movieapp_log_json
11 movieapp_log_odistage
12    movieapp_log_stage
13              movielog
14         session_stats
15            ukcrimesum
16            user_movie
Describe a Table Structure
> ukcnames <- dbGetQuery(conn, "describe ukcrimesum")
> ukcnames
           col_name data_type comment
1              date    string
2              lsoa    string
3         lsoa_name    string
4     bicycle_theft       int
5          burglary       int
6   damage_or_arson       int
7             drugs       int
8      missing_data       int
9       other_crime       int
10      other_theft       int
11     public_order       int
12          robbery       int
13      shoplifting       int
14           social       int
15     theft_person       int
16     unclassified       int
17    vehicle_crime       int
18     violence_sex       int
19          weapons       int
20 total_classified       int
Query Hive Data into an R Data-Frame
> Sys.time()
[1] "2016-10-13 15:04:47 BST"
> ukcrimesum <- dbGetQuery(conn, "select * from ukcrimesum")
> Sys.time()
[1] "2016-10-13 15:05:35 BST"
> dim(ukcrimesum)
[1] 2264407      20
So, here we can see 2.2 million rows of data being queried into an R data-frame in under a minute. The query is fast as there is no Hive aggregation being performed. The example shows just a straight pull of all the data, but now we can use a where-clause to pull out the appropriate subset of data that we want to work with.
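A subset query might look like this – a minimal sketch reusing the conn handle from above, where the lsoa_name predicate is purely illustrative (substitute your own filter):

```r
# Push the filtering down to Hive rather than pulling all 2.2m rows into R;
# the lsoa_name predicate is just an illustration -- substitute your own.
qry <- paste("select * from ukcrimesum",
             "where lsoa_name like 'Westminster%'")
westminster <- dbGetQuery(conn, qry)
dim(westminster)   # only the rows matching the predicate come back
```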
Thank you so much for this helpful post. For some reason, it did not work for me when I used
drv <- JDBC("org.apache..hive.jdbc.HiveDriver", but when I used
drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", (with hadoop) it did work. However, I am still unable to execute the "conn" line of code, and am faced with the error that the URL is incorrect, even though I am certain it is correct. Do you have any idea why this might be the case? Thanks in advance.
Hi – you are very welcome! It looks like different versions of Hive have a different hierarchy in the package naming convention. There is a thread on this topic here: https://stackoverflow.com/questions/25157273/connect-from-java-to-hive-using-jdbc
– I wasn’t aware of this issue, so thanks for pointing it out.
I have
# hadoop version
Hadoop 2.6.0-cdh5.7.0
# hive --version
Hive 1.1.0-cdh5.7.0
and
drv = JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc.jar")
works for me. What version of Hadoop and Hive are you on? If you let me know I will update the blog post with the details then.
also, have you done the basics and tried ping-ing the host name from your client? Also, if you do
netstat -tulpn | grep 10000
on the hadoop node you are pointing at, can you see the Hive process listening on your target port
Hi site owner, can you please give step-by-step instructions for connecting Hive with KNIME? I'm not able to find proper documentation for that. I'm using Hive 0.13.0, Apache Hadoop 2.6.0 and the latest KNIME Analytics application.
It would be a great help if you can do this.
Sorry, I don’t know anything about KNIME, but there are a couple of posts on their support forum pages that look like they will help – i.e. https://www.knime.com/forum/knime-general/how-to-connect-to-hive
Please clear this issue immediately. My error is:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: org/apache/hive/service/cli/thrift/TCLIService$Iface
Looks like you have missing Jar file or incorrect class-path setting. Check you have all the correct files copied in place with the right permissions and your path is referencing them. Then use the debug option mentioned in my post to diagnose further. HTH.
Please tell me the correct path where I have to copy my jar files? Whether in Java or rJava. And Class path settings is in environmental variables or in RStudio? I am not sure of these things. Please clarify me… Thanks in advance…
JAR file, can get 'org/apache/hive/service/cli/thrift/TCLIService$Iface'? NO
>> ClassNotFoundException
I can’t understand this error. I am facing this issue for the past two days. please clarify me.
I got connection to hive from r. But I can’t view the table because it gives the following error.. please tell me about the issue as soon as possible.
Error in .jcall(rp, "I", "fetch", stride, block) :
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem
When I try to write a data frame to a Hive table, I get the following error:
Error in .local(conn, statement, …) :
execute JDBC update query failed in dbSendUpdate (Error while compiling statement: FAILED: ParseException line 1:53 cannot recognize input near '.' 'year_month' 'VARCHAR' in column type)
R script
> data
> dbWriteTable(jdbc_con, "year_month_master_tmp", data, overwrite=TRUE)
any idea?
Sorry, I don’t know the answer to this – will try to investigate further next week – may be a bug?
https://github.com/r-dbi/odbc/issues/36
Please let me know if you make any progress. thanks.
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Getting this error.. pls help me
Sounds like you need to check the setup of your Java class-path OR the necessary Java jar files aren’t in place on your PC. Take a look at the section in the article that discusses this and maybe try the “.jclassLoader()$setDebug(1L)” option to debug further if necessary.
Hi I am getting this error while trying to connect –
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.IllegalArgumentException: java.lang.ClassCastException: java.util.Vector cannot be cast to java.util.Hashtable
I’ve not seen this before. Please can you paste the driver-load “drv” spec and connection definition “conn” that you have used.