Oracle Free Cloud Trial – Part 4: Zeppelin Notebooks

This is Part 4 of my series of four blog articles about trying the Oracle Cloud using the 30-day / $300 Free Trial offer.  This article assumes that an Oracle Big Data Cloud Service instance has already been provisioned as described in Part 3 of this blog series.

When the Oracle Big Data Cloud environment is provisioned, a working Zeppelin Notebook service is automatically configured as part of the service.

Note – Oracle Big Data Cloud is based on a fork of the Hortonworks platform and is different from the similarly named, Cloudera-based Oracle Big Data cloud service.  Hopefully the product names will become clearer and more distinct in the future.

This is not meant to be a comparison between Jupyter Hub and Zeppelin Notebooks, but clearly the two technologies cater for the same requirement: providing a multi-user data-science / analytics environment.  In my opinion the main advantage of Zeppelin Notebooks over Jupyter is the ability to easily execute different command types within a single notebook.  Jupyter Hub probably has a bigger user base, however, and is becoming the de facto interface for Python users.

Log In to the Zeppelin Console

To log in to the Zeppelin console as the main administration user, go to the following URL in a web-browser:

https://<public_ip>:1080/#notebook/

where <public_ip> is the public IP address of one of the cluster nodes.

Connect as the user bdcsce_admin, using the admin password previously set up when the cluster was provisioned (as in Part 3).

For details on setting up multiple Zeppelin users, see here: https://zeppelin.apache.org/docs/0.6.0/security/shiroauthentication.html and here: https://shiro.apache.org/tutorial.html
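For illustration, Zeppelin's multi-user login is configured in its conf/shiro.ini file.  A minimal sketch follows – the user names, passwords and roles here are invented examples, not values from the Oracle-provisioned cluster:

```ini
# conf/shiro.ini -- minimal sketch of Shiro's ini-based user store
[users]
# format: username = password, role1, role2, ...
bdcsce_admin = changeme, admin
analyst1 = changeme, analyst

[roles]
# format: rolename = permission (here "*" grants everything)
admin = *
analyst = *
```

After editing the file, the Zeppelin server needs a restart for the new users to take effect.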

Zeppelin Main Console

Once connected and logged in, the Apache Zeppelin main console looks something like this:

1_Zepplin_Main_Screen

From this console screen you can either Create a New Note or Import a Note by selecting the appropriate button as highlighted above.

Zeppelin Quickstart Tutorial

I have created some basic “101” notebooks as an easy introduction to Zeppelin.  There are two notebooks, Tutorial 1 Notebook Basics and Tutorial 2 Working With Hive, covering the following operations:

  • Basic navigation around the Zeppelin notebook web interface
  • The different interpreters available: Scala, Python, R and SQL are included as part of the Oracle Zeppelin server default install.
  • Combining UNIX shell commands and other commands such as Hive, Python etc. in a single notebook
  • A simple Hive example working against a sample set of CitiBike Bike Trip Data
  • Staging data into HDFS from a remote web source from within the Zeppelin notebook interface.

The notebooks can be cloned from here:

https://github.com/edbullen/ZeppelinTutorial

and then loaded into Zeppelin by selecting the “Import Note” button as shown on the previous screenshot.

To run the full tutorial, please download or clone the Zeppelin notebooks – alternatively, edited highlights are listed in the following two sub-sections.

Tutorial 1 Highlights

A Zeppelin notebook is made up of multiple cells, each containing code that can be executed as a series of commands.  The output is displayed below the code for each cell.  An example with Markdown mark-up and the resultant output in a single cell is shown below:

2_zeppelin_editor_cell

In the top right of each cell there is a line of icons controlling actions against the code or text in that cell.  A subset of these is highlighted with arrows 1, 2 and 3 in the image above, where:

  • Arrow 1 highlights the “Run” control to run the block of code
  • Arrow 2 highlights the “Show/Hide” toggle to determine if the cell code is displayed (if not, only the output results are shown)
  • Arrow 3 highlights the “Cell Settings” control, allowing multiple actions including re-sizing the cell, moving its position in the notebook, inserting a new cell and clearing the output.

Each cell in the notebook allows you to specify the type of command to execute (i.e. UNIX shell, Markdown, Python etc.).  The % symbol signifies the type of interpreter to use to run the code that follows, e.g.:

  • Markdown: %md
  • Shell: %sh
  • Spark with Scala: %spark
  • Spark with Python: %pyspark
  • Spark SQL: %sql
  • Spark with R: %r
  • JDBC: %jdbc(hive) – use this for Hive queries.

EG 1 – Markdown

%md
# This is a Markdown Heading #
This is some text

EG 2 – Shell Commands

%sh
echo "running whoami"
whoami
echo ".."
echo "running pwd"
pwd

Tutorial 2 Highlights

EG 1 – Hive Query

%jdbc(hive)
select * from my_table;

EG 2 – Sourcing Data and Staging in HDFS

  • Download the Bike Data and stage it in /user/zeppelin/data/201612-citibike-tripdata
  • Use UNIX wget to write to a pipe, decompress the stream with gunzip, then pipe the unzipped output straight into HDFS.
%sh
# create the staging directory if it does not already exist
hdfs dfs -mkdir -p ./data
# remove any previous copy (-f suppresses the error if the file is absent)
hdfs dfs -rm -f ./data/201612-citibike-tripdata.csv
# stream the downloaded zip through gunzip (this works because the archive
# has a single member) straight into HDFS, with no staging on local disk
wget -qO- https://s3.amazonaws.com/tripdata/201612-citibike-tripdata.zip \
   | gunzip - \
   | hdfs dfs -put - ./data/201612-citibike-tripdata.csv
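As a side note, the stream-through-decompression idea used in the cell above can be illustrated in pure Python.  This is only a sketch using a small in-memory, single-member zip (the file name and contents are invented) in place of the real download and HDFS sink:

```python
import io
import zipfile

# Build a small single-member zip in memory, standing in for the downloaded archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("trips.csv", "TripDuration,StartTime\n680,2016-12-01 00:00:01\n")

# "Unzip the stream" without ever writing the archive to disk,
# which is what the wget | gunzip | hdfs pipeline achieves in the shell
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    csv_bytes = zf.read(zf.namelist()[0])

print(csv_bytes.decode().splitlines()[0])  # the CSV header row
```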

EG 3 – Load Data into Hive Table from HDFS

%jdbc(hive)

drop table if exists bike_trips;

CREATE TABLE bike_trips (
TripDuration int,
StartTime timestamp,
StopTime timestamp,
StartStationID string,
StartStationName string,
StartStationLatitude string,
StartStationLongitude string,
EndStationID string,
EndStationName string,
EndStationLatitude string,
EndStationLongitude string,
BikeID int,
UserType string,
BirthYear int,
Gender int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
-- skip the CSV header line so it is not loaded as a data row
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/user/zeppelin/data/201612-citibike-tripdata.csv'
into table bike_trips;


EG 4 – Hive Query to Show a Chart with Stacks or Groupings

%jdbc(hive)

select dayofweek, gender, count(*) number_trips
from (select CONCAT(date_format(`StartTime`,"u")
       , '_'
       , date_format(`StartTime`,"E")
     ) dayofweek,
 case when gender=1 then 'Male'
      when gender=2 then 'Female'
      else 'Unknown' end gender
from bike_trips) bike_times
where dayofweek is not null
group by gender, dayofweek
sort by dayofweek asc

This generates an in-line chart in the notebook as follows:

bike_trips_byDay

Use the settings link in the notebook cell (shown above) to adjust which data columns get included in the chart and use the chart-type buttons to specify the type of chart required.
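As an aside on the date_format patterns used in the query above: the Java `u` pattern gives the ISO day-of-week number (1 = Monday) and `E` gives a short day name, so the concatenated label sorts Monday first.  A pure-Python sketch of the same labelling logic (illustrative only – `day_label` is not part of the tutorial notebooks):

```python
from datetime import datetime

def day_label(ts):
    """Mimic Hive's CONCAT(date_format(t,'u'), '_', date_format(t,'E')).

    isoweekday() returns Monday=1 .. Sunday=7, matching Java's 'u' pattern,
    and strftime('%a') gives a short day name like Java's 'E' pattern.
    """
    return f"{ts.isoweekday()}_{ts.strftime('%a')}"

print(day_label(datetime(2016, 12, 5)))   # a Monday in the sample data month
```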

Acknowledgments

These short tutorial examples are based on notebooks created by David Bayard and others at Oracle:

https://github.com/oracle/learning-library/blob/master/workshops/journey2-new-data-lake/LabGuide200.md

More advanced examples

A more advanced series of lab tutorials and a sequence of Zeppelin notebooks can be found here:

https://oracle.github.io/learning-library/workshops/journey2-new-data-lake/

These tutorials cover Spark, Spark SQL, geo-mapping and streaming data into the Spark cluster using Oracle Event Hub Cloud Service (“OEHCS” AKA Kafka).
