Oracle Free Cloud Trial – Part 3: Hadoop

This is Part 3 of my series of four blog articles recording my experiences trying out the Oracle Cloud using the 30 day / $300 Free Trial offer.  In this blog-post I cover provisioning a pre-configured Hadoop service in the Oracle Cloud.

Oracle provide several Big-Data-in-the-Cloud service offerings including:

  • Big Data (enterprise Big Data based on the Cloudera stack; private cloud or public cloud)
  • Big Data Cloud (a packaged Hadoop analytics platform based on a Hortonworks fork).  NOTE: this was previously known as “Oracle Big Data Cloud Service – Compute Edition” and was later renamed to just Big Data Cloud
  • Event Hub (a Kafka platform service)

This blog-post focuses on Oracle Big Data Cloud.  The base product and configuration steps are well documented here:

https://docs.oracle.com/en/cloud/paas/big-data-compute-cloud/csspc/getting-started-big-data-cloud-service-compute-edition.html

…but navigating some of the network enablement and service configuration requires a bit of jumping around various menus and GUI screens.  The intention of this blog-post is to make this easier with a simple, highlighted step-by-step guide.

By default ssh network access to the cluster nodes is disabled; enabling ssh access is covered towards the end of this article.

There is a choice of admin portals: Ambari or an Oracle-specific admin portal – both are briefly covered.

I noticed that Sqoop is not installed by default (I installed it manually myself – this is not covered in this article), and Hive is not enabled for JDBC access on port 10000 (instead you can access it via Zeppelin Notebooks, which I will cover in the final Part 4 of these blog articles).

In summary, I found this a very useful pre-packaged Hadoop platform for data-science tasks using the Zeppelin Notebooks. If you want a broader set of classic Hadoop and Big Data tools to deploy a full enterprise Big Data processing platform, there are a lot of manual configuration and installation steps required – it might be worth looking at the other, Cloudera-based Oracle Big Data offering for that situation. However, the tight integration between Kafka for data feeds in, Oracle Database Cloud and MySQL Cloud for RDBMS data sources, and Zeppelin Notebooks for analysis and reporting makes this a very useful packaged Big Data data-science platform.

The sections covered in this article are as follows:

  • Sign In to Cloud Portal
  • Create Hadoop PaaS Service – Oracle Big Data Cloud
  • Cluster Administration Portals
  • SSH Access to Cluster Nodes
  • Network Access for Other Services
  • Billing

Sign In to Cloud Portal

Log in to the Oracle Cloud Services Portal using the Cloud Account Name and email address that you registered with the Oracle Cloud free trial service, as described in Part 1.

  1. Go to http://cloud.oracle.com
  2. Click Sign In at the top of the screen
  3. Select Cloud Account With Identity Cloud Service (not “Traditional Cloud Account”)
  4. Sign in with your Cloud Account Name (not email address)
  5. Click My Services to log in
  6. At the next login screen you do use your email address to log in.
  7. The main Cloud Services Portal should now appear and you can manage your Oracle cloud services.

Create Hadoop PaaS Service – Oracle Big Data Cloud

From the main Cloud Services Portal, access the Dashboard – this is usually the default login starting point. If not, click the Dashboard icon (screenshot: dashboard_link) in the top-right menu-bar area to see a summary of the options available and your cloud credits status.

  1. Add the Big Data Cloud (called “Big Data Compute Edition” at the time the screenshot was taken) to your dashboard for convenience: select Customize Dashboard and then click Show (screenshot: 1_add_BDCS_dashboard)
  2. Click the new Big Data Compute Edition link on your dashboard, then click “Open Service Console” (screenshot: 2_open_service_console).
    Skip the next screen offering a training video and click “Go To Console” to open the Big Data Cloud Console (screenshot: 2b_console).
  3. There are no existing services, so click the “Create Service” button (screenshot: 3_create_service).
  4. Go through a 3-step process to provision a cluster (this is also documented here):
    • Step 1 – “Service” – give the Hadoop cluster service a name and set the physical region where the cluster will be located (screenshot: 4_create_service_1)
    • Step 2 – “Details” – provide the detailed configuration information for the new Hadoop cluster (screenshot: 4_create_service_2)
      • I chose a 3-node cluster with the full install
      • Select basic authentication, not Oracle IDCS (Identity Cloud Service)
      • Create a new storage container – in this example called “HadoopClust1Storage” (check the “Create Cloud Storage Container” check-box to enable this to be created)
    • NOTE: the naming convention for the Storage Container is very specific and must be in the format “<schema name>/<container name>”, e.g. Storage-ebullen/HadoopClust1Storage, where ebullen is my identity domain (and username) and “Storage” has a capital “S”

Also, I had to allow insecure scripts to run in the browser (screenshot: 4_create_service_2b).

      • Note the Associations options available for connectivity to other Oracle Cloud Services (I didn’t use any of these in this example, but they could be added later):
        • Database Cloud Service
        • MySQL Cloud Service
        • Event Hub Cloud Service
    • Step 3 – “Confirm” – check the details and submit to create the cluster (screenshot: 4_create_service_3)
  5. Once the service is created, view the Service Console and click on the name of the new Hadoop Cluster service (screenshot: 5_service_created_1)
  6. In the Service Overview page for this Hadoop Cluster, note the details of the cluster nodes that were created, their IP addresses, and the Ambari Admin interface IP address (screenshot: 5_service_created_2)

Cluster Administration Portals

There is a choice of two administration portals for managing the Oracle Big Data Cloud Hadoop cluster: the standard Apache Ambari administration and management portal, or the Oracle Big Data Cloud Console.

Ambari Console

To use Ambari (configured on TCP port 8080), you need to allow port 8080 through the firewall.  See the Accessing Big Data Cloud Using Ambari documentation for details on how to do this.

Once port 8080 is open to the Hadoop cluster node that is running Ambari, connect to the Ambari IP address on port 8080 using a web browser.  NOTE: the connection only works via https.

Log in with the BDCS Admin username and password set previously.  The default user is bdcsce_admin.

(screenshot: 6_launch_Ambari)
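As a quick sanity check from the command line, you can also query the Ambari REST API with curl. This is a minimal sketch: the bdcsce_admin user and port come from the setup above, but the <ambari-ip> placeholder stands in for whatever Ambari IP address you noted in the Service Overview page, and -k is needed because the service presents a self-signed certificate:

    # List the clusters managed by this Ambari instance
    # -k: accept the self-signed certificate; -u: HTTP basic auth
    curl -k -u bdcsce_admin:<password> https://<ambari-ip>:8080/api/v1/clusters

A JSON response listing your cluster confirms that both the network rule and your credentials are working.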

Oracle Big Data Cloud Console

As an alternative to using Ambari, connect to the Oracle BDCS Console web address https://<IP-addr>:1080/ and use the same BDCS Admin username and password set previously (the default user is bdcsce_admin).  There is no need to open any ports for this (1080 is open by default).

(screenshot: 7_launch_Oracle_BDCSE_Console)
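If the console page doesn’t load, a quick reachability check from the command line helps distinguish a network problem from a login problem. A small sketch using curl (again, -k accepts the self-signed certificate):

    # A HEAD request: any HTTP response (even a redirect to the login
    # page) shows that port 1080 on the node is reachable
    curl -k -I https://<IP-addr>:1080/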

SSH Access to Cluster Nodes

Network access for direct SSH shell login to the cluster nodes is disabled by default.  For further details on network access rules see the documentation.

To enable SSH access to the cluster nodes so that you can connect as the UNIX user opc:

Go to the BDCS Cloud Console, select the “hamburger menu” at the top, then Access Rules (screenshot: 8_ssh_access_1). Then, enable the ora_p2bdcsce_ssh rule (screenshot: 8_ssh_access_2).

Use ssh (or PuTTY) as user opc to connect to one of the Hadoop cluster nodes, then sudo to user oracle or root as appropriate for the task being performed.
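A minimal example, assuming your private key is in the default ~/.ssh/id_rsa location (it must match the public key supplied when the cluster was created), substituting one of the node IP addresses from the Service Overview page:

    # Connect as opc using the key registered at cluster creation
    ssh -i ~/.ssh/id_rsa opc@<node-ip>

    # Once logged in, switch to the oracle user for Hadoop tasks...
    sudo su - oracle

    # ...or to root for OS-level administration
    sudo su -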

For HDFS command-line operations (hdfs dfs -put, hdfs dfs -get, etc.), try working as the user oracle to get started, as the opc user is not set up with an HDFS home directory by default.
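For example, a few basic HDFS operations run as the oracle user (the directory and file names here are purely illustrative):

    # Create a working directory in HDFS and copy a local file into it
    hdfs dfs -mkdir -p /user/oracle/demo
    hdfs dfs -put /tmp/sample.csv /user/oracle/demo/

    # Confirm it arrived, then copy it back out
    hdfs dfs -ls /user/oracle/demo
    hdfs dfs -get /user/oracle/demo/sample.csv /tmp/sample_copy.csv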

Network Access for Other Services

Any network services beyond HTTP, SSH and the administration consoles need to be manually configured by creating a new network rule, as documented here.

(screenshot: 10_new_net_rule1) Create a rule allowing traffic from a Source of PUBLIC-INTERNET on a given port / port-range to a Destination class – you may need to create multiple rules for the same port to different destinations in the cluster (screenshot: 10_new_net_rule2).
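Once a rule is in place, a quick way to confirm the port is actually reachable from outside is a simple TCP connect test. A sketch assuming the nc (netcat) utility is available on your client machine; port 10000 here is just an example:

    # Test whether a TCP port on a cluster node accepts connections
    # -z: just scan, don't send data; -v: verbose output
    nc -zv <node-ip> 10000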

Billing

When you have finished using the cluster, don’t forget to shut down the service from the console to avoid being billed for it.

The process for stopping and starting the whole cluster is documented here.

From the documentation:

When you stop a cluster, you can’t access the cluster and you can’t perform management operations on it except to start the cluster or delete it. Stopping a cluster is like pausing it. You won’t be billed for compute resources, but you will be billed for storage.

The current prices for the service are listed at https://cloud.oracle.com/en_US/big-data-cloud/pricing

After approximately 1 day, my usage billing looked like this:

(screenshot: 9_billing_1day)

Here, B88307 is the billing for compute and B88306 is the billing for storage.  The compute element is charged at a much higher rate, so it makes sense to shut down the cluster whenever possible.
