Thursday, November 29, 2018

Connecting Azure HDInsight Cluster and Hadoop hands-on activity






In my previous post I explained the concept of Big Data and HDInsight, as well as provisioning a cluster on top of the Azure cloud. Moving onwards, here we will connect to the same cluster and go through some precise hands-on activities, like processing Big Data via Hadoop in a few different ways.

Since a provisioned HDInsight cluster is a pre-requisite to complete these tasks, you can either select an existing HDInsight cluster or create a new one. Please refer to my previous post, titled Big Data and HDInsight Cluster – provisioning on top of Azure cloud, to provision a new HDInsight cluster.

Here you will cover the following Hadoop hands-on activities on top of an Azure HDInsight Cluster – 
  • Connect to an HDInsight cluster
  • Browse cluster storage
  • Execute commands to explore the HDFS file system
  • Upload and process data files
  • Run MapReduce jobs using the built-in example functions


Connecting to an HDInsight Cluster


I already provisioned an HDInsight cluster in my previous post, so moving ahead, log on to the Azure portal again to fetch the SSH details - https://portal.azure.com/

Since I am using a Windows-based computer, I will choose the PuTTY application to connect to the cluster.

STEP – 1 

Go to the Azure portal and select the HDInsight Cluster blade, then click the SSH + Cluster login link under the Settings category, which will open the SSH + Cluster login blade.



STEP – 2

Next, select the Hostname; it will display the endpoint through which the cluster can be connected.

In my case hostname is something like - HDInsightDemo-ssh.azurehdinsight.net


STEP – 3

Open PuTTY, and on the Session page, enter the host name (copied earlier) into the Host Name box. Then, under Connection type, select SSH and click Open.



If you get a security alert saying that the server’s host key is not cached in the registry and asking whether you want to connect or abandon the connection, simply click Yes to continue.
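By the way, if you are working from macOS, Linux, or Windows 10 with the built-in OpenSSH client, you can skip PuTTY entirely and connect with a plain ssh command. A minimal sketch, assuming your SSH username is sshuser (replace it with the one you chose during provisioning):

ssh sshuser@HDInsightDemo-ssh.azurehdinsight.net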



STEP – 4

When prompted after the security alert, enter the SSH username and password you specified while provisioning the cluster, and make sure to submit the SSH username, not the cluster login username. 



After authorization, you will be connected to the HDInsight cluster console, where you can see a few details, such as the Ubuntu 16.04.5 LTS server on top of which the HDInsight cluster is running.


 Congratulations, Azure HDInsight Cluster connected!! 😊

Some hands-on activities with the HDInsight cluster 


Since you have already opened an SSH console to the created cluster, you can now use it to work with the cluster's shared storage system. Hadoop uses a file system named HDFS, which in Azure HDInsight clusters is implemented as a blob container in Azure Storage.
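Because the default file system is backed by a blob container, the same paths can be addressed either as plain HDFS paths or through the wasb:// URI scheme. Both commands below should list the same folder, assuming the default storage container:

hdfs dfs -ls /example/data
hdfs dfs -ls wasb:///example/data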

Next, it's time to do some hands-on activities on top of the cluster using Hadoop commands; keep in mind that the commands are case-sensitive.

Browse Cluster Storage


Task – 1  

Execute the following command to view the contents of the root folder in the HDFS file system. 

hdfs dfs -ls /
 

Task – 2 

Execute the following command to view the contents of the /example folder in the HDFS file system. This folder contains sub-folders for sample apps, data, and JAR components.

hdfs dfs -ls /example



Task – 3

Execute the following command to view the contents of the /example/data/gutenberg folder, which contains sample text files.

hdfs dfs -ls /example/data/gutenberg



Task – 4
Execute the following command to view the text in the davinci.txt file on the console.

hdfs dfs -text /example/data/gutenberg/davinci.txt
  
 


You can see the file contains a large volume of unstructured text.
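Since the file is quite large, a handy trick is to stream only the beginning of it instead of dumping the whole file on the console, for example:

# show just the first 20 lines of the file
hdfs dfs -cat /example/data/gutenberg/davinci.txt | head -n 20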

Run a MapReduce Job


MapReduce is a framework through which you can write applications to process huge amounts of data. It is a processing technique and a programming model for distributed computing based on Java. 

MapReduce essentially refers to two distinct tasks that Hadoop programs perform to distribute the processing of data across nodes in the cluster. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples structured as key-value pairs.

Next, the reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce suggests, the reduce job is always performed after the map job.
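For intuition, the word-count flow can be mimicked on a single machine with an ordinary shell pipeline. This is purely an illustration of the map, shuffle, and reduce phases, not how Hadoop actually executes the job, and it assumes a local copy of davinci.txt:

# map: emit one word per line
# shuffle: sort brings identical words together
# reduce: uniq -c counts each group
tr -s ' ' '\n' < davinci.txt | sort | uniq -c | sort -rn | head -10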

Task – 1

Execute the following command to view the sample Java JARs stored on the cluster head node.

ls /usr/hdp/current/hadoop-mapreduce-client




Task – 2

Execute the following command to get a list of the MapReduce functions available in hadoop-mapreduce-examples.jar.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar


Task – 3

Execute the following command to get help for the wordcount function in the hadoop-mapreduce-examples.jar that is stored on the cluster head node.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount


Task – 4

Execute the following command to run a MapReduce job using the wordcount function in hadoop-mapreduce-examples.jar to process the davinci.txt file you viewed earlier, and store the results of the job in the /example/demoresults folder.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/demoresults

The MapReduce wordcount job will start processing promptly.


As soon as the MapReduce job completes, you can see the related details appear on the console.
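While the job is running (or after it finishes), you can also check its status from the YARN side. A small sketch; the exact flags can vary slightly between Hadoop versions:

# list finished YARN applications, including the wordcount job
yarn application -list -appStates FINISHED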

Task – 5

After the MapReduce job completes, execute the following command to view the output folder. You will notice a file named part-r-00000, which has been created by the job.

hdfs dfs -ls /example/demoresults



Task – 6

Execute the following command to view the results in the output file part-r-00000.

hdfs dfs -text /example/demoresults/part-r-00000
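Since the raw output is just word-count pairs in no particular order, you can pipe it through sort to see the most frequent words, for example:

# show the ten most frequent words, highest count first
hdfs dfs -text /example/demoresults/part-r-00000 | sort -k2 -nr | head -10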



Uploading and Processing Data Files


In the previous hands-on activities, you executed a MapReduce job on a sample file that is provided with HDInsight. Now you will upload data to the Azure blob store, process it further with Hadoop, and then download the results for analysis on your local computer.

Task – 1

You need a reasonably large text file to complete this task; you can either create one or download it. I am going to download one, for example, some product reviews – 


Task – 2

Since you need to upload this file to the Azure Blob Storage of the HDInsight cluster, you can either use Azure Storage Explorer or go ahead manually via the Azure portal.
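As a command-line alternative, if the file is already on the cluster head node (for example, copied over with scp or pscp), you can put it into HDFS directly from the SSH console. A quick sketch, assuming reviews.txt sits in your SSH home directory:

# create the target folder in HDFS and copy the local file into it
hdfs dfs -mkdir -p /demofiles
hdfs dfs -put ~/reviews.txt /demofiles/

The rest of this task walks through the portal route.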

Go to the Azure portal and select the HDInsight cluster storage account.


Next, move inside the Blobs section.



After selecting the default container, you can see the HDInsight container blade, where the HDFS files and folders are displayed.


Task – 3

Click the Upload link inside the Container blade and upload the previously downloaded sample file reviews.txt under a new folder named demofiles.

Leave all other options at their default selections and proceed to click the Upload button in the Upload blob blade.


Soon you will get an upload success acknowledgement.



Task – 4

Switch to the SSH console for your HDInsight cluster and execute the following commands to list the uploaded file and view its contents.

hdfs dfs -ls /demofiles
hdfs dfs -text /demofiles/reviews.txt
  




Task – 5

Execute the following command to run a MapReduce job using the wordcount function in hadoop-mapreduce-examples.jar to process the uploaded file reviews.txt and store the results of the job in the /demofiles/results folder.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /demofiles/reviews.txt /demofiles/results

The MapReduce job will start processing the file promptly and will complete shortly.




Task – 6

Execute the following command to view the output folder /demofiles/results, and verify that a file named part-r-00000 has been created by the job.

hdfs dfs -ls /demofiles/results



Task – 7

Next, move to the Azure portal, go inside the HDInsight container, and verify the results folder and output files.


Task – 8

Click the part-r-00000 file and you can see summary details about the blob; next, click the Download link to download it to your local computer.
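Alternatively, you can stay on the command line: pull the output file from HDFS to the head node with hdfs dfs -get, then copy it down with pscp (which ships with PuTTY). A sketch, again assuming the SSH username sshuser:

# on the cluster head node: copy the file from HDFS to the local file system
hdfs dfs -get /demofiles/results/part-r-00000 ~/part-r-00000

# on your Windows computer: fetch it over SSH
pscp sshuser@HDInsightDemo-ssh.azurehdinsight.net:part-r-00000 .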



Task – 9

The part-r-00000 text file is a tab-delimited file, so you can use either a spreadsheet application or a normal text editor to see the word counts.

I am opening the file using the Notepad++ editor.
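Instead of eyeballing the file, you can also total up all the word occurrences with a quick awk one-liner, either on the head node or locally if awk is available:

# sum the counts in the second (tab-delimited) column
awk -F'\t' '{ total += $2 } END { print total }' part-r-00000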



Keep visiting for further articles! 👍
