Big Data – Introduction
In today's IT world, Big Data is the latest watchword. Big Data is a term used for collections of large and complex data sets. These massive volumes of data are so huge and chaotic that they are difficult to store and process with traditional data processing applications or database management tools.
Big Data is everywhere these days. Some tremendous real-life big data examples –
- Retail companies handle millions of customer transactions every hour.
- It is believed that a single jet engine can generate more than 10 terabytes of data within thirty minutes of flight time.
- On social media such as Facebook, statistics indicate that more than 500 TB of new data gets ingested into the databases, and more than 230 million tweets are created, every day.
- Modern cars have close to 100 sensors that monitor fuel level, tire pressure, etc., and generate a lot of sensor data.
- Stock exchanges such as the New York Stock Exchange generate about one terabyte of new trade data per day.
Big Data – Sources
In reality, whenever someone opens an application on their phone, surfs the internet, searches for something specific in a search engine, or performs many other everyday activities, a piece of data is gathered. In brief, the following are the major sources of big data –
- Social networking sites like Facebook, Google, LinkedIn, Twitter etc.
- E-commerce sites like Amazon, Flipkart etc.
- Weather stations like the India Meteorological Department.
- Telecom companies like Airtel, Vodafone, Idea etc.
- Share markets like NSE, BSE etc.
- Medical records, etc.
Big Data – Categories
Big Data can be of three types –
[1] Structured
Any data that can be stored, accessed, and processed in a pre-defined format is called Structured Data. For example, an RDBMS is one of the best examples of structured data, where we can manage data with the help of a defined schema and process it using the SQL language.
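As a quick illustration of structured data, here is a minimal sketch using Python's built-in sqlite3 module; the table name and sample rows are made up for the example:

```python
import sqlite3

# Structured data: a pre-defined schema, stored and queried with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("laptop", 899.0), ("phone", 499.0), ("laptop", 949.0)],
)

# Because the structure is fixed, SQL can aggregate the rows directly
for product, total in conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"):
    print(product, total)
# laptop 1848.0
# phone 499.0
```

The fixed schema (known columns and types) is exactly what makes this data "structured" and easy for traditional tools to process.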
[2] Unstructured
Any data whose form or structure is unknown, which cannot be stored in an RDBMS, and which is not easy to analyze, is classified as Unstructured Data. For example, a Google search returns a combination of heterogeneous data: text files, images, audio, video, etc.
[3] Semi-structured
This is data that does not have the formal structure of an RDBMS table, but still has some organizational properties and elements. For example, an API that generates XML or JSON output.
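To illustrate, here is a small sketch of handling semi-structured data in Python; the JSON payload is a made-up example of what an API might return:

```python
import json

# Semi-structured: no rigid schema, but tagged fields give it organization
payload = '{"user": "raj", "tags": ["azure", "hadoop"], "profile": {"city": "Delhi"}}'
record = json.loads(payload)

# Fields can be navigated by name even though no table schema exists
print(record["user"])             # raj
print(record["profile"]["city"])  # Delhi
print(len(record["tags"]))        # 2
```

Notice that another record could carry extra or missing fields without breaking anything; that flexibility is what separates semi-structured data from the rigid schema of an RDBMS.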
Big Data – Four V’s
Four precise terms, known as the four V's, are associated with big data; they define its characteristics and help make the definition of big data even clearer.
[1] Velocity
Velocity is one of the major characteristics of big data; it refers to the pace at which data is generated by different sources every day. It deals with the speed at which data flows in from sources and is processed to meet demand, which determines the real potential in the data.
[2] Volume
Volume refers to the massive amount of data that grows day by day at a very fast pace from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data.
[3] Veracity
Doubtful or uncertain data leads to inconsistency and incompleteness; this is what Veracity describes. For big data to be of worth to an organization, make sure it is correct.
[4] Variety
Variety refers to the assorted sources and nature of data: structured, unstructured, and semi-structured. In brief, big data can be varied; it can exist in different forms such as images, audio, video, and sensor data.
Azure HDInsight and Hadoop cluster – Introduction
Azure HDInsight is a Hadoop service offering based on the Hortonworks Data Platform (HDP) and hosted on the Azure cloud. It is a cloud-based, fully managed, full-spectrum, open-source analytics service for enterprises to process massive amounts of data. HDInsight supports a broad range of scenarios, such as batch processing for extract, transform, and load (ETL), data warehousing, machine learning, the internet of things (IoT), data science, etc.
Apache Hadoop is an open-source distributed data processing framework that uses HDFS for storage, YARN for resource management, and the simple MapReduce programming model to process and analyze batch data in parallel.
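The MapReduce model can be sketched locally in a few lines of Python. This toy word count mimics the map, shuffle, and reduce phases that Hadoop runs in parallel across the cluster; the input lines are, of course, made up:

```python
from collections import defaultdict

lines = ["big data is big", "hadoop processes big data"]

# Map phase: emit a (word, 1) pair for every word in every input line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key, as Hadoop does
# between the map and reduce stages
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
result = {word: sum(counts) for word, counts in groups.items()}
print(result)
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

On a real cluster, the map and reduce steps run on many worker nodes at once over HDFS blocks; the logic, however, is exactly this simple.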
You can visit my previous post for some hands-on activity with Hadoop on Windows 10.
Azure HDInsight deploys and provisions Apache Hadoop clusters on top of Azure cloud, providing a software framework designed to manage, analyze, and report on big data with high availability and utilization.
Next, you will provision an HDInsight cluster, run a sample MapReduce job on it, and check the results.
Provisioning and Configuring an HDInsight Cluster
Pre-requisites
- An Azure subscription; if you don't have an account, sign up for a free Azure account – https://azure.microsoft.com/en-gb/free/
- PuTTY
- A sample file containing a large amount of data.
STEP - 1
Log in to the Azure portal – https://portal.azure.com/
Click ‘+ Create a resource’ in the left-hand menu; you will find the Analytics category under the Azure Marketplace tab. Click the HDInsight link under the Featured category –
After selecting HDInsight, a new blade to create an HDInsight cluster appears with the following categories –
STEP - 2
On the Basics tab, make sure the correct subscription is selected and submit the appropriate details as follows –
- Cluster Name – Enter a unique name.
- Subscription – Select your Azure subscription.
- Cluster Type – Hadoop.
- Operating System – Linux.
- Version – Select the default, which is most likely the latest version of Hadoop.
- Cluster Login Username – Enter a user name of your choice.
- Cluster Login Password – Enter a strong password.
- SSH Username – Enter another user name of your choice (to access the cluster remotely).
- SSH Password – You can reuse the password above.
Next, choose an existing resource group or create a new one using Create new. I am going with my earlier-created resource group ‘rajResource’.
Then I kept the default data center location, East US 2, and clicked the Next button.
STEP - 3
After clicking the Next button, the Storage blade appears, where you need to submit the following details –
- Primary storage type – Azure Storage
- Selection Method – My Subscriptions
- Storage account – Either select an existing one or create a new storage account.
- Default Container – Enter a new name or go with default selection with cluster name.
Apart from this, leave the remaining two options, Additional storage accounts and Data Lake Storage Gen1 access, with their default (optional) selections.
You don't need to provide any input for the Metastore Settings either; it is an optional setting, so just click the Next button.
STEP - 4
After clicking the Next button, the Cluster size blade appears, where you choose the number and size of the nodes for the HDInsight cluster you are about to create.
In fact, Azure HDInsight clusters are billed on a per-minute basis, and a cluster runs a group of nodes whose type (for example, Worker Node, Head Node) and quantity vary by component, so we choose the smallest available size for demo purposes.
Visit the Microsoft Azure official pricing page for more details. I went with two Worker Nodes and clicked the Next button –
After clicking Next you will get the Script actions blade; just leave the default selection, since it is optional input, and click the Next button –
STEP - 5
Next, the Cluster summary blade appears; once validation succeeds, you can see the details of the HDInsight cluster you are about to create.
If everything looks fine and you are ready, click the Create button. It will take a while for the cluster to be provisioned and for the status to show as Running (a good time to have a cup of tea! ☕).
Note: As soon as an HDInsight cluster is running, your Azure subscription starts being charged. So after the demo lab, do not forget to clean up the resources to avoid using your Azure credit unnecessarily.
You will get a notification once the HDInsight cluster has been provisioned successfully.
Congratulations, the HDInsight cluster is deployed!! 😊 Time to connect to the cluster.
View Cluster details in the Azure Portal
In the Microsoft Azure portal, select your resources and go to the HDInsight Cluster blade, where a summary of your newly created cluster appears.
On the HDInsight Cluster blade, you can also change size settings, such as scaling the number of worker nodes to meet processing demand.
Cluster dashboards
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It is a fully open-source Apache project and a graphical interface to Hadoop. You can explore the dashboard for your cluster using this web application.
Click the Ambari home link under the Cluster dashboards section; it will redirect you to the cluster portal and ask you to log on. Make sure to provide the cluster username and password, not the SSH username.
For example, in my case – https://hdinsightdemo.azurehdinsight.net/
After a successful login, the web application displays the HDInsight cluster dashboard, where you can see all the running big data components and their details –
Connecting to an HDInsight Cluster
The HDInsight cluster has been provisioned, and you have explored the dashboard via the Ambari web application. Now it's time to connect to the HDInsight cluster using an SSH client such as PuTTY.
Since I am using a Windows-based computer, I will use the PuTTY application to connect to the cluster.
STEP – 1
Go to the Azure portal and select the HDInsight Cluster blade; click the SSH + Cluster login link under the Settings category.
STEP – 2
Next, select the hostname; it will display the endpoint through which the cluster can be connected.
In my case hostname is something like - HDInsightDemo-ssh.azurehdinsight.net
STEP – 3
Open PuTTY, and in the Session page, enter the host name (the earlier copied hostname) into the Host Name box. Then, under Connection type, select SSH and click Open.
If you get a security alert saying that the server’s host key is not cached in the registry and asking whether you want to connect or abandon the connection, simply click Yes to continue.
STEP – 4
After the security alert, when prompted, enter the SSH username and password you specified while provisioning the cluster; make sure to use the SSH username, not the cluster login username.
After authorization, you will be connected to the HDInsight cluster console, where you can see a few details, such as the Ubuntu 16.04.5 LTS server on top of which the HDInsight cluster is running.
Congratulations, the Azure HDInsight cluster is connected!! 😊
In the next post we will connect to the same cluster and process big data via Hadoop, with some precise hands-on activities like –
- Browse cluster storage
- Run a MapReduce job
- Upload and process data files etc.
Stay in touch!! 😊