Pig Introduction
We have already walked through the Hadoop, HBase and Hive overviews and installations on the Windows environment in previous posts. As a brief recap, Hadoop on its own performs only batch processing, and data is accessed only in a sequential manner, which means the entire data-set has to be scanned even for the smallest task.
In a different scenario, to ease access to any point of data in a single unit of time (random access), HBase was introduced. HBase is an open-source, non-relational (NoSQL), distributed, column-oriented database that runs on top of HDFS and provides real-time read/write access to large data-sets that Hadoop alone cannot handle.
In the same sequence, we talked about Hive, an ETL tool for the Hadoop ecosystem that enables developers to write Hive Query Language (HQL) statements very similar to SQL statements. It is a data warehouse software project built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Now it is time to talk about Pig. It was initially developed by Yahoo! for its data scientists who were using Hadoop. Pig is a platform for analyzing large data-sets that consists of a high-level language for expressing data analysis programs. Its data flow language (Pig Latin) lets you write Hadoop operations without using MapReduce Java code.
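To give a flavour of this data flow style, here is a minimal sketch (the file name and schema are purely illustrative) that counts visits per URL without a single line of Java:

-- load comma-separated visit records (illustrative input)
visits = LOAD 'visits.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
-- group by URL and count the records in each group
by_url = GROUP visits BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS hits;
-- write the result to the console
DUMP counts;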
Importance of Pig
The usage of Pig has made its reputation; a few key strengths are as follows -
[1] Ease of programming
It is quite tough for someone from a non-programming background to write and execute complex Java MapReduce programs. Pig makes this process easier; the queries are converted to MapReduce jobs internally (a quick illustration using EXPLAIN follows this list).
[2] Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
[3] Extensible
Users can write their own User Defined Functions (UDFs) containing custom logic to execute over the data set (a script-side sketch appears under Embedded Mode further below).
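As a quick illustration of the first two points, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans generated for a relation, so you can inspect the translation and its optimizations without writing any Java. Using the alias from the illustrative sketch in the introduction:

EXPLAIN counts;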
Pig Installation
Pig is a layer of abstraction on top of Hadoop that simplifies its use by giving a SQL-like interface to process data on Hadoop. Before moving ahead, it is essential to install Hadoop first. I am assuming Hadoop is already installed; if not, go through my previous post on how to install Hadoop on the Windows environment.
I went with the Pig 0.17.0 version, though you can use any stable version.
Download Pig 0.17.0
- https://pig.apache.org/
STEP - 1: Extract the Pig file
Extract the file pig-0.17.0.tar.gz and place it under "D:\Pig"; you can use any preferred location –
[1] You will again get a tar file after extraction –
[2] Go inside the pig-0.17.0.tar folder and extract it again –
[3] Copy the leaf folder “pig-0.17.0”, move it to the root folder "D:\Pig", and remove all other files and folders –
STEP - 2: Configure Environment variable
Set the following Environment variable (User Variables) on Windows 10 –
- PIG_HOME - D:\Pig\pig-0.17.0
This PC -> Right Click -> Properties -> Advanced System Settings -> Advanced -> Environment Variables
STEP - 3: Configure System variable
Next, we need to set the System variable, including the Pig bin directory path –
Variable: Path
Value:
- D:\Pig\pig-0.17.0\bin
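To quickly confirm the variables took effect, open a fresh command prompt and check them (where is a standard Windows utility that locates executables on the Path):

echo %PIG_HOME%
where pig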
STEP - 4: Working with Pig command file
Now we need to cross-check the Pig command file for the Hadoop executable details –
pig.cmd
[1] Edit the file D:\Pig\pig-0.17.0\bin\pig.cmd, update the HADOOP_BIN_PATH line to the value below, and save the file.
set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec
STEP - 5: Start Hadoop
Here we need to start Hadoop first -
Open the command prompt, change directory to "D:\Hadoop\hadoop-2.8.0\sbin", and type "start-all.cmd" to start Hadoop.
It will open four instances of cmd for the following tasks –
- Hadoop Datanode
- Hadoop Namenode
- Yarn Nodemanager
- Yarn Resourcemanager
It can also be verified via the browser –
- Namenode (hdfs) - http://localhost:50070
- Datanode - http://localhost:50075
- All Applications (cluster) - http://localhost:8088 etc.
Since the ‘start-all.cmd’ command has been deprecated, you can use the below commands, in order, instead -
- “start-dfs.cmd” and
- “start-yarn.cmd”
STEP - 6: Validate Pig installation
Once Hadoop is running successfully, change directory to “D:\Pig\pig-0.17.0\bin” and verify the installation.
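For example, asking Pig for its version should report the installed release (0.17.0 here):

pig -version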
STEP - 7: Execute Pig (Modes)
Pig has been installed and is ready to execute. You can run Apache Pig in two modes, namely -
[1] Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
[2] MapReduce Mode (HDFS)
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely –
[1] Interactive Mode (Grunt shell)
You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
[2] Batch Mode (Script)
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
[3] Embedded Mode (UDF)
Apache Pig provides the facility to define our own functions (User Defined Functions) in programming languages such as Java, and to use them in our script.
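As a minimal script-side sketch, assuming a hypothetical jar myudfs.jar containing a Java UDF class com.example.pig.ToUpper, registering and calling a UDF looks like this:

-- register the jar and give the UDF a short alias (names are hypothetical)
REGISTER myudfs.jar;
DEFINE ToUpper com.example.pig.ToUpper();
-- apply it to the 'visits' relation from the introduction sketch
upper_users = FOREACH visits GENERATE ToUpper(user);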
STEP - 8: Invoke Grunt shell
Now you can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.
Local Mode
pig -x local
MapReduce Mode
pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
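If everything is wired up correctly, the prompt simply reads:

grunt>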
Congratulations, Pig is installed!! 😊
STEP - 9: Some hands-on activities
[1] Create a text file ‘student.txt’ delimited by ‘,’
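For illustration, the file could contain made-up rows such as:

1,Ravi,Delhi
2,Anita,Mumbai
3,John,Pune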
[2] Load the file into a relation ‘input_file’
input_file = LOAD 'd:/student.txt' USING PigStorage(',')
as (Sr_No:int, Student_Name:chararray, Student_Location:chararray);
[3] Check the result using the DUMP operator (writes the result to the console)
DUMP input_file;
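With the illustrative rows above, DUMP prints one tuple per line, along the lines of:

(1,Ravi,Delhi)
(2,Anita,Mumbai)
(3,John,Pune)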
[4] We can also execute the Pig script in batch mode.
Step 1 - Write all the required Pig Latin statements and commands in a single file and save it with a .pig extension.
Step 2 - Execute the file from the Grunt shell using the exec command as shown below.
exec d:/student.pig
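For reference, this student.pig would simply hold the statements from steps [2] and [3] above:

input_file = LOAD 'd:/student.txt' USING PigStorage(',')
as (Sr_No:int, Student_Name:chararray, Student_Location:chararray);
DUMP input_file;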
Stay in touch for more posts.