Pig Introduction
We have already walked through the Hadoop, HBase and Hive overviews and installations on the Windows environment in previous posts. As a brief recap, Hadoop on its own performs only batch processing, and data is accessed only in a sequential manner, which means the entire data-set has to be scanned even for the smallest task.
In a different scenario, to ease access to any point of data in a single unit of time (random access), HBase was introduced. HBase is an open-source, non-relational (NoSQL), distributed, column-oriented database that runs on top of HDFS and provides real-time read/write access to large data-sets that Hadoop alone cannot handle.
In the same sequence, we talked about Hive, an ETL tool for the Hadoop ecosystem that enables developers to write Hive Query Language (HQL) statements very similar to SQL statements. It is a data warehouse software project built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Now it is time to talk about Pig. It was initially developed by Yahoo! for its data scientists who were using Hadoop. Pig is a platform for analyzing large data-sets that consists of a high-level language for expressing data analysis programs. Its data flow language (Pig Latin) lets you write Hadoop operations without using MapReduce Java code.
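To give a flavour of this data flow style, here is a minimal sketch (the file name and schema are purely illustrative) that counts visits per URL without a single line of Java:

-- load comma-separated visit records (illustrative input)
visits = LOAD 'visits.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
-- group by URL and count the records in each group
by_url = GROUP visits BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS hits;
-- write the result to the console
DUMP counts;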
Importance of Pig
The usage of Pig has made its reputation; a few key strengths are as follows -
[1] Ease of programming
It is quite tough for someone from a non-programming background to write and execute complex Java MapReduce programs. Pig makes this process easier; the queries are converted to MapReduce jobs internally (a quick illustration using EXPLAIN follows this list).
[2] Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
[3] Extensible
Users can write their own User Defined Functions (UDFs) containing custom logic to execute over the data set (a script-side sketch appears under Embedded Mode further below).
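As a quick illustration of the first two points, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans generated for a relation, so you can inspect the translation and its optimizations without writing any Java. Using the alias from the illustrative sketch in the introduction:

EXPLAIN counts;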
Pig Installation
Pig is a layer of abstraction on top of Hadoop that simplifies its use by giving a SQL-like interface to process data on Hadoop. Before moving ahead, it is essential to install Hadoop first. I am assuming Hadoop is already installed; if not, go through my previous post on how to install Hadoop on the Windows environment.
I went with the Pig 0.17.0 version, though you can use any stable version.
Download Pig 0.17.0
- https://pig.apache.org/
STEP - 1: Extract the Pig file
Extract the file pig-0.17.0.tar.gz and place it under "D:\Pig"; you can use any preferred location –
[1] You will again get a tar file after extraction –
[2] Go inside the pig-0.17.0.tar folder and extract it again –
[3] Copy the leaf folder “pig-0.17.0”, move it to the root folder "D:\Pig", and remove all other files and folders –
STEP - 2: Configure Environment variable
Set the following Environment variable (User Variables) on Windows 10 –
- PIG_HOME - D:\Pig\pig-0.17.0
This PC -> Right Click -> Properties -> Advanced System Settings -> Advanced -> Environment Variables
STEP - 3: Configure System variable
Next, we need to set the System variable, including the Pig bin directory path –
Variable: Path
Value:
- D:\Pig\pig-0.17.0\bin
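To quickly confirm the variables took effect, open a fresh command prompt and check them (where is a standard Windows utility that locates executables on the Path):

echo %PIG_HOME%
where pig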
STEP - 4: Working with Pig command file
Now we need to cross-check the Pig command file for the Hadoop executable details –
pig.cmd
[1] Edit the file D:\Pig\pig-0.17.0\bin\pig.cmd, update the HADOOP_BIN_PATH line to the value below, and save the file.
set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec
STEP - 5: Start Hadoop
Here we need to start Hadoop first -
Open the command prompt, change directory to "D:\Hadoop\hadoop-2.8.0\sbin", and type "start-all.cmd" to start Hadoop.
It will open four instances of cmd for the following tasks –
- Hadoop Datanode
- Hadoop Namenode
- Yarn Nodemanager
- Yarn Resourcemanager
It can also be verified via the browser –
- Namenode (hdfs) - http://localhost:50070
- Datanode - http://localhost:50075
- All Applications (cluster) - http://localhost:8088 etc.
Since the ‘start-all.cmd’ command has been deprecated, you can use the below commands, in order, instead -
- “start-dfs.cmd” and
- “start-yarn.cmd”
STEP - 6: Validate Pig installation
Once Hadoop is running successfully, change directory to “D:\Pig\pig-0.17.0\bin” and verify the installation.
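For example, asking Pig for its version should report the installed release (0.17.0 here):

pig -version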
STEP - 7: Execute Pig (Modes)
Pig has been installed and is ready to execute. You can run Apache Pig in two modes, namely -
[1] Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
[2] MapReduce Mode (HDFS)
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely –
[1] Interactive Mode (Grunt shell)
You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
[2] Batch Mode (Script)
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
[3] Embedded Mode (UDF)
Apache Pig provides the facility to define our own functions (User Defined Functions) in programming languages such as Java, and to use them in our script.
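As a minimal script-side sketch, assuming a hypothetical jar myudfs.jar containing a Java UDF class com.example.pig.ToUpper, registering and calling a UDF looks like this:

-- register the jar and give the UDF a short alias (names are hypothetical)
REGISTER myudfs.jar;
DEFINE ToUpper com.example.pig.ToUpper();
-- apply it to the 'visits' relation from the introduction sketch
upper_users = FOREACH visits GENERATE ToUpper(user);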
STEP - 8: Invoke Grunt shell
Now you can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.
Local Mode
pig -x local
MapReduce Mode
pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
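If everything is wired up correctly, the prompt simply reads:

grunt>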
Congratulations, Pig is installed!! 😊
STEP - 9: Some hands-on activities
[1] Create a text file ‘student.txt’ delimited by ‘,’
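For illustration, the file could contain made-up rows such as:

1,Ravi,Delhi
2,Anita,Mumbai
3,John,Pune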
[2] Load the file into a relation ‘input_file’
input_file = LOAD 'd:/student.txt' USING PigStorage(',')
as (Sr_No:int, Student_Name:chararray, Student_Location:chararray);
[3] Check the result using the DUMP operator (writes the result to the console)
DUMP input_file;
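With the illustrative rows above, DUMP prints one tuple per line, along the lines of:

(1,Ravi,Delhi)
(2,Anita,Mumbai)
(3,John,Pune)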
[4] We can also execute the Pig script in batch mode.
Step 1 - Write all the required Pig Latin statements and commands in a single file and save it with a .pig extension.
Step 2 - Execute the file from the Grunt shell using the exec command as shown below.
exec d:/student.pig
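For reference, this student.pig would simply hold the statements from steps [2] and [3] above:

input_file = LOAD 'd:/student.txt' USING PigStorage(',')
as (Sr_No:int, Student_Name:chararray, Student_Location:chararray);
DUMP input_file;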
Stay in touch for more posts.