Thursday, November 15, 2018

Pig installation on Windows 10


Apache Pig


Pig Introduction


We have already walked through the outline and installation of Hadoop, HBase and Hive on Windows in previous posts. To recap briefly, Hadoop can perform only batch processing, and data is accessed only sequentially, which means the entire data set has to be scanned even for a single lookup.

In a different scenario, to allow access to any point of data in a single unit of time (random access), HBase was introduced. HBase is an open source, non-relational (NoSQL), distributed, column-oriented database that runs on top of HDFS and provides real-time read/write access to large data sets, which Hadoop alone cannot offer.

Next, we talked about Hive, an ETL tool for the Hadoop ecosystem that enables developers to write Hive Query Language (HQL) statements very similar to SQL statements. It is a data warehouse software project built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Now it is time to talk about Pig. It was initially developed by Yahoo! for its data scientists who were using Hadoop. Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. That language, Pig Latin, is a data flow language for writing Hadoop operations without writing MapReduce code in Java.

Importance of Pig


Pig owes its reputation to how it is used; a few key points are as follows -

[1] Ease of programming
It is quite tough for someone from a non-programming background to write and execute complex Java programs for MapReduce. Pig makes this process easier; the queries are converted to MapReduce jobs internally.

[2] Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

[3] Extensible
Users can write their own user-defined functions (UDFs) to run custom logic over the data set.
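To make the "converted to MapReduce internally" point more concrete, here is a small conceptual sketch in plain Python of the map, shuffle and reduce phases behind a simple "count students per location" job. It is only an illustration, not Pig's actual execution plan; the records and function names are made up for this example. In Pig Latin the same result would typically take a GROUP ... BY followed by a FOREACH ... GENERATE with COUNT.

```python
from collections import defaultdict

# Hypothetical records: (sr_no, student_name, student_location)
records = [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meena", "Pune")]


def map_phase(rows):
    # Emit one (key, value) pair per input row: (location, 1)
    for _, _, location in rows:
        yield location, 1


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Collapse each key's list of values into a single count
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Writing, packaging and submitting these phases by hand in Java is the work that Pig takes off your plate.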

Pig Installation


Pig is a layer of abstraction on top of Hadoop that simplifies its use by giving a SQL-like interface to process data on Hadoop. Before moving ahead, it is essential to install Hadoop first. I am assuming Hadoop is already installed; if not, go to my previous post on how to install Hadoop on Windows.

I used Pig version 0.17.0, though you can use any stable version.

Download Pig 0.17.0
  • https://pig.apache.org/

STEP - 1 : Extract the Pig file


Extract the file pig-0.17.0.tar.gz and place it under "D:\Pig" (you can use any preferred location) –
[1] You will get a tar file again after extraction –


Pig initial file

[2] Go inside the pig-0.17.0.tar folder and extract again –

Extract the file

[3] Copy the leaf folder “pig-0.17.0”, move it to the root folder "D:\Pig" and remove all other files and folders –

Local Pig Folder

Extracted Pig Files


STEP - 2: Configure Environment variable


Set the following Environment variable (User Variables) on Windows 10 –
  • PIG_HOME - D:\Pig\pig-0.17.0
This PC -> Right Click -> Properties -> Advanced System Settings -> Advanced -> Environment Variables

Environment variables

STEP - 3: Configure System variable



Next, edit the System variable Path to include the Pig bin directory –

Variable: Path 
Value: 
  • D:\Pig\pig-0.17.0\bin
System Variable
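If you want to double-check both variables without clicking through the dialogs again, a short Python sketch like the one below can help. It assumes Python is available on the machine; the function name check_pig_env is made up for this example. It only checks that PIG_HOME is set and that its bin folder appears in Path, comparing paths case-insensitively the way Windows does.

```python
import ntpath  # Windows path rules, importable on any OS
import os


def check_pig_env(env, pathsep=";"):
    """Return a list of problems found in a Windows-style environment mapping."""
    pig_home = env.get("PIG_HOME", "")
    if not pig_home:
        return ["PIG_HOME is not set"]

    def norm(path):
        # Ignore case, slash direction and trailing separators
        return ntpath.normcase(ntpath.normpath(path))

    expected_bin = ntpath.join(pig_home, "bin")
    raw_path = env.get("Path", env.get("PATH", ""))
    entries = [norm(entry) for entry in raw_path.split(pathsep) if entry]

    problems = []
    if norm(expected_bin) not in entries:
        problems.append("Path does not contain " + expected_bin)
    return problems


if __name__ == "__main__":
    # Open a new command prompt first so the updated variables are visible
    print(check_pig_env(os.environ) or "PIG_HOME and Path look consistent")
```

An empty list means both settings from STEP - 2 and STEP - 3 are in place.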

STEP - 4: Working with Pig command file


Now cross-check the Pig command file for the Hadoop executable details –
Pig.cmd


[1] Edit the file D:/Pig/pig-0.17.0/bin/pig.cmd, make the change below and save the file.
set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec

Pig.cmd

STEP - 5: Start the Hadoop


Hadoop needs to be started first -

Open a command prompt, change directory to "D:\Hadoop\hadoop-2.8.0\sbin" and type "start-all.cmd" to start Hadoop.

Start Hadoop


It will open four cmd windows for the following tasks –
  • Hadoop Datanode
  • Hadoop Namenode
  • Yarn Nodemanager
  • Yarn Resourcemanager
Hadoop Started

It can also be verified in a browser –
  • Namenode (hdfs) - http://localhost:50070 
  • Datanode - http://localhost:50075
  • All Applications (cluster) - http://localhost:8088 etc.
Hadoop In Browser
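The same check can be scripted instead of opening each page by hand. The sketch below, assuming Python is installed, simply asks each web UI whether it answers at all. The helper names is_up and report are made up for this example. The ports are the Hadoop 2.x defaults used in this post; Hadoop 3.x moved the NameNode and DataNode UIs to 9870 and 9864, so adjust the dictionary if you run a newer version.

```python
import http.client
import urllib.error
import urllib.request

# Default Hadoop 2.x web UI addresses listed in this post
ENDPOINTS = {
    "NameNode": "http://localhost:50070",
    "DataNode": "http://localhost:50075",
    "ResourceManager": "http://localhost:8088",
}

# Bypass any configured HTTP proxy; these checks target localhost only
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_up(url, timeout=2.0):
    """Return True if something answers HTTP at the given URL."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, or not speaking HTTP


def report(endpoints=ENDPOINTS):
    return {name: is_up(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in report().items():
        print(name, "is up" if ok else "is NOT reachable")
```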

Since the ‘start-all.cmd’ command has been deprecated, you can instead use the commands below, in this order -
  • “start-dfs.cmd” and 
  • “start-yarn.cmd”

STEP - 6: Validate Pig installation


After Hadoop has started successfully, change directory to “D:\Pig\pig-0.17.0\bin” and verify the installation by running "pig -version".

Pig validation

Pig Version

STEP - 7: Execute Pig (Modes)


Pig is now installed and ready to run. You can run Apache Pig in two modes, namely -

[1] Local Mode 
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.


[2] MapReduce Mode
In MapReduce mode, we load or process data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform the operation on the data in HDFS.


Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in three ways, namely – 


[1] Interactive Mode (Grunt shell)
You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).

[2] Batch Mode (Script)
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.

[3] Embedded Mode (UDF)
Apache Pig lets us define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.


STEP - 8: Invoke Grunt shell



Now you can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.

Local Mode
pig -x local

local mode

Local Mode executed

MapReduce
pig -x mapreduce

MapReduce Mode

MapReduce Mode executed

Either of these commands gives you the Grunt shell prompt as shown below.

Grunt Shell


Congratulations, Pig is installed!! 😊

STEP - 9: Some hands-on activities


[1] Create a text file ‘student.txt’ delimited by ‘,’

text file
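If you would rather generate the sample file than type it, a few lines of Python will do. This is only a sketch: the three rows are made-up sample data, and write_student_file is a name invented for this example. The columns follow the Sr_No, Student_Name, Student_Location layout used in the LOAD statement of this post.

```python
from pathlib import Path

# Hypothetical sample rows: Sr_No, Student_Name, Student_Location
ROWS = [
    (1, "Asha", "Pune"),
    (2, "Ravi", "Delhi"),
    (3, "Meena", "Chennai"),
]


def write_student_file(path, rows=ROWS, delimiter=","):
    """Write one delimited line per row and return the path written."""
    lines = [delimiter.join(str(value) for value in row) for row in rows]
    target = Path(path)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Writes into the current directory; move or rename as needed
    print(write_student_file("student.txt"))
```

Keep commas out of the values themselves, since each comma is treated as a field separator.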

[2] Load the file into a variable ‘input_file’
input_file = LOAD 'd:/student.txt' USING PigStorage(',') 
as (Sr_No:int, Student_Name:chararray, Student_Location:chararray);

Load query

Load text file


[3] Check result using DUMP operator (write result to the console)
DUMP input_file;


DUMP file

DUMP file result
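To get a feel for what the LOAD and DUMP steps do with each line, here is a rough emulation in plain Python. It is not how Pig is implemented, only a sketch of the visible behaviour as I understand it: each line is split on the delimiter, each field is cast according to the schema, a value that cannot be cast to int or a missing field becomes null, and DUMP prints one tuple per row. The function names are made up for this example.

```python
def cast_int(text):
    # A value that cannot be read as an int becomes null (None here)
    try:
        return int(text)
    except ValueError:
        return None


# Mirrors the schema in the LOAD statement of this post
SCHEMA = [("Sr_No", cast_int), ("Student_Name", str), ("Student_Location", str)]


def load_like_pigstorage(lines, delimiter=","):
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        # Pad short rows so missing trailing fields show up as null
        fields += [None] * (len(SCHEMA) - len(fields))
        rows.append(tuple(
            None if value is None else cast(value)
            for (_, cast), value in zip(SCHEMA, fields)
        ))
    return rows


def dump(rows):
    # One tuple per line, with nulls printed as empty fields
    return ["(" + ",".join("" if v is None else str(v) for v in row) + ")"
            for row in rows]


if __name__ == "__main__":
    sample = ["1,Asha,Pune\n", "x,Ravi,Delhi\n", "3,Meena\n"]
    for text in dump(load_like_pigstorage(sample)):
        print(text)
```

A row with an empty first field in the DUMP output is usually a hint that Sr_No was not a valid number in the text file.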


[4] We can also execute the Pig script in batch mode.


Step 1 - Write all the required Pig Latin statements and commands in a single file and save it with the .pig extension.

Pig script


Step 2 - You can execute it from the Grunt shell as well using the exec command as shown below.
exec /d:/student.pig


exec pig script

Executed the script
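A saved script can also be started without entering Grunt, by passing the file to the pig launcher together with the -x option. The sketch below wraps that call in Python, assuming Python is installed and the Pig bin folder is on Path. The function names are made up for this example, and only the two modes covered in this post are accepted.

```python
import shutil
import subprocess

# The two execution modes covered in this post
VALID_MODES = ("local", "mapreduce")


def build_pig_command(script, mode="local", launcher="pig"):
    """Return the argument list for: pig -x <mode> <script>."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of " + ", ".join(VALID_MODES))
    return [launcher, "-x", mode, script]


def run_script(script, mode="local"):
    # shutil.which honours PATHEXT on Windows, so it can resolve pig.cmd
    launcher = shutil.which("pig")
    if launcher is None:
        raise FileNotFoundError("pig launcher not found on Path")
    return subprocess.run(build_pig_command(script, mode, launcher)).returncode


if __name__ == "__main__":
    print(" ".join(build_pig_command("d:/student.pig")))
```

Calling run_script("d:/student.pig") would then run the script in local mode and return Pig's exit code.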

 Stay in touch for more posts.

7 comments:

  1. Short and to the point. The best part is you have given an example to check PIG is working as expected on Windows. Keep up the good work.

  2. This comment has been removed by the author.

  3. Hi..
    I'm new in Apache Pig.
    I'm using Windows 10, Java 13, Hadoop 3.1.2, Hbase 2.2.2, Hive 3.1.2, and Pig 0.17.0.
    I have followed your all steps.
    I have put tools.jar in jdk-13.0.1\lib and pig-0.17.0\bin but Pig still can't locate this tools.jar like this :

    Error: Could not find or load main class D:\Program_Files\Java\jdk-13.0.1\lib\tools.jar
    Caused by: java.lang.ClassNotFoundException: D:\Program_Files\Java\jdk-13.0.1\lib\tools.jar

    I don't know how I can fix this anytime. Your help is really appreciated. :)

  4. This comment has been removed by the author.

  5. sa.pig

    data1 = LOAD '/e:/stud.txt' USING PigStorage(',') as (sid:int,sname:chararray);sord = ORDER data1 BY snd DESC;
    slim = LIMIT sord 2;
    DUMP slim;

    grunt> exec e:/sam/sa.pig
    2021-04-12 19:30:20,843 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2021-04-12 19:30:20,867 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2021-04-12 19:30:21,177 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias slim
    Details at logfile: C:\hadoop\logs\pig_1618233824706.log

    what is the error?
    please help me
