Introduction
In my previous posts we went through Hadoop, HBase, Hive and Pig installation, as well as some related hands-on activities. Now it is time to connect with Spark, backed by the Scala programming language.
Apache Spark is a lightning-fast cluster-computing framework designed for fast computation. It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
Spark is an open-source cluster-computing framework with APIs in Scala, Java, Python and R. Choosing a programming language for Apache Spark is a subjective matter, because the reasons why a particular data scientist or data analyst prefers Python, Java or Scala might not apply to others.
Scala is a high-level programming language that combines object-oriented and functional programming. It is highly scalable, which is why it is called Scala. Scala is one of the newer languages that runs on the Java Virtual Machine, and it has many similarities to, as well as differences from, the Java programming language.
I will go through the Spark (2.3.1) installation backed by Scala (2.12.7), though you can use any stable version. I am assuming Java and Hadoop are pre-installed on the workstation; if not, go to my previous post on Hadoop installation on Windows 10.
Validate the versions –
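Both can be checked from the command prompt, assuming Java and Hadoop are already on the PATH –
D:\> java -version
D:\> hadoop version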
Mine are installed under the D: drive; yours might be in a different location –
[1] Scala (2.12.7) –
https://downloads.lightbend.com/scala/2.12.7/scala-2.12.7.zip
[2] Spark (2.3.1) –
http://spark.apache.org/downloads.html
https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
STEP - 1
Extract the file scala-2.12.7.zip and place it under "D:\Scala"; you can use any preferred location –
[1] You will get a folder after extraction –
[2] Copy the leaf folder “scala-2.12.7”, move it to the root folder "D:\Scala", and remove all other files and folders –
STEP - 2
Similar to Scala, extract the file spark-2.3.1-bin-hadoop2.7.tgz and place it under "D:\Spark"; you can use any preferred location –
[1] You will get a tar file after extraction –
[2] Go inside the spark-2.3.1-bin-hadoop2.7 folder and extract it again –
STEP - 3
Set the path for the following Environment variables (User Variables) on Windows 10 –
- SCALA_HOME - D:\Scala\scala-2.12.7
- SPARK_HOME - D:\Spark\spark-2.3.1-bin-hadoop2.7
Next, set the System variables by adding the Scala and Spark bin directory paths –
Variable: Path
Value:
- D:\Scala\scala-2.12.7\bin
- D:\Spark\spark-2.3.1-bin-hadoop2.7\bin
STEP - 5
Here we need to start Scala first -
Open a command prompt (administrator mode is advisable) and verify the Scala version –
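The version check from the command prompt looks like this –
D:\> scala -version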
Now we are ready to start the Scala shell, in which we can type programs and see the output in the shell itself. Type 'scala' –
Congratulations, Scala is installed!! 😊
STEP-6
Some hands-on activities –
[1] Simple print function: type the following text and press the Enter key –
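The exact text is up to you; a sample from the shell would be –
scala> println("Hello from the Scala shell")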
[2] Some simple addition of inputs –
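For example, with two arbitrary numbers the shell echoes the result back as something like –
scala> 10 + 25
res0: Int = 35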
[3] The "Hello, world!" Program
As a first example, we use the standard Hello world program to demonstrate the use of the Scala tools without knowing too much about the language.
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}
The structure of this program will look familiar to Java programmers: it consists of the method main, which prints a friendly greeting to the standard output.
[4] You can also run Scala as a Windows batch script. Create a file with the following content; the batch header calls 'scala' on the file itself, passing along any arguments –
::#!
@echo off
call scala %0 %*
goto :eof
::!#
println("Hello, Welcome to Scala Script.....!!!!!")
object Message {
  def main(args: Array[String]): Unit = {
    println("Hello, " + args(0) + ".....!!!!!")
    println("Welcome to Scala Script.....!!!!!")
    println("You are " + args(1) + " years old.")
  }
}
Message.main(args)
Save the file as – scalaScript.bat; I saved it on the D: drive.
Now execute it –
D:\> scalaScript.bat Mind 25
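With the two sample arguments above, this should produce output along these lines (reconstructed from the script, not a captured run) –
Hello, Welcome to Scala Script.....!!!!!
Hello, Mind.....!!!!!
Welcome to Scala Script.....!!!!!
You are 25 years old.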
[5] Alternatively, you can write a Scala program in a source file, compile it with 'scalac', and run it on the JVM. Open Notepad and add the following code to it.
object HelloWorld {
  /* This is my first Scala program.
   * It will print 'Hello, world!' as the output.
   */
  def main(args: Array[String]): Unit = {
    println("Hello, world!") // prints Hello, world!
  }
}
Save the file as − HelloWorld.scala on the D: drive; you can use a preferred location.
[6] The ‘scalac’ command is used to compile the Scala program and it will generate a few class files in the current directory.
[7] One of the generated class files will be called HelloWorld.class. It contains bytecode that runs on the Java Virtual Machine (JVM) via the ‘scala’ command.
Use the following command to compile and execute your Scala program.
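Assuming HelloWorld.scala was saved on the D: drive as above –
D:\> scalac HelloWorld.scala
D:\> scala HelloWorld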
STEP-7
The Spark path and variables are already set, because we are using the prebuilt Spark package. Now we need to start Spark -
Open a command prompt, change directory to “D:\Spark\spark-2.3.1-bin-hadoop2.7\bin" and type "spark-shell" to start Spark.
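The sequence from the command prompt looks like this –
D:\> cd D:\Spark\spark-2.3.1-bin-hadoop2.7\bin
D:\Spark\spark-2.3.1-bin-hadoop2.7\bin> spark-shell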
Spark Shell
Spark provides an interactive shell − a powerful tool to analyze data interactively. It is available in either Scala or Python. Here I am using Scala.
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.
STEP-8
Some hands-on activity (a word count example): create a simple RDD –
[1] Create a text file ‘sparkSample.txt’ and save it on the D: drive; you can use a preferred location –
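Any text will do; as an illustration, the file might contain lines such as –
people are not as beautiful as they look
as they walk or as they talk
they are only as beautiful as they love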
[2] Create a simple RDD from the text file created above, using the following command –
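A sketch of the command, assuming the file sits at the root of the D: drive and Spark reads from the local file system (the variable name 'inputfile' is reused below) –
scala> val inputfile = sc.textFile("D:/sparkSample.txt")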
[3] Execute the word count transformation
Now it is time to count the words -
- Our aim is to count the words in the file. Create a flat map that splits each line into words (flatMap(line => line.split(" "))).
- Next, map each word to a key with the value 1 (<key, value> = <word, 1>) using the map function (map(word => (word, 1))).
- Finally, reduce those keys by adding the values of identical keys (reduceByKey(_ + _)).
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_);
[4] Current RDD
While working with an RDD, if you want to know about the current RDD, use the following command. It will show you a description of the current RDD and its dependencies for debugging.
scala> counts.toDebugString
[5] The next step is to store the output in a text file and exit the spark shell.
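A sketch of those two commands; the output directory name here is just an example –
scala> counts.saveAsTextFile("D:/wordCountOutput")
scala> :quit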
You can validate the output files and their contents –