Using Scala
In order to use Spark in Scala, the environment has to be set up according to Using Spark in UFAL Environment (including the sbt).
Starting Interactive Shell
The interactive shell can be started using:
spark-shell
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
Usage Examples
Consider the following simple script computing 10 most frequent words of Czech Wikipedia:
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
- run the interactive shell inside an existing Spark cluster (i.e., inside
spark-qrsh
), or, if there is none, start a local Spark cluster with as many threads as there are cores:
spark-shell
- run the interactive shell with a local Spark cluster using one thread:
MASTER=local spark-shell
- start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the interactive shell:
spark-qrsh 10 1G spark-shell
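The transformation pipeline above can also be tried on a plain Scala collection without Spark, which is handy for checking the logic before running on a cluster. The following is a minimal sketch; the TopWords object and its input are hypothetical names chosen here for illustration:

```scala
// Plain-Scala equivalent of the Spark pipeline above (illustrative only):
// split on whitespace, count occurrences, sort by descending count, take the top n.
object TopWords {
  def topWords(text: String, n: Int): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty)
      .map((_, 1))                                   // like .map((_,1))
      .groupBy(_._1)
      .map { case (w, ps) => (w, ps.map(_._2).sum) } // like .reduceByKey(_+_)
      .toSeq
      .sortBy(-_._2)                                 // like .sortBy(_._2, ascending=false)
      .take(n)                                       // like .take(10)
}
```

In the Spark version the same steps run distributed over partitions; on a local collection groupBy plus a per-key sum plays the role of reduceByKey.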
Running Scala Spark Applications
A compiled Scala Spark program (JAR) can be started using:
spark-submit
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
Compilation of Scala Spark Programs
If you do not know how to compile Scala programs, you can use the following directions:
- create a directory for your project
- copy
/net/projects/spark/sbt/spark-template.sbt
to your project directory and rename it to your project name (i.e.,
my-best-project.sbt
)
- replace
spark-template
with your project name in the first line (i.e.,
name := "my-best-project"
)
- run
sbt package
to create the JAR (note that the first run of
sbt
will take several minutes)
The resulting JAR can be found in the target/scala-2.10
subdirectory, named after your project.
Usage Examples
Consider the following simple word-count application word_count.scala
:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]) {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    val sc = new SparkContext()
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .saveAsTextFile(output)
    sc.stop()
  }
}
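The core transformation (everything between textFile and saveAsTextFile) can be exercised locally on an ordinary Scala collection when debugging. A minimal sketch, with the hypothetical WordCountLocal name chosen here for illustration:

```scala
// Mirrors flatMap(_.split("\\s")).map((_,1)).reduceByKey(_+_) on local data,
// using a fold over an immutable Map instead of Spark's distributed reduce.
object WordCountLocal {
  def counts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
      .foldLeft(Map.empty[String, Int]) { (acc, word) =>
        acc.updated(word, acc.getOrElse(word, 0) + 1)
      }
}
```

Each input line plays the role of one element of the RDD returned by textFile, so the fold produces the same word counts the Spark job would write out.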
The sbt
project file word_count.sbt
:
name := "word_count"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
- compile the application
sbt package
- run the
word_count
application inside an existing Spark cluster (i.e., inside
spark-qsub
or
spark-qrsh
), or, if there is none, start a local Spark cluster with as many threads as there are cores:
spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
- run the
word_count
application with a local Spark cluster using one thread:
MASTER=local spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
- start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the
word_count
application:
spark-qsub 10 1G spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
Note that the
--class Main
argument is needed only because of Spark bug 4298 and will not be needed once the bug is fixed.