In order to use Spark in Scala, the environment has to be set up according to Using Spark in UFAL Environment (including the sbt).
An interactive shell can be started using:
spark-shell
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
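For instance, assuming the wrapper passes MASTER through to Spark as described there, a fixed number of local threads can be requested with Spark's standard local[N] master URL (the value 4 below is only an illustration):
MASTER=local[4] spark-shell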
Run spark-shell inside an existing Spark cluster (i.e., inside spark-sbatch or spark-srun), or start a local Spark cluster using as many threads as there are cores if there is none:
spark-shell
Run spark-shell with a local Spark cluster using a single thread:
MASTER=local spark-shell
Start a Spark cluster (10 workers, 2GB RAM each) and run spark-shell inside it:
spark-srun 10 2G spark-shell
Consider the following simple script computing the 10 most frequent words of the Czech Wikipedia:
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
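As a sketch of a possible refinement (not part of the original example), the same pipeline can lowercase the words and drop empty tokens before counting; everything else is unchanged:
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map(_.toLowerCase)   // normalize case before counting
   .filter(_.nonEmpty)   // splitting on whitespace can yield empty strings
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))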
A compiled Scala Spark program (JAR) can be started using:
spark-submit
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
If you do not know how to compile Scala programs, you can use the following directions:
Copy /net/projects/spark/sbt/spark-template.sbt to your project directory and rename it to your project name (e.g., my-best-project.sbt).
Replace spark-template by your project name in the first line (i.e., name := "my-best-project").
Run sbt package to create the JAR (note that the first run of sbt will take several minutes).
The resulting JAR can be found in the target/scala-2.12 subdirectory, named after your project.
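As an illustration only (assuming the template keeps version 1.0 and a 2.12.x Scala version, as in the word_count example below), a project named my-best-project would end up with the JAR:
target/scala-2.12/my-best-project_2.12-1.0.jar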
Consider the following simple word-count application word_count.scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]) {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    // The master URL and application name are supplied by spark-submit,
    // so the same JAR runs both locally and on a cluster.
    val sc = new SparkContext()

    // Split the input into words, count them and save the counts
    // to the output directory sorted by decreasing frequency.
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .saveAsTextFile(output)

    sc.stop()
  }
}
The sbt project file word_count.sbt:
name := "word_count"

version := "1.0"

scalaVersion := "2.12.20"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.3"
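For reference, %% appends the Scala binary version to the artifact name, so the dependency above resolves to spark-core_2.12 (the same _2.12 suffix appears in the JAR names below); it could equivalently be written as:
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.5.3"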
The application can then be compiled using:
sbt package
Run the word_count application inside an existing Spark cluster (i.e., inside spark-sbatch or spark-srun), or start a local Spark cluster using as many threads as there are cores if there is none:
spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
Run the word_count application with a local Spark cluster using one thread:
MASTER=local spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
Start a Spark cluster (10 workers, 2GB RAM each) and run the word_count application in it:
spark-sbatch 10 2G spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
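After a successful run, the output directory (outdir above) is created by saveAsTextFile and holds one part-NNNNN file per partition plus a _SUCCESS marker. Because the counts were sorted before saving, the most frequent words are at the beginning of the first partition; assuming outdir ends up on the shared filesystem, they can be inspected with, for example:
head outdir/part-00000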