In order to use Spark in Scala, the environment has to be set up according to Using Spark in UFAL Environment (including sbt).
The interactive shell can be started using:
spark-shell
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER specifies which Spark master to use (or whether to start a local one).
Consider the following simple script computing the 10 most frequent words of the Czech Wikipedia:
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
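If you also want to ignore case and drop empty tokens, the pipeline can be adjusted as follows (a sketch only; the filtering and lowercasing steps are an assumption, not part of the original example):

(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .filter(_.nonEmpty)                        // skip empty tokens produced by the split
   .map(w => (w.toLowerCase, 1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))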
To run spark-shell inside an existing Spark cluster (i.e., inside spark-sbatch or spark-srun), or to start a local Spark cluster using as many threads as there are cores if there is none:
spark-shell
To run spark-shell with a local Spark cluster using one thread:
MASTER=local spark-shell
To start a Spark cluster with 10 workers, 2GB RAM each, and run spark-shell inside it:
spark-srun 10 2G spark-shell
A compiled Scala Spark program (JAR) can be started using:
spark-submit
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER specifies which Spark master to use (or whether to start a local one).
If you do not know how to compile Scala programs, you can use the following directions:
1. Copy /net/projects/spark/sbt/spark-template.sbt to your project directory and rename it to your project name (i.e., my-best-project.sbt); see the sketch after these steps.
2. Replace spark-template by your project name in the first line (i.e., name := "my-best-project").
3. Run sbt package to create the JAR (note that the first run of sbt will take several minutes).
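For illustration, a renamed my-best-project.sbt might look as follows (a sketch only; the actual versions come from the template itself, the values below are merely assumed to match the word_count.sbt example later in this section):

name := "my-best-project"

version := "1.0"

scalaVersion := "2.12.20"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.3"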
The resulting JAR can be found in the target/scala-2.12 subdirectory (named after the Scala version from the sbt file) and is named after your project.
Consider the following simple word-count application word_count.scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]) {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    val sc = new SparkContext()
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .saveAsTextFile(output)
    sc.stop()
  }
}
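As a variation, the standalone application can report the 10 most frequent words directly instead of saving all counts (a sketch combining the two examples above; the object name TopWords and the single-argument usage are assumptions, not part of the original):

import org.apache.spark.SparkContext

object TopWords {
  def main(args: Array[String]) {
    if (args.length < 1) sys.error("Usage: input")
    val input = args(0)

    val sc = new SparkContext()
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .take(10)                     // take returns a local Array, so println runs on the driver
      .foreach(println)
    sc.stop()
  }
}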
The sbt project file word_count.sbt:
name := "word_count" version := "1.0" scalaVersion := "2.12.20" libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.3"
The application can then be compiled using:
sbt package
To run the word_count application inside an existing Spark cluster (i.e., inside spark-sbatch or spark-srun), or to start a local Spark cluster using as many threads as there are cores if there is none:
spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
To run the word_count application with a local Spark cluster using one thread:
MASTER=local spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
To start a Spark cluster with 10 workers, 2GB RAM each, and run the word_count application inside it:
spark-sbatch 10 2G spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
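The output directory then contains the word counts as text part files; one quick way to peek at them is from spark-shell (a sketch, assuming the outdir path used above):

sc.textFile("outdir").take(10).foreach(println)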