====== Using Scala ======
In order to use Spark in Scala, the environment has to be set up according to [[:spark#using-spark-in-ufal-environment|Using Spark in UFAL Environment]] (including ''sbt'').
===== Starting Interactive Shell =====
The interactive shell can be started using:
<code>
spark-shell
</code>
As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).
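Any standard Spark master URL should work as the value; for illustration (the hostname and port are placeholders, assuming the variable is passed through as a regular Spark master URL):
<code>
MASTER=local[4] spark-shell                  # local master with 4 threads
MASTER=spark://hostname:7077 spark-shell     # connect to an existing standalone cluster
</code>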
==== Usage Examples ====
Consider the following simple script computing the 10 most frequent words of the Czech Wikipedia:
<code scala>
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
</code>
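The expression evaluates to an ''Array[(String, Int)]'', which ''spark-shell'' echoes automatically. To print one word per line instead, the result can be bound to a value first (the name ''top10'' is just illustrative):
<code scala>
val top10 = (sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
top10.foreach { case (word, count) => println(s"$word\t$count") }
</code>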
  * run the interactive shell inside an existing Spark cluster (i.e., inside ''spark-srun''); if there is none, a local Spark cluster is started, using as many threads as there are cores:
<code>
spark-shell
</code>
  * run the interactive shell with a local Spark cluster using one thread:
<code>
MASTER=local spark-shell
</code>
  * start a Spark cluster (10 machines, 2GB RAM each) via Slurm and run the interactive shell:
<code>
spark-srun 10 2G spark-shell
</code>
===== Running Scala Spark Applications =====
A compiled Scala Spark program (JAR) can be started using:
<code>
spark-submit
</code>
As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).
==== Compilation of Scala Spark Programs ====
If you do not know how to compile Scala programs, you can use the following directions:
  - create a directory for your project
  - copy ''/net/projects/spark/sbt/spark-template.sbt'' to your project directory and rename it to your project name (e.g., ''my-best-project.sbt'')
  - replace ''spark-template'' with your project name on the first line (i.e., ''name := "my-best-project"'')
  - run ''sbt package'' to create the JAR (note that the first run of ''sbt'' takes several minutes)
The resulting JAR can be found in the ''target/scala-2.12'' subdirectory, named after your project (e.g., ''target/scala-2.12/my-best-project_2.12-1.0.jar'').
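For the ''word_count'' example below, the project directory then looks as follows (''sbt'' by default also compiles ''.scala'' files placed directly in the project root, so no ''src/main/scala'' layout is needed for such a small project):
<code>
word_count/
├── word_count.sbt
└── word_count.scala
</code>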
==== Usage Examples ====
Consider the following simple word-count application ''word_count.scala'':
<code scala>
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    // the master URL and application name are taken from the
    // configuration provided by spark-submit
    val sc = new SparkContext()

    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))          // split lines into words
      .map((_,1)).reduceByKey(_+_)      // count occurrences of each word
      .sortBy(_._2, ascending=false)    // sort by count, most frequent first
      .saveAsTextFile(output)

    sc.stop()
  }
}
</code>
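During debugging it may be convenient to run the application without ''spark-submit''; in that case the master has to be set explicitly in code. A minimal sketch (the master URL and application name are just illustrative):
<code scala>
import org.apache.spark.{SparkConf, SparkContext}

// instead of `val sc = new SparkContext()` above:
val conf = new SparkConf().setMaster("local[2]").setAppName("word_count")
val sc = new SparkContext(conf)
</code>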
The ''sbt'' project file ''word_count.sbt'':
<code scala>
name := "word_count"
version := "1.0"
scalaVersion := "2.12.20"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.3"
</code>
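The ''%%'' operator appends the Scala binary version to the artifact name, so the dependency resolves to ''spark-core_2.12''; the same convention produces the ''_2.12'' suffix in the JAR name used below. Since ''spark-submit'' supplies the Spark classes at runtime anyway, the dependency can optionally be marked as provided:
<code scala>
// used for compilation only, not packaged or required at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.3" % "provided"
</code>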
  * compile the application:
<code>
sbt package
</code>
  * run the ''word_count'' application inside an existing Spark cluster (i.e., inside ''spark-sbatch'' or ''spark-srun''); if there is none, a local Spark cluster is started, using as many threads as there are cores:
<code>
spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
</code>
  * run the ''word_count'' application with a local Spark cluster using one thread:
<code>
MASTER=local spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
</code>
  * start a Spark cluster (10 machines, 2GB RAM each) via Slurm and run the ''word_count'' application:
<code>
spark-sbatch 10 2G spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir
</code>
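Note that ''saveAsTextFile'' produces a directory (here ''outdir'') containing one ''part-*'' file per partition, not a single text file.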