====== Using Scala ======

In order to use Spark in Scala, the environment has to be set up according to [[:spark#using-spark-in-ufal-environment|Using Spark in UFAL Environment]] (including ''sbt'').

===== Starting Interactive Shell =====

An interactive shell can be started using:
<code>spark-shell</code>

As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).

==== Usage Examples ====

Consider the following simple script computing the 10 most frequent words of the Czech Wikipedia:
<code scala>
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
</code>

  * run the interactive shell inside an existing Spark cluster (i.e., inside ''spark-srun''), or start a local Spark cluster with as many threads as there are cores if there is none:
<code>spark-shell</code>
  * run the interactive shell with a local Spark cluster using one thread:
<code>MASTER=local spark-shell</code>
  * start a Spark cluster (10 machines, 2GB RAM each) via Slurm and run the interactive shell in it:
<code>spark-srun 10 2G spark-shell</code>

===== Running Scala Spark Applications =====

A compiled Scala Spark program (JAR) can be started using:
<code>spark-submit</code>

As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).

==== Compilation of Scala Spark Programs ====

If you do not know how to compile Scala programs, you can use the following directions:
  - create a directory for your project
  - copy ''/net/projects/spark/sbt/spark-template.sbt'' to your project directory and rename it to your project name (e.g., ''my-best-project.sbt'')
  - replace ''spark-template'' by your project name in the first line (e.g., ''name := "my-best-project"''); see the sketch below for what the result could look like
  - run ''sbt package'' to create the JAR (note that the first run of ''sbt'' will take several minutes)

The resulting JAR can be found in the ''target/scala-2.12'' subdirectory, named after your project.
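For illustration, the renamed project file could look as follows. This is only a sketch: it assumes the template has the same shape and versions as the ''word_count.sbt'' file shown in the Usage Examples below, so the actual template may differ.
<code scala>
// Hypothetical my-best-project.sbt; assumes the template mirrors the
// word_count.sbt example below. The actual versions may differ.
name := "my-best-project"

version := "1.0"

scalaVersion := "2.12.17"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.1"
</code>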
==== Usage Examples ====

Consider the following simple word-count application ''word_count.scala'':
<code scala>
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    val sc = new SparkContext()

    // Split lines into words, count the occurrences of each word,
    // sort by decreasing count and save the result.
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .saveAsTextFile(output)

    sc.stop()
  }
}
</code>

The ''sbt'' project file ''word_count.sbt'':
<code scala>
name := "word_count"

version := "1.0"

scalaVersion := "2.12.17"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.1"
</code>

  * compile the application:
<code>sbt package</code>
  * run the ''word_count'' application inside an existing Spark cluster (i.e., inside ''spark-sbatch'' or ''spark-srun''), or start a local Spark cluster with as many threads as there are cores if there is none:
<code>spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir</code>
  * run the ''word_count'' application with a local Spark cluster using one thread:
<code>MASTER=local spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir</code>
  * start a Spark cluster (10 machines, 2GB RAM each) via Slurm and run the ''word_count'' application in it:
<code>spark-sbatch 10 2G spark-submit target/scala-2.12/word_count_2.12-1.0.jar /net/projects/spark-example-data/wiki-cs outdir</code>
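To quickly inspect the result, the output directory can be read back in an interactive shell. A minimal sketch, assuming the ''outdir'' produced by the commands above:
<code scala>
// Run inside spark-shell; assumes "outdir" was written by the
// word_count application above.
sc.textFile("outdir")
  .take(10)
  .foreach(println)
</code>
Each printed line is the string representation of one ''(word, count)'' pair, because ''saveAsTextFile'' writes one element per line.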