Using Scala
In order to use Spark in Scala, the environment has to be set up according to Using Spark in UFAL Environment (including the sbt).
Starting Interactive Shell
The interactive shell can be started using:
spark-shell
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
Usage Examples
Consider the following simple script computing 10 most frequent words of Czech Wikipedia:
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(_.split("\\s"))
   .map((_,1)).reduceByKey(_+_)
   .sortBy(_._2, ascending=false)
   .take(10))
- run the interactive shell inside an existing Spark cluster (i.e., inside
spark-qrsh
), or, if there is none, start a local Spark cluster with as many threads as there are cores:
spark-shell
- run the interactive shell with a local Spark cluster using one thread:
MASTER=local spark-shell
- start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the interactive shell:
spark-qrsh 10 1G spark-shell
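The transformation pipeline above can also be tried on a plain Scala collection without Spark, which is handy for checking the logic before running on a cluster. The following is a minimal sketch; the TopWords object and its input are hypothetical names chosen here for illustration:

```scala
// Plain-Scala equivalent of the Spark pipeline above (illustrative only):
// split on whitespace, count occurrences, sort by descending count, take the top n.
object TopWords {
  def topWords(text: String, n: Int): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty)
      .map((_, 1))                                   // like .map((_,1))
      .groupBy(_._1)
      .map { case (w, ps) => (w, ps.map(_._2).sum) } // like .reduceByKey(_+_)
      .toSeq
      .sortBy(-_._2)                                 // like .sortBy(_._2, ascending=false)
      .take(n)                                       // like .take(10)
}
```

In the Spark version the same steps run distributed over partitions; on a local collection groupBy plus a per-key sum plays the role of reduceByKey.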
Running Scala Spark Applications
A compiled Scala Spark program (JAR) can be started using:
spark-submit
As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER
specifies which Spark master to use (or whether to start a local one).
Compilation of Scala Spark Programs
If you do not know how to compile Scala programs, you can use the following directions:
- create a directory for your project
- copy
/net/projects/spark/sbt/spark-template.sbt
to your project directory and rename it to your project name (i.e.,
my-best-project.sbt
)
- replace
spark-template
with your project name in the first line (i.e.,
name := "my-best-project"
)
- run
sbt package
to create the JAR (note that the first run of
sbt
will take several minutes)
The resulting JAR can be found in the target/scala-2.10
subdirectory, named after your project.
Usage Examples
Consider the following simple word-count application word_count.scala
:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main {
  def main(args: Array[String]) {
    if (args.length < 2) sys.error("Usage: input output")
    val (input, output) = (args(0), args(1))

    val sc = new SparkContext()
    sc.textFile(input, 3*sc.defaultParallelism)
      .flatMap(_.split("\\s"))
      .map((_,1)).reduceByKey(_+_)
      .sortBy(_._2, ascending=false)
      .saveAsTextFile(output)
    sc.stop()
  }
}
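The core transformation (everything between textFile and saveAsTextFile) can be exercised locally on an ordinary Scala collection when debugging. A minimal sketch, with the hypothetical WordCountLocal name chosen here for illustration:

```scala
// Mirrors flatMap(_.split("\\s")).map((_,1)).reduceByKey(_+_) on local data,
// using a fold over an immutable Map instead of Spark's distributed reduce.
object WordCountLocal {
  def counts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
      .foldLeft(Map.empty[String, Int]) { (acc, word) =>
        acc.updated(word, acc.getOrElse(word, 0) + 1)
      }
}
```

Each input line plays the role of one element of the RDD returned by textFile, so the fold produces the same word counts the Spark job would write out.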
The sbt
project file word_count.sbt
:
name := "word_count"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
- compile the application
sbt package
- run the
word_count
application inside an existing Spark cluster (i.e., inside
spark-qsub
or
spark-qrsh
), or, if there is none, start a local Spark cluster with as many threads as there are cores:
spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
- run the
word_count
application with a local Spark cluster using one thread:
MASTER=local spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
- start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the
word_count
application:
spark-qsub 10 1G spark-submit --class Main target/scala-2.10/word_count_2.10-1.0.jar input output
Note that the
--class Main
argument is needed only because of Spark bug 4298 and will not be needed once the bug is fixed.