Spark: Framework for Distributed Computations (Under Construction)

Spark is a framework for distributed computations. It works natively in Python, Scala and Java, and can be used to a limited extent from Perl via pipes.

Apart from embarrassingly parallel computations, Spark is well suited to in-memory and/or iterative computations, which makes it usable even for machine learning and complex data processing. (Spark shares some underlying implementation with Hadoop, but it is quite different: the Hadoop framework does not offer in-memory computations and has only limited support for iterative computations.)
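As a minimal sketch of why in-memory caching matters for iterative work, the following Python function fits a one-parameter model y ≈ w·x by gradient descent. The names `gradient` and `fit`, the model and the learning rate are illustrative assumptions and not part of the UFAL setup; `points` is assumed to be a PySpark RDD of (x, y) pairs. Only the standard RDD methods `cache`, `count`, `map` and `sum` are used.

```python
def gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x


def fit(points, iterations=20, learning_rate=0.1):
    """Fit y ~ w * x by gradient descent over an RDD of (x, y) pairs.

    `points` is assumed to be a PySpark RDD; the names and the model are
    illustrative only.
    """
    # Keep the data in memory so that every iteration reuses it instead of
    # recomputing or re-reading it.
    points = points.cache()
    n = points.count()
    w = 0.0
    for _ in range(iterations):
        # One distributed pass over the cached data per iteration.
        total = points.map(lambda p: gradient(w, p[0], p[1])).sum()
        w -= learning_rate * total / n
    return w
```

Given a SparkContext `sc`, a call such as `fit(sc.parallelize([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))` should converge towards w = 2. Without the `cache()` call the result would be the same, but each iteration could recompute the input from scratch.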

The Spark framework can run locally using one thread, locally using multiple threads, or in a distributed fashion.
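These modes are selected through the Spark master URL. The sketch below builds such URLs in Python. The helper names `local_master`, `cluster_master` and `make_context` are hypothetical, and the cluster form assumes that the distributed mode is a Spark standalone cluster, which this page does not confirm. The URL forms `local`, `local[N]`, `local[*]` and `spark://HOST:PORT` are the standard Spark ones.

```python
def local_master(threads=1):
    """Master URL for running Spark on the local machine.

    threads=1 gives a single thread, an integer N > 1 gives N threads,
    and "*" uses as many threads as there are cores.
    """
    if threads == "*":
        return "local[*]"
    if not isinstance(threads, int) or threads < 1:
        raise ValueError("threads must be a positive integer or '*'")
    return "local" if threads == 1 else "local[%d]" % threads


def cluster_master(host, port=7077):
    """Master URL of a standalone Spark cluster (7077 is Spark's default port)."""
    return "spark://%s:%d" % (host, port)


def make_context(master, app_name="example"):
    """Create a SparkContext for the given master URL (requires PySpark)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose
    conf = SparkConf().setMaster(master).setAppName(app_name)
    return SparkContext(conf=conf)
```

For example, `make_context(local_master(4))` would start a local context with four threads, while `cluster_master("master-host")` yields `spark://master-host:7077` (here `master-host` is a placeholder, not a real machine name).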

Basic Information

The Python, Scala and Java bindings all work well in the UFAL environment. The examples shown here are in Python and Scala. We do not discuss the Java binding, because it has the same API as Scala (and if you are a Java fan or know Java substantially better than Scala, you will be able to use it by yourself).

Getting Started

Spark Introduction
Official Quick Start
Official Spark Programming Guide
Official Python API Reference/Scala API Reference

Using Spark in UFAL Environment

The latest supported version of Spark is available in /net/projects/spark. To use it, add

export PATH="/net/projects/spark/bin:/net/projects/spark/sge:$PATH"

to your .bashrc (or to .profile, and log in again; or to the config file of your favourite shell). If you want to use Scala and do not have sbt installed already (or do not know what it is), also add

export PATH="/net/projects/spark/sbt/bin:$PATH"
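To check that the PATH changes took effect after logging in again, something like the following Python sketch can be used. The helper `missing_from_path` is a hypothetical name introduced here; the directories are the ones given in the export lines above.

```python
import os

# Directories from the export lines above; the sbt one matters only for Scala.
SPARK_DIRS = ["/net/projects/spark/bin", "/net/projects/spark/sge"]
SBT_DIR = "/net/projects/spark/sbt/bin"


def missing_from_path(required, path=None):
    """Return the directories from `required` that are not listed in PATH.

    If `path` is None, the PATH of the current process is inspected.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # Ignore empty entries and trailing slashes when comparing.
    entries = {entry.rstrip("/") for entry in path.split(os.pathsep) if entry}
    return [d for d in required if d.rstrip("/") not in entries]


if __name__ == "__main__":
    for directory in missing_from_path(SPARK_DIRS + [SBT_DIR]):
        print("not in PATH:", directory)
```

The script prints nothing when all three directories are present in PATH.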

Recipes

Institute of Formal and Applied Linguistics Wiki
