====== Spark: Framework for Distributed Computations ======

[[http://spark.apache.org|{{:spark:spark-logo.png?150 }}]]
[[http://spark.apache.org|Spark]] is a framework for distributed computations. It natively supports Python, Scala and Java, and can be used to a limited extent from Perl via pipes.

Apart from embarrassingly parallel computations, Spark is well suited to //in-memory// and/or //iterative// computations, which makes it usable for machine learning and complex data processing. (Spark shares some underlying implementation with [[http://hadoop.apache.org/|Hadoop]], but the two are quite different: Hadoop does not offer in-memory computations and has only limited support for iterative computations.)

Spark can run locally using one thread, locally using multiple threads, or in a distributed fashion.

===== Basic Information =====

The Python, Scala and Java bindings all work well in the UFAL environment. The examples shown here are in Python and Scala. We do not discuss the Java binding, because it has the same API as Scala (if you are a Java fan or know Java substantially better than Scala, you will be able to use it by yourself).

Currently (Oct 2024), Spark 3.5.3 is available.

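
The three ways of running Spark mentioned above (one local thread, several local threads, distributed) are selected through the //master URL//. The following is a minimal sketch, not the UFAL setup itself: the helper names ''master_url'' and ''make_session'', the mode labels and the host name used in the usage note are hypothetical, and the ''spark://'' form assumes a standalone Spark master listening on the default port 7077.

```python
def master_url(mode, threads=None, host=None, port=7077):
    """Return a Spark master URL for one of the three run modes.

    The mode names ("single", "multi", "cluster") are labels invented
    for this sketch; only the returned URL strings are Spark syntax.
    """
    if mode == "single":
        # One local worker thread.
        return "local"
    if mode == "multi":
        # local[*] uses as many threads as there are logical cores,
        # local[K] uses exactly K threads.
        return "local[*]" if threads is None else f"local[{threads}]"
    if mode == "cluster":
        # Assumption: a standalone Spark master is reachable at host:port.
        if host is None:
            raise ValueError("cluster mode needs a host")
        return f"spark://{host}:{port}"
    raise ValueError(f"unknown mode: {mode!r}")


def make_session(url, app_name="example"):
    """Create a SparkSession for the given master URL (requires pyspark)."""
    # Imported lazily so that master_url() works without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(url).appName(app_name).getOrCreate()
```

For instance, ''make_session(master_url("multi", threads=4))'' would start a local session with four threads, while ''master_url("cluster", host="example-host")'' yields a URL for a standalone master on the hypothetical machine ''example-host''.
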
===== Getting Started =====

  * [[spark:Spark Introduction]]
  * Official [[http://spark.apache.org/docs/latest/quick-start.html|Quick Start]]
  * Official [[http://spark.apache.org/docs/latest/programming-guide.html|Spark Programming Guide]]
  * Official [[http://spark.apache.org/docs/latest/mllib-guide.html|MLlib Programming Guide]] (Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives)
  * Official [[http://spark.apache.org/docs/latest/api/python/index.html|Python API Reference]]/[[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html|Scala API Reference]]

===== Using Spark in UFAL Environment =====

The latest supported version of Spark is available in ''/net/projects/spark''. To use it, add
  export PATH="/net/projects/spark/bin:/net/projects/spark/slurm:$PATH"
to your ''.bashrc'' (or to your favourite shell config file). If you want to use Scala and do not have ''sbt'' already installed (or you do not know what ''sbt'' is), also add
  export PATH="/net/projects/spark/sbt/bin:$PATH"

  * [[spark:Running Spark on Single Machine or on Cluster]]
  * [[spark:Using Python]]
  * [[spark:Using Scala]]

===== Recipes =====

  * [[spark:recipes:Reading Text Files]]
  * [[spark:recipes:Writing Text Files]]
  * [[spark:recipes:Storing Data in Binary Format]]
  * [[spark:recipes:Using Perl via Pipes]]