Reading Text Files

Text files can be read easily by Spark.

Reading Text Files by Lines

To read text file(s) line by line, sc.textFile can be used. Its argument can be either a file or a directory; if a directory is given, all (non-hidden) files in it are read.

  lines = sc.textFile("/net/projects/spark-example-data/wiki-cs")

The elements of the resulting RDD are the individual lines of the input file(s).
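
To quickly check the contents, standard RDD actions such as count and take can be used (a minimal sketch, assuming a pyspark shell where sc is already defined):

  lines = sc.textFile("/net/projects/spark-example-data/wiki-cs")
  print(lines.count())        # total number of lines across all input files
  for line in lines.take(3):  # the first three lines of the input
      print(line)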

Number of Partitions

By default, the file is split into 32MB chunks, but into at least 2 partitions. A higher minimum number of partitions (instead of the default 2) can be specified as the second argument of textFile.

Note that the number of RDD partitions greatly affects parallelization possibilities – there are usually as many tasks as partitions. It is therefore important for an RDD to have at least as many partitions as there are available workers, i.e. sc.defaultParallelism. I would recommend using 3*sc.defaultParallelism, as in the following example:

  lines = sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
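
To see how many partitions were actually created, getNumPartitions can be called on the resulting RDD (a minimal sketch; the second argument of textFile is only a lower bound, so large inputs may be split into even more partitions):

  lines = sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
  print(sc.defaultParallelism)     # parallelism available to the application
  print(lines.getNumPartitions())  # actual partition count, at least the requested minimum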
