Writing Text Files
Text files can be written easily with Spark.
Writing Text Files by Lines
To write an RDD to a text file, with each element on its own line, the saveAsTextFile method can be used:
lines.saveAsTextFile("output_path")
The output_path always specifies a directory in which the output files are created. If the output directory already exists, an error occurs. The output files, named part-00000, part-00001, etc., are created in that directory, one for every partition of the RDD.
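For illustration, a minimal self-contained sketch (the application name, the partition count, and the output directory name are placeholder assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="WriteTextFiles")   # hypothetical application name

# An RDD with three partitions; the partition count determines
# how many part-0000N files the output directory will contain.
lines = sc.parallelize(["first line", "second line", "third line"], 3)

# Creates the directory "output_path" containing part-00000, part-00001
# and part-00002; fails if the directory already exists.
lines.saveAsTextFile("output_path")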
Sorting Output
If you want the output to be sorted, use the sortBy method. It must be given a function that extracts a sorting key from each element. The elements are sorted in ascending order by default; to sort in descending order, pass the named parameter ascending with the value false.
Python version:
lines.sortBy(lambda line: line)                    # Sort whole lines
lines.sortBy(lambda pair: pair[0])                 # Sort pairs according to the first element
lines.sortBy(lambda line: line, ascending=False)   # Sort in decreasing order
Scala version:
lines.sortBy(line => line)                         // Sort whole lines
lines.sortBy(pair => pair._1)                      // Sort pairs according to the first element
lines.sortBy(line => line, ascending = false)      // Sort in decreasing order
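Putting the pieces together, a small Python sketch of sorting key-value pairs and writing the result (the pair data and the output directory name are made up for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="SortedOutput")   # hypothetical application name

pairs = sc.parallelize([("banana", 2), ("apple", 5), ("cherry", 1)])

# Sort by the first element of each pair, in decreasing order.
sorted_pairs = pairs.sortBy(lambda pair: pair[0], ascending=False)

# sortBy range-partitions the data, so reading the part files in
# order (part-00000, part-00001, ...) yields the globally sorted output.
sorted_pairs.saveAsTextFile("sorted_output_path")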
One Output File
In many cases, only one output file is desirable. In that case, the coalesce(1) method can be used, which merges all partitions into one:
lines.coalesce(1).saveAsTextFile("output_path")
If sorting is also used, call coalesce after the sorting, so that the sorting runs in parallel and the partitions are merged only just before the output is written.
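A sketch of this recommended ordering, sorting in parallel first and merging partitions only for the final write (the input data and the directory name are illustrative assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="SingleSortedFile")   # hypothetical application name

lines = sc.parallelize(["pear", "apple", "orange", "banana"], 4)

# Sorting runs in parallel over all four partitions; coalesce(1) then
# merges the already-sorted partitions, so the output directory
# contains a single part-00000 file with the lines in sorted order.
lines.sortBy(lambda line: line).coalesce(1).saveAsTextFile("single_file_output")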
Writing Text Files by Paragraphs
The saveAsTextFile method always terminates each element with a single newline. If you want to separate elements by two newlines (i.e., by a blank line), append an extra newline to every element manually:
Python version:
lines.map(lambda line: str(line) + "\n").saveAsTextFile("output_path")
Scala version:
lines.map(_.toString + "\n").saveAsTextFile("output_path")