====== Writing Text Files ======
Spark can easily write the contents of an ''RDD'' to text files.
===== Writing Text Files by Lines =====
To write an ''RDD'' to a text file, one element per line, use the ''saveAsTextFile'' method of the ''RDD'':
lines.saveAsTextFile("output_path")
The ''output_path'' always specifies a directory, not a single file. If the output directory already exists, an error occurs.
The output directory contains files named ''part-00000'', ''part-00001'', etc., one for every partition of the ''RDD''.
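Because the result is a directory of part files rather than one file, reading it back outside Spark means concatenating the parts in name order. The following is a minimal plain-Python sketch, assuming the output was written uncompressed to a local filesystem path; the helper name ''read_part_files'' is hypothetical and not part of Spark:

```python
import glob
import os


def read_part_files(output_dir):
    """Return all lines from the part-* files of a local output directory.

    Hypothetical helper for illustration; assumes uncompressed local files.
    """
    lines = []
    # Sorting the names yields part-00000, part-00001, ... in partition order.
    # Other files in the directory (for example a _SUCCESS marker) do not
    # match the pattern and are skipped.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as part:
            lines.extend(line.rstrip("\n") for line in part)
    return lines
```

For example, ''read_part_files("output_path")'' returns the elements as a list of strings, in the order of the partitions.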
==== Sorting Output ====
If you want the output to be sorted, use the ''sortBy'' method. It must be given a function that extracts a sorting key from each element. The elements are sorted in ascending order by default; to sort in descending order, pass the named parameter ''ascending'' with a false value (''False'' in Python, ''false'' in Scala).
Python version:
lines.sortBy(lambda line: line) # Sort whole lines
lines.sortBy(lambda kv: kv[0]) # Sort pairs according to the first element (tuple parameters such as "lambda (k, v)" are not valid in Python 3)
lines.sortBy(lambda line: line, ascending=False) # Sort in decreasing order
Scala version:
lines.sortBy(line => line) // Sort whole lines
lines.sortBy(line => line._1) // Sort pairs according to the first element
lines.sortBy(line => line, ascending = false) // Sort in decreasing order
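The key function plays the same role as the ''key'' argument of Python's built-in ''sorted''. The following local sketch uses plain Python without Spark, and the sample data is made up; it only illustrates which ordering a given key function produces:

```python
# Made-up sample data: pairs of (word, count).
pairs = [("b", 1), ("a", 3), ("c", 2)]

# Key is the first element of each pair, ascending order.
by_first = sorted(pairs, key=lambda kv: kv[0])

# Key is the second element, descending order
# (reverse=True here corresponds to ascending=False in sortBy).
by_second_desc = sorted(pairs, key=lambda kv: kv[1], reverse=True)

print(by_first)        # [('a', 3), ('b', 1), ('c', 2)]
print(by_second_desc)  # [('a', 3), ('c', 2), ('b', 1)]
```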
==== One Output File ====
In many cases, a single output file is desirable. The ''coalesce(1)'' method merges all partitions into one, so that only one output file is written:
lines.coalesce(1).saveAsTextFile("output_path")
If sorting is also used, call ''coalesce'' **after** the sorting, so that the sort still runs in parallel and the partitions are merged only just before the output is written.
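Sorting and single-file output can be combined as in the following Python sketch. The helper name ''save_sorted_single_file'' is hypothetical; it assumes only that ''rdd'' is an existing ''RDD'' and calls the ''sortBy'', ''coalesce'' and ''saveAsTextFile'' methods in that order:

```python
def save_sorted_single_file(rdd, output_path, ascending=True):
    """Sort the elements of rdd and write them as a single part file.

    Hypothetical helper for illustration; rdd is assumed to be a Spark RDD.
    """
    # The sort runs in parallel over the original partitions.
    sorted_rdd = rdd.sortBy(lambda element: element, ascending=ascending)
    # Merge the partitions only after sorting, just before writing.
    sorted_rdd.coalesce(1).saveAsTextFile(output_path)
```

A call such as ''save_sorted_single_file(lines, "output_path")'' should then leave a single ''part-00000'' file in the output directory.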
===== Writing Text Files by Paragraphs =====
The ''saveAsTextFile'' method always writes exactly one newline after every element. If you want to separate elements by an empty line (two newlines), append a newline to every element manually.
Python version:
lines.map(lambda line: str(line) + "\n").saveAsTextFile("output_path")
Scala version:
lines.map(_.toString + "\n").saveAsTextFile("output_path")
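The effect of the extra newline can be emulated locally. In the plain-Python sketch below, ''render_output'' is a hypothetical stand-in that mimics ''saveAsTextFile'' by writing each element followed by one newline; it is not Spark code, and the sample data is made up:

```python
def render_output(elements):
    """Mimic saveAsTextFile: every element is followed by one newline."""
    return "".join(str(element) + "\n" for element in elements)


paragraphs = ["first paragraph", "second paragraph"]

# Same transformation as the map() call above: append a newline manually.
text = render_output(p + "\n" for p in paragraphs)

print(repr(text))  # 'first paragraph\n\nsecond paragraph\n\n'
```

The manually appended newline becomes an empty line after each element, so the elements end up separated like paragraphs.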