====== Writing Text Files ======

Spark can easily write text files.

===== Writing Text Files by Lines =====

To write an ''RDD'' to a text file, with each element on its own line, use the ''saveAsTextFile'' method:

<code python>
lines.saveAsTextFile("output_path")
</code>

The ''output_path'' always specifies a directory in which several output files are created. If the output directory already exists, an error occurs. The output files, named ''part-00000'', ''part-00001'', etc., are created in the output directory, one for every partition of the ''RDD''.

==== Sorting Output ====

If you want the output to be sorted, use the ''sortBy'' method. It must be given a function that extracts a sorting key from each element. By default the elements are sorted in ascending order; to sort in descending order, pass the named parameter ''ascending'' with the value ''false''.

Python version:

<code python>
lines.sortBy(lambda line: line)                   # Sort whole lines
lines.sortBy(lambda pair: pair[0])                # Sort pairs according to the first element
lines.sortBy(lambda line: line, ascending=False)  # Sort in decreasing order
</code>

Scala version:

<code scala>
lines.sortBy(line => line)                      // Sort whole lines
lines.sortBy(pair => pair._1)                   // Sort pairs according to the first element
lines.sortBy(line => line, ascending = false)   // Sort in decreasing order
</code>

==== One Output File ====

In many cases, only one output file is desirable. In that case, the ''coalesce(1)'' method can be used; it merges all partitions into one.

<code python>
lines.coalesce(1).saveAsTextFile("output_path")
</code>

If sorting is also used, call ''coalesce'' **after** the sorting, so that the sorting is executed in parallel and the partitions are merged only just before the output is written.

===== Writing Text Files by Paragraphs =====

The ''saveAsTextFile'' method always writes a single newline after every element. If you want to separate the elements by two newlines (i.e., by blank lines), append a newline to every element manually:

<code python>
lines.map(lambda line: str(line) + "\n").saveAsTextFile("output_path")
</code>

<code scala>
lines.map(_.toString + "\n").saveAsTextFile("output_path")
</code>
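
As a minimal end-to-end sketch combining the methods above, the following Python program sorts an RDD of lines, merges the partitions, and writes the result as a single text file. The application name and the ''input_path''/''output_path'' values are illustrative placeholders, not part of the original examples:

<code python>
# Illustrative sketch: sort lines, merge partitions, write one output file.
from pyspark import SparkContext

sc = SparkContext(appName="write-sorted-text")   # hypothetical application name

lines = sc.textFile("input_path")                # hypothetical input path

(lines
    .sortBy(lambda line: line)                   # sort runs in parallel across partitions
    .coalesce(1)                                 # merge partitions only after sorting
    .saveAsTextFile("output_path"))              # creates output_path/part-00000

sc.stop()
</code>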