
Institute of Formal and Applied Linguistics Wiki



Writing Text Files

Text files can be written easily by Spark.

Writing Text Files by Lines

To write an RDD to a text file, one element per line, use the saveAsTextFile method of the RDD:

lines.saveAsTextFile("output_path")

The output_path always specifies a directory in which several output files are created. If the output directory already exists, an error occurs.

The output files are named part-00000, part-00001, etc., and there is one for every partition of the RDD.
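Because the result is a directory rather than a single file, reading it back outside Spark means concatenating the part files. The following is a minimal sketch in plain Python, assuming the output directory lies on a local filesystem (not HDFS). The helper name read_part_files is hypothetical and not part of Spark.

```python
import glob
import os


def read_part_files(output_path):
    """Return all lines stored in the part-* files of a local output directory."""
    lines = []
    # Sorting the file names keeps the partition order (part-00000, part-00001, ...).
    for part in sorted(glob.glob(os.path.join(output_path, "part-*"))):
        with open(part, encoding="utf-8") as part_file:
            lines.extend(line.rstrip("\n") for line in part_file)
    return lines
```

Other files in the directory, such as the _SUCCESS marker, are skipped because they do not match the part-* pattern.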

Sorting Output

If you want the output to be sorted, use the sortBy method. It must be given a function that extracts a sorting key from each element. Elements are sorted in ascending order by default; to sort in descending order, pass the named parameter ascending with a false value.

Python version:

lines.sortBy(lambda line: line)  # Sort whole lines
lines.sortBy(lambda pair: pair[0])   # Sort pairs according to the first element
lines.sortBy(lambda line: line, ascending=False) # Sort in decreasing order

Scala version:

lines.sortBy(line => line)      // Sort whole lines
lines.sortBy(line => line._1)   // Sort pairs according to the first element
lines.sortBy(line => line, ascending=false) // Sort in decreasing order
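
The key-extracting function plays the same role as the key argument of Python's built-in sorted. As a rough local analogy, the following sketch uses a plain Python list instead of an RDD, so nothing in it runs on Spark. The sample data is made up for illustration.

```python
# Made-up sample data standing in for the elements of an RDD of pairs.
pairs = [("b", 1), ("a", 3), ("c", 2)]

# Analogous to sortBy(lambda pair: pair[0]) on an RDD.
by_first = sorted(pairs, key=lambda pair: pair[0])

# Analogous to sortBy(lambda pair: pair[1], ascending=False) on an RDD.
by_second_desc = sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(by_first)        # [('a', 3), ('b', 1), ('c', 2)]
print(by_second_desc)  # [('a', 3), ('c', 2), ('b', 1)]
```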

One Output File

In many cases, only one output file is desirable. In that case, the coalesce(1) method can be used, which merges all partitions into one.

lines.coalesce(1).saveAsTextFile("output_path")

If the output should also be sorted, call coalesce after sortBy, so that the sorting runs in parallel and the partitions are merged only just before the output is written.
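
Put together, this order of operations might look like the following sketch. The function name save_sorted_single_file and its parameters are hypothetical. The only assumption is that the first argument offers the sortBy, coalesce and saveAsTextFile methods described above, as a PySpark RDD does.

```python
def save_sorted_single_file(rdd, output_path, key=lambda element: element, ascending=True):
    """Sort an RDD in parallel and write it as a single part file."""
    # Sorting happens first, while the data is still spread over many partitions.
    sorted_rdd = rdd.sortBy(key, ascending=ascending)
    # Only afterwards are the partitions merged into one, just before writing.
    sorted_rdd.coalesce(1).saveAsTextFile(output_path)
```

A call such as save_sorted_single_file(lines, "output_path") would then leave a single part-00000 file in the output directory.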

Writing Text Files by Paragraphs

The saveAsTextFile method always writes one newline between elements. If you want to separate elements by two newlines, append a newline to every element manually:

Python version:

lines.map(lambda line: str(line) + "\n").saveAsTextFile("output_path")

Scala version:

lines.map(_.toString + "\n").saveAsTextFile("output_path")
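
With the extra newline, every element in the output is followed by an empty line. Below is a minimal plain-Python sketch of reading such output back as paragraphs, assuming that no element itself contains an empty line. The helper name split_paragraphs is hypothetical.

```python
def split_paragraphs(text):
    """Split text whose elements are each followed by an empty line."""
    # Every element ends with two newlines, so splitting on them leaves one
    # trailing empty string, which is filtered out.
    return [paragraph for paragraph in text.split("\n\n") if paragraph]


# Content expected for the two elements "first\nstill first" and "second":
# each element, the appended newline, and the newline written by saveAsTextFile.
sample = "first\nstill first\n\nsecond\n\n"
print(split_paragraphs(sample))  # ['first\nstill first', 'second']
```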
