
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


Next revision
Previous revision
spark:recipes:writing-text-files [2014/11/04 14:36]
straka created
spark:recipes:writing-text-files [2014/11/04 14:59] (current)
straka
Line 3: Line 3:
 Text files can be written easily by Spark. Text files can be written easily by Spark.
  
-===== Reading Text Files by Lines =====+===== Writing Text Files by Lines =====
  
 To write an ''RDD'' to a text file, one element per line, the ''saveAsTextFile'' method of the ''RDD'' can be used: To write an ''RDD'' to a text file, one element per line, the ''saveAsTextFile'' method of the ''RDD'' can be used:
Line 11: Line 11:
 The ''output_path'' always specifies a directory in which several output files are created. If the output directory already exists, an error occurs. The ''output_path'' always specifies a directory in which several output files are created. If the output directory already exists, an error occurs.
  
-Several output files, named ''part-00000'', ''part-00001'', etc., are created in the output directory, one for+Several output files, named ''part-00000'', ''part-00001'', etc., are created in the output directory, one for every partition of ''RDD''.
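Because the output is a directory of ''part-*'' files rather than a single file, reading it back outside Spark means visiting the parts in order. The following is a minimal sketch in plain Python, assuming the output is uncompressed and lies on a locally accessible filesystem; the helper name ''read_output_lines'' is hypothetical and not part of Spark:

```python
import glob
import os


def read_output_lines(output_path):
    """Yield every line stored in the part-* files of an output directory.

    Hypothetical helper, not part of Spark. Assumes uncompressed UTF-8 text.
    """
    # The zero-padded names (part-00000, part-00001, ...) sort
    # lexicographically in partition order.
    part_files = sorted(glob.glob(os.path.join(output_path, "part-*")))
    for part_file in part_files:
        with open(part_file, encoding="utf-8") as f:
            for line in f:
                # Drop only the line terminator, keep other whitespace.
                yield line.rstrip("\n")
```

Any other files in the directory that do not match ''part-*'' are ignored by the pattern.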
  
 ==== Sorting Output ==== ==== Sorting Output ====
Line 30: Line 30:
  
 ==== One Output File ==== ==== One Output File ====
 +
 +In many cases, only one output file is desirable. The ''coalesce(1)'' method can then be used to merge all partitions into one.
 +<file python>
 +lines.coalesce(1).saveAsTextFile("output_path")
 +</file>
 +
 +If the output is also sorted, call ''coalesce'' **after** the sorting, so that the sorting can run in parallel and the partitions are merged only just before the output is written.
 +
 +===== Writing Text Files by Paragraphs =====
 +
 +The ''saveAsTextFile'' method always writes one newline between elements. If you want to separate elements by two newlines, append a newline to every element manually:
 +<file python>
 +lines.map(lambda line: str(line) + "\n").saveAsTextFile("output_path")
 +</file>
 +<file scala>
 +lines.map(_.toString + "\n").saveAsTextFile("output_path")
 +</file>
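To see why appending a newline yields paragraph-separated output, the effect can be imitated without Spark. The sketch below is plain Python, not Spark code; the function ''render_text_file'' is a hypothetical stand-in that only mimics the rule stated above, namely that every element is written followed by one newline:

```python
def render_text_file(elements):
    """Mimic the written file contents: each element followed by one newline.

    Hypothetical stand-in for illustration only, not a Spark API.
    """
    return "".join(str(element) + "\n" for element in elements)


paragraphs = ["first paragraph", "second paragraph"]

# Plain output: one element per line.
plain = render_text_file(paragraphs)

# With a newline appended to every element, as in the map(...) calls above,
# every element is followed by an empty line.
separated = render_text_file(p + "\n" for p in paragraphs)

print(repr(plain))      # 'first paragraph\nsecond paragraph\n'
print(repr(separated))  # 'first paragraph\n\nsecond paragraph\n\n'
```

Note that under this rule the last element is also followed by an empty line, so a reader splitting the file on blank lines should expect a trailing empty piece.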
  
