Differences

This shows you the differences between two versions of the page.

--- spark:spark-introduction [2014/11/03 19:02]
straka
+++ spark:spark-introduction [2014/11/03 20:37]
straka
@@ Line 70: / Line 70: @@
     return min((np.sum((point - centers[i]) ** 2), i) for i in range(len(centers)))[1]
-lines = sc.textFile("/net/projects/hadoop/examples/inputs/points-small/points.txt")
+lines = sc.textFile("/net/projects/hadoop/examples/inputs/points-small/points.txt", sc.defaultParallelism)
 data = lines.map(lambda line: np.array([float(x) for x in line.split()])).cache()
@@ Line 76: / Line 76: @@
 epsilon = 1e-3
-centers = data.takeSample(False, K   )    # Sample K random points
+centers = data.takeSample(False, K)       # Sample K random points
 for i in range(5):                        # Perform 5 iterations
     old_centers = sc.broadcast(centers)
@@ Line 89: / Line 89: @@
                .map(lambda (index, (sum, count)): sum / count)
                .collect())
     # If the change in center positions is less than epsilon, stop.
     centers_change = sum(np.sqrt(np.sum((a - b)**2)) for (a, b) in zip(centers, old_centers.value))
@@ Line 101: / Line 100: @@
 Note that explicit broadcasting used for ''centers'' object is not strictly needed -- if we used ''old_centers = centers'', the example would work too, but it would send a copy of ''old_centers'' to //every distributed task//, instead of once to every machine.
+For illustration, Scala version of the example follows. It works exactly as the Python version and uses ''breeze.linalg.Vector'' providing linear algebraic operations.
+<file scala>
+import breeze.linalg.Vector
+type Vector = breeze.linalg.Vector[Double]
+type Vectors = Array[Vector]
+def closestPoint(point : Vector, centers : Vectors) : Double =
+  centers.map(center => (center-point).norm(2)).zipWithIndex.min._2
+val lines = sc.textFile("/net/projects/hadoop/examples/inputs/points-small/points.txt", sc.defaultParallelism)
+val data = lines.map(line => Vector(line.split("\\s+").map(_.toDouble))).cache()
+val K = 50
+val epsilon = 1e-3
+var i = 0
+var centers_change = Double.PositiveInfinity
+var centers = data.takeSample(false, K)
+while (i < 10 && centers_change > epsilon) {
+  val old_centers = sc.broadcast(centers)
+  centers = (data
+             // For each point, find its closes center index.
+             .map(point => (closestPoint(point, old_centers.value), (point, 1)))
+             // Sum points and counts in each cluster.
+             .reduceByKey((a, b) => (a._1+b._1, a._2+b._2))
+             // Sort by cluster index.
+             .sortByKey()
+             // Compute the new centers by averaging corresponding points.
+             .map({case (index, (sum, count)) => sum :/ count.toDouble})
+             .collect())
+  // Compute change in center positions.
+  centers_change = (centers zip old_centers.value).map({case (a,b) => (a-b).norm(2)}).sum
+  old_centers.unpersist()
+  i += 1
+}
+print(centers.deep)
+</file>

[ Back to the navigation ] [ Back to the content ]

Institute of Formal and Applied Linguistics Wiki

Differences