Problem description
This is a follow-up question from here. I am trying to implement k-means based on this implementation. It works great, but I would like to replace groupByKey() with reduceByKey(), and I am not sure how (I am not worried about performance now). Here is the relevant minified code:
val data = sc.textFile("dense.txt").map(
    t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
val read_mean_centroids = sc.textFile("centroids.txt").map(
    t => (t.split("#")(0), parseVector(t.split("#")(1))))
var centroids = read_mean_centroids.takeSample(false, K, 42).map(x => x._2)
do {
    var closest = read_mean_centroids.map(p => (closestPoint(p._2, centroids), p._2))
    var pointsGroup = closest.groupByKey() // <-- THE VICTIM :)
    var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
    ..
Note that println(newCentroids) gives:

Map(23 -> (-6.269305E-4,-0.0011746404,-4.08004E-5), 8 -> (-5.108732E-4,7.336348E-4,-3.707591E-4), 17 -> (-0.0016383086,-0.0016974678,1.45..

and println(closest) gives:

MapPartitionsRDD[6] at map at kmeans.scala:75
Relevant question: Using reduceByKey in Apache Spark (Scala).
Some documentation:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(func: (V, V) ⇒ V, numPartitions: Int): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def groupByKey(): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence.
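To make the difference concrete, here is a minimal sketch (assuming an existing SparkContext sc and a toy RDD; the names are illustrative, not from the question): groupByKey collects every value for a key into one sequence, while reduceByKey merges the values pairwise with the supplied function.

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()       // RDD[(String, Iterable[Int])], e.g. ("a", [1, 3])
    val reduced = pairs.reduceByKey(_ + _) // RDD[(String, Int)], e.g. ("a", 4)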
Recommended answer
You could use aggregateByKey() (a bit more natural than reduceByKey()) like this to compute newCentroids:
val newCentroids = closest.aggregateByKey((Vector.zeros(dim), 0L))( // accumulator = (running sum, count)
    (agg, v) => (agg._1 += v, agg._2 + 1L),                         // fold one point into a partition-local sum
    (agg1, agg2) => (agg1._1 += agg2._1, agg1._2 + agg2._2)         // merge partial sums across partitions
).mapValues(agg => agg._1 / agg._2).collectAsMap                    // mean = sum / count
For this to work you will need to compute the dimensionality of your data, i.e. dim, but you only need to do this once. You could probably use something like val dim = data.first._2.length.
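Since the original goal was to use reduceByKey() specifically, here is a minimal sketch of the same centroid update done that way. It assumes the same Vector type as the code above (supporting elementwise + and division by a scalar) and the closest RDD already defined; the variable names are illustrative, not from the answer.

    // Pair every point with a count of 1, then sum vectors and counts per centroid index.
    val sumsAndCounts = closest
        .mapValues(v => (v, 1L))
        .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    // Divide each summed vector by its count to obtain the new centroid.
    val newCentroidsViaReduce = sumsAndCounts
        .mapValues { case (sum, count) => sum / count.toDouble }
        .collectAsMap

The idea is the same as the aggregateByKey version: because reduceByKey requires the merged value to have the same type as the input values, the per-point count is carried along inside the value itself.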