java - How to replace groupByKey with reduceByKey so that it returns Iterable values in Spark Java?

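I have a Spark Java program in which a groupByKey followed by a mapValues step returns a PairRDD whose value is an Iterable of all the input RDD values.
I have read that replacing groupByKey plus mapValues with reduceByKey can improve performance, but I don't know how to apply reduceByKey to my problem here.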

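Specifically, I have an input pair RDD whose values are of type Tuple5. After the groupByKey and mapValues transformations, I need to get a key-value pair RDD in which each value is an Iterable of the input values.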

JavaPairRDD<Long,Tuple5<...>> inputRDD;
...
...
...
JavaPairRDD<Long, Iterable<Tuple5<...>>> groupedRDD = inputRDD
    .groupByKey()
    .mapValues(
            new Function<Iterable<Tuple5<...>>,Iterable<Tuple5<...>>>() {

                @Override
                public Iterable<Tuple5<...>> call(
                        Iterable<Tuple5<...>> v1)
                        throws Exception {

                    /*
                    Some steps here..
                    */

                    return mappedValue;
                }
            });

Is there a way to get the above transformation using reduceByKey?

Best answer

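I've been using Scala on Spark, so this may not be the exact answer you were hoping for. The main difference in coding between groupByKey/mapValues and reduceByKey can be seen in a simple example adapted from this article: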

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithGroup = wordPairsRDD.
  groupByKey.
  mapValues(_.sum)
wordCountsWithGroup.collect
res1: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

val wordCountsWithReduce = wordPairsRDD.
  reduceByKey(_ + _)
wordCountsWithReduce.collect
res2: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

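In this example, where mapValues uses x => x.sum (i.e. _.sum), reduceByKey uses (acc, x) => acc + x (i.e. _ + _). The function signatures are very different. In mapValues you process the whole collection of grouped values for a key at once, whereas in reduceByKey you perform a reduction: a two-argument function combines the values for a key one pair at a time.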
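Since the question is about Java, here is a rough, Spark-free sketch of the same two shapes using plain Java collections. It is only a local analogy, not Spark code: the class name GroupVersusReduceSketch and its three methods are hypothetical names made up for illustration, and Map.merge stands in for reduceByKey. The third method shows what an Iterable-valued result looks like when it is forced through a reduction: each value is first wrapped in a single-element list, and the merge function concatenates lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Spark-free analogy: plain collections stand in for a pair RDD.
class GroupVersusReduceSketch {

    // groupByKey + mapValues shape: collect every value for a key first,
    // then apply a one-argument function to the whole collection (here: sum).
    static Map<String, Integer> countByGrouping(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // reduceByKey shape: a two-argument function (acc, x) -> acc + x merges
    // values pairwise, so no per-key collection is ever built.
    static Map<String, Integer> countByReducing(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, (acc, x) -> acc + x);
        }
        return counts;
    }

    // Collection-valued result via a reduction: wrap each value in a
    // single-element list, then merge by concatenating. Keys are word
    // lengths, chosen only so that several words share a key.
    static Map<Integer, List<String>> listsByReducing(List<String> words) {
        Map<Integer, List<String>> lists = new TreeMap<>();
        for (String word : words) {
            List<String> single = new ArrayList<>();
            single.add(word);
            lists.merge(word.length(), single, (acc, more) -> {
                acc.addAll(more);
                return acc;
            });
        }
        return lists;
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("one", "two", "two", "three", "three", "three");
        System.out.println(countByGrouping(words));  // {one=1, three=3, two=2}
        System.out.println(countByReducing(words));  // {one=1, three=3, two=2}
        System.out.println(listsByReducing(words));  // {3=[one, two, two], 5=[three, three, three]}
    }
}
```

In Spark, the function passed to reduceByKey has to return the same type as the values it combines, so getting a collection out of it would need a similar wrapping step before the reduction. Note that the list-concatenating version still ends up holding every value for a key, just like the grouped version. As far as I understand, the gain usually attributed to reduceByKey comes from the reduction shrinking the data (as the sum does), so rebuilding the full Iterable this way should not be expected to beat groupByKey.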