Problem Description
I want to group the list of values per key, and was doing something like this:
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)
(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))
But I noticed a Databricks blog post recommending against using groupByKey for large datasets.
Is there a way to achieve the same result using reduceByKey?
I tried the following, but it concatenates all the values. By the way, in my case both key and value are of type String.
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)
(red,zerotwo)
(yellow,one)
Recommended Answer
Use aggregateByKey:
import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    (numList, num) => { numList += num; numList },
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()
res0: Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
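The two functions passed to aggregateByKey can be understood without Spark: the first (the seqOp) folds each value within a partition into the buffer, and the second (the combOp) merges the buffers produced by different partitions. A minimal plain-Scala sketch of that contract, where the partition boundaries are assumed purely for illustration:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical partitions holding the values for key "red" (illustration only).
val partition1 = Seq("zero")
val partition2 = Seq("two")

// seqOp: fold each value into the per-partition buffer.
val seqOp = (buf: ListBuffer[String], v: String) => { buf += v; buf }
// combOp: merge the per-partition buffers.
val combOp = (a: ListBuffer[String], b: ListBuffer[String]) => { a.appendAll(b); a }

val buf1 = partition1.foldLeft(ListBuffer.empty[String])(seqOp)
val buf2 = partition2.foldLeft(ListBuffer.empty[String])(seqOp)
val merged = combOp(buf1, buf2).toList
```

Because the zero value, seqOp, and combOp all build on one mutable ListBuffer per key, no intermediate immutable collections are allocated during the fold.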
See this answer for the details on aggregateByKey, and this link for the rationale behind using a mutable collection (ListBuffer).
Is there a way to achieve the same result using reduceByKey?
Doing so is actually worse in performance; see the comments by @zero323 for the details.
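For completeness, a sketch of the reduceByKey route: wrap each value in a single-element List first, so that ++ concatenates lists rather than strings. The repeated immutable-List concatenation is what makes this slower than the aggregateByKey version above (this assumes an existing SparkContext named sc, as in the question):

```scala
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .mapValues(List(_))      // wrap each value so ++ joins lists, not strings
  .reduceByKey(_ ++ _)     // allocates a new List on every merge
  .collect()
  .foreach(println)
```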