Question
I want to group the list of values per key and was doing something like this:
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)
(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))
But I noticed a blog post from Databricks that recommends against using groupByKey on large datasets.
Is there a way to achieve the same result using reduceByKey?
I tried this, but it concatenates all the values. By the way, in my case both the key and the value are of type String.
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)
(red,zerotwo)
(yellow,one)
Recommended answer
Use aggregateByKey:
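import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    // seqOp: append a value to the buffer within a partition
    (numList, num) => { numList += num; numList },
    // combOp: merge the buffers built on different partitions
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()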
scala> Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
See this answer for details on aggregateByKey, and this link for the rationale behind using a mutable collection, ListBuffer.
Regarding the question "Is there a way to achieve the same result using reduceByKey?": the above is actually worse in performance; please see the comments by @zero323 for details.
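As for the concatenation seen in the question: `_ ++ _` applied to two String values joins them into one string, which is why reduceByKey returned (red,zerotwo). Wrapping each value in a one-element List first makes the same reduce function merge lists instead. The following is only a minimal sketch, assuming a spark-shell session where `sc` is predefined as in the question; the names `pairs` and `grouped` are made up for illustration. It still builds the complete list for every key, so it should not be expected to perform better than groupByKey, and the order of values within each list is not guaranteed.

```scala
// Assumes spark-shell, where `sc` (a SparkContext) is already defined.
val pairs = sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))

val grouped = pairs
  .mapValues(v => List(v)) // wrap each value so that ++ merges lists, not strings
  .reduceByKey(_ ++ _)     // concatenate the per-key lists
  .collect()

grouped.foreach(println)
```

The result has type Array[(String, List[String])], the same shape as the aggregateByKey version above.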