Question
I have a silly question involving fold and reduce in PySpark. I understand the difference between these two methods, but if both require the applied function to be a commutative monoid, I cannot figure out an example in which fold cannot be substituted by reduce.
Besides, the PySpark implementation of fold uses acc = op(obj, acc). Why is this operation order used instead of acc = op(acc, obj)? (This second order sounds closer to a leftFold to me.)
Cheers,
Thomas
Answer
Empty RDD
It cannot be substituted when the RDD is empty:
val rdd = sc.emptyRDD[Int]
rdd.reduce(_ + _)
// java.lang.UnsupportedOperationException: empty collection at
// org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ ...
rdd.fold(0)(_ + _)
// Int = 0
You can of course combine reduce with a check on isEmpty, but it is rather ugly.
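For illustration, a minimal sketch of that guarded variant, assuming the same empty rdd of Int defined above:

val sum = if (rdd.isEmpty()) 0 else rdd.reduce(_ + _)
// sum: Int = 0

// fold handles the empty case by itself:
val sameSum = rdd.fold(0)(_ + _)
// sameSum: Int = 0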
Mutable buffer
Another use case for fold is aggregation with a mutable buffer. Consider the following RDD:
import breeze.linalg.DenseVector
val rdd = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)
Let's say we want a sum of all the elements. A naive solution is to simply reduce with +:
rdd.reduce(_ + _)
Unfortunately, it creates a new vector for each element. Since object creation and subsequent garbage collection are expensive, it could be better to use a mutable object. This is not possible with reduce (immutability of an RDD doesn't imply immutability of its elements), but it can be achieved with fold as follows:
rdd.fold(DenseVector(0))((acc, x) => acc += x)
The zero element is used here as a mutable buffer, initialized once per partition, leaving the actual data untouched.
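To make this concrete, here is a self-contained sketch of the whole mutable-buffer aggregation (it assumes an active SparkContext named sc and breeze on the classpath; the name vectors is just for illustration):

import breeze.linalg.DenseVector

// 100 one-element vectors spread over 8 partitions, as above.
val vectors = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)

// DenseVector(0) is deserialized separately for each partition, so acc += x
// mutates only that per-partition copy and never the RDD's own elements.
val total = vectors.fold(DenseVector(0))((acc, x) => acc += x)
// total: breeze.linalg.DenseVector[Int] = DenseVector(100)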
Why is acc = op(obj, acc) used instead of acc = op(acc, obj)?
See the related Spark issue discussions for the reasoning behind this choice.