Problem Description
I'm new to Spark and want to understand how MapReduce gets done under the hood to ensure I use it properly. This post provided a great answer, but my results don't seem to follow the logic described. I'm running the Spark Quick Start guide in Scala on the command line. When I add the line lengths properly, things come out just fine. The total line length is 1213:
scala> val textFile = sc.textFile("README.md")
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> val linesWithSparkLengths = linesWithSpark.map(s => s.length)
scala> linesWithSparkLengths.foreach(println)
Result:
14
78
73
42
68
17
62
45
76
64
54
74
84
29
136
77
77
73
70
scala> val totalLWSparkLength = linesWithSparkLengths.reduce((a,b) => a+b)
totalLWSparkLength: Int = 1213
When I tweak it slightly to use (a-b) instead of (a+b),
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
I expected -1185, according to the logic in this post:
List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70).reduce( (x,y) => x - y )
Step 1 : op( 14, 78 ) will be the first evaluation.
x is 14 and y is 78. Result of x - y = -64.
Step 2: op( op( 14, 78 ), 73 )
x is op(14,78) = -64 and y = 73. Result of x - y = -137
Step 3: op( op( op( 14, 78 ), 73 ), 42)
x is op( op( 14, 78 ), 73 ) = -137 and y is 42. Result is -179.
...
Step 18: op( op( (...), 73 ), 70 ) will be the final evaluation.
x is -1115 and y is 70. Result of x - y is -1185.
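The quoted walkthrough is easy to verify without Spark at all: on a plain Scala List, reduce folds sequentially left-to-right (it is effectively reduceLeft), so it reproduces the steps above exactly. In the Scala REPL:
scala> val lengths = List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70)
scala> lengths.reduce((x, y) => x - y)  // single-threaded left fold
res0: Int = -1185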
However, something strange happens:
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = 151
When I run it again...
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = -151
Can anyone tell me why the result is 151 (or -151) instead of -1185?
Recommended Answer
This happens because subtraction is neither associative nor commutative. Let's start with associativity:
(- (- (- 14 78) 73) 42)
(- (- -64 73) 42)
(- -137 42)
-179
is not the same as
(- (- 14 78) (- 73 42))
(- -64 (- 73 42))
(- -64 31)
-95
Now it's time for commutativity:
(- (- (- 14 78) 73) 42) ;; From the previous example
is not the same as
(- (- (- 42 73) 78) 14)
(- (- -31 78) 14)
(- -109 14)
-123
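For reference, both failures can be checked in the Scala REPL; op below is just a local stand-in for the lambda passed to reduce:
scala> val op = (a: Int, b: Int) => a - b
scala> op(op(14, 78), 73) == op(14, op(78, 73))  // (14-78)-73 = -137, but 14-(78-73) = 9
res0: Boolean = false
scala> op(14, 78) == op(78, 14)                  // -64 vs 64
res1: Boolean = false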
Spark first applies reduce on individual partitions and then merges the partial results in arbitrary order. If the function you use doesn't meet one or both of these criteria, the final result can be non-deterministic.
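You can simulate that partition-then-merge behaviour with plain Scala collections. The split point below is illustrative only; Spark picks its own partition boundaries from the input file, which is why these values differ from the ±151 seen above. The point is that the two possible merge orders yield results that are negatives of each other:
scala> val lengths = List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70)
scala> val (p0, p1) = lengths.splitAt(10)   // pretend these are Spark's two partitions
scala> val partial0 = p0.reduce(_ - _)      // local reduce of partition 0
partial0: Int = -511
scala> val partial1 = p1.reduce(_ - _)      // local reduce of partition 1
partial1: Int = -566
scala> partial0 - partial1                  // partials merged in one order...
res0: Int = 55
scala> partial1 - partial0                  // ...or the other: only the sign flips
res1: Int = -55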