Question
I'm using Spark 1.3.1, and I'm curious why Spark doesn't allow using array keys in map-side combining. Here is a piece of the combineByKey function:
if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
}
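For context, here is a hypothetical snippet that triggers this check (assuming an existing SparkContext named sc; the three-argument combineByKey defaults to mapSideCombine = true):

val rdd = sc.parallelize(Seq((Array(1, 2), 1), (Array(1, 2), 2)))

rdd.combineByKey(
  (v: Int) => v,                  // createCombiner
  (c: Int, v: Int) => c + v,      // mergeValue
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners
)
// => org.apache.spark.SparkException: Cannot use map-side combining with array keys.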
Answer
Basically for the same reason why the default partitioner cannot partition array keys.
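To see why that matters, here is a minimal sketch (not Spark's actual source) of the usual hashCode-modulo scheme a hash partitioner uses; since array hash codes are identity-based, two equal-content arrays can land in different partitions:

def partitionFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // keep the partition index non-negative
}

val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
println(partitionFor(a, 4)) // some partition
println(partitionFor(b, 4)) // very likely a different one, despite equal content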
Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on the content:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)
scala> val h = x.hashCode
h: Int = 630226932
scala> x(0) = -1
scala> x.hashCode() == h
res3: Boolean = true
It means that two arrays with exactly the same content are not equal:
scala> x
res4: Array[Int] = Array(-1, 2, 3)
scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)
scala> y == x
res5: Boolean = false
As a result, Arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as the key of a Scala Map:
scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
If you want to use a collection as a key, you should use an immutable data structure like a Vector or a List.
scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
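In Spark terms, that means converting the array keys before any keyed aggregation. A minimal sketch, again assuming an existing SparkContext sc:

val pairs = sc.parallelize(Seq((Array(1, 2), 1), (Array(1, 2), 2)))

// Vector has content-based equals/hashCode, so map-side combining is safe:
val summed = pairs
  .map { case (k, v) => (k.toVector, v) }
  .reduceByKey(_ + _) // reduceByKey performs map-side combining internally

// summed.collect() => Array((Vector(1, 2), 3))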
See also:
- SI-1607
- How does HashPartitioner work?
- A list as a key for PySpark's reduceByKey