How to groupByKey an RDD with a DenseVector as the key in Spark?
I have created an RDD in which each member is a key-value pair, with the key being a DenseVector and the value being an int. For example:

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]
Now I want to group by the key k1: DenseVector([3,4]). I expect all of the values for the key k1, which are 10 and 20, to be grouped together. But the result I get is

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]

instead of

[(DenseVector([3,4]), [10, 20])]
Please let me know if I am missing something.
The corresponding code is:
# simplified version of the code
# rdd1 is an RDD containing [(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]
rdd1.groupByKey().map(lambda x: (x[0], list(x[1])))
print(rdd1.collect())
Well, that's a tricky question, and the short answer is that you can't. To understand why, you'll have to dig deeper into the DenseVector implementation. DenseVector is simply a wrapper around a NumPy float64 ndarray:
>>> dv1 = DenseVector([3.0, 4.0])
>>> type(dv1.array)
<type 'numpy.ndarray'>
>>> dv1.array.dtype
dtype('float64')
Since NumPy ndarrays, unlike DenseVector, are mutable, they cannot be hashed in a meaningful way, although, interestingly, they do provide a __hash__ method. There is an interesting question which covers this issue (see: numpy ndarray hashability).
>>> dv1.array.__hash__() is None
False
>>> hash(dv1.array)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
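The mutability is easy to see first-hand, since the array attribute exposes the live backing ndarray, so mutating it changes the vector itself (a quick sketch continuing the session above):

>>> dv1.array[0] = 99.0  # mutate the wrapped ndarray in place
>>> dv1                  # the vector's contents change with it
DenseVector([99.0, 4.0])
>>> dv1.array[0] = 3.0   # restore the original value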
DenseVector inherits its __hash__ method from object, and it is simply based on the id (the memory address of a given instance):
>>> id(dv1) // 16 == hash(dv1)
True
Unfortunately, this means that two DenseVectors with the same content have different hashes:
>>> dv2 = DenseVector([3.0, 4.0])
>>> hash(dv1) == hash(dv2)
False
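Equality, by contrast, does compare contents (at least in the pyspark.mllib implementation), so the usual Python contract that equal objects have equal hashes is violated, and hash-based operations such as groupByKey will never place the two keys in the same bucket:

>>> dv1 == dv2  # __eq__ compares the underlying arrays
True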
What can you do? The simplest option is to use an immutable data structure that provides a consistent hash implementation, for example a tuple:
rdd.groupBy(lambda kv: tuple(kv[0]))
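Putting it together, here is a minimal end-to-end sketch (assuming a live SparkContext named sc): convert each key to a hashable tuple, group, and rebuild the DenseVector afterwards if you still need it:

from pyspark.mllib.linalg import DenseVector

rdd1 = sc.parallelize([(DenseVector([3, 4]), 10), (DenseVector([3, 4]), 20)])

grouped = (rdd1
           .map(lambda kv: (tuple(kv[0]), kv[1]))  # (3.0, 4.0) is hashable
           .groupByKey()
           .map(lambda kv: (DenseVector(kv[0]), list(kv[1]))))

print(grouped.collect())
# [(DenseVector([3.0, 4.0]), [10, 20])]

Converting back to a DenseVector is optional; if downstream code only needs the grouped values, keeping the tuple key is simpler.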
Note: In practice, using arrays as keys is most likely a bad idea; with a large number of elements the hashing process can be far too expensive to be useful. Still, if you really need something like this, Scala seems to work just fine:
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(
(Vectors.dense(3, 4), 10) :: (Vectors.dense(3, 4), 20) :: Nil)
rdd.groupByKey.collect
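The reason the Scala version behaves as expected, as far as I can tell from the mllib sources, is that the Scala Vector implementations override equals and hashCode based on the vector's contents, so identical vectors land in the same group.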