How to groupByKey an RDD with a DenseVector as the key in Spark?
I have created an RDD in which each member is a key-value pair, with the key being a DenseVector and the value being an int. For example:

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]
Now I want to group by the key k1: DenseVector([3,4]). I expect all of the values for the key k1, which are 10 and 20, to be grouped together. But the result I get is

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]

instead of

[(DenseVector([3,4]), [10, 20])]
Please let me know if I am missing something.
The corresponding code is:
# simplified version of the code
# rdd1 is an RDD containing [(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]
rdd1.groupByKey().map(lambda x: (x[0], list(x[1])))
print(rdd1.collect())
Well, that's a tricky question, and the short answer is that you can't. To understand why, you'll have to dig deeper into the DenseVector implementation. DenseVector is simply a wrapper around a NumPy float64 ndarray:
>>> dv1 = DenseVector([3.0, 4.0])
>>> type(dv1.array)
<type 'numpy.ndarray'>
>>> dv1.array.dtype
dtype('float64')
Since NumPy ndarrays, unlike DenseVector, are mutable, they cannot be hashed in a meaningful way, although, interestingly, they do provide a __hash__ method. There is an interesting question which covers this issue (see: numpy ndarray hashability).
>>> dv1.array.__hash__() is None
False
>>> hash(dv1.array)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
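The mutability is easy to see first-hand, since the array attribute exposes the live backing ndarray, so mutating it changes the vector itself (a quick sketch continuing the session above):

>>> dv1.array[0] = 99.0  # mutate the wrapped ndarray in place
>>> dv1                  # the vector's contents change with it
DenseVector([99.0, 4.0])
>>> dv1.array[0] = 3.0   # restore the original value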
DenseVector inherits its __hash__ method from object, and it is simply based on the id (the memory address of a given instance):
>>> id(dv1) // 16 == hash(dv1)
True
Unfortunately, this means that two DenseVectors with the same content have different hashes:
>>> dv2 = DenseVector([3.0, 4.0])
>>> hash(dv1) == hash(dv2)
False
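Equality, by contrast, does compare contents (at least in the pyspark.mllib implementation), so the usual Python contract that equal objects have equal hashes is violated, and hash-based operations such as groupByKey will never place the two keys in the same bucket:

>>> dv1 == dv2  # __eq__ compares the underlying arrays
True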
What can you do? The simplest option is to use an immutable data structure that provides a consistent hash implementation, for example a tuple:
rdd.groupBy(lambda kv: tuple(kv[0]))
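Putting it together, here is a minimal end-to-end sketch (assuming a live SparkContext named sc): convert each key to a hashable tuple, group, and rebuild the DenseVector afterwards if you still need it:

from pyspark.mllib.linalg import DenseVector

rdd1 = sc.parallelize([(DenseVector([3, 4]), 10), (DenseVector([3, 4]), 20)])

grouped = (rdd1
           .map(lambda kv: (tuple(kv[0]), kv[1]))  # (3.0, 4.0) is hashable
           .groupByKey()
           .map(lambda kv: (DenseVector(kv[0]), list(kv[1]))))

print(grouped.collect())
# [(DenseVector([3.0, 4.0]), [10, 20])]

Converting back to a DenseVector is optional; if downstream code only needs the grouped values, keeping the tuple key is simpler.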
Note: In practice, using arrays as keys is most likely a bad idea; with a large number of elements the hashing process can be far too expensive to be useful. Still, if you really need something like this, Scala seems to work just fine:
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(
(Vectors.dense(3, 4), 10) :: (Vectors.dense(3, 4), 20) :: Nil)
rdd.groupByKey.collect
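The reason the Scala version behaves as expected, as far as I can tell from the mllib sources, is that the Scala Vector implementations override equals and hashCode based on the vector's contents, so identical vectors land in the same group.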