Problem description
I would like to create an RDD of key-value pairs where each key has a unique value. The purpose is to "remember" key indices for later use, since keys might be shuffled around the partitions, and basically to create a lookup table of sorts. I am vectorizing some text and need to create feature vectors, so I have to have a unique value for each key.
I tried zipping a second RDD to my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt was to use a hash generator, like the one used in scikit-learn, but I'm wondering if there is some other "spark-native" way of doing this? I'm using PySpark, not Scala...
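(For context, a rough sketch of the hashing-trick fallback the question refers to; num_features and the sample keys are hypothetical, and two different keys can collide in the same bucket, which is why a Spark-native, collision-free index is preferable.)

# Hashing-trick sketch: map each key to one of num_features buckets.
# num_features and the sample keys are made up for illustration.
import zlib
from pyspark import SparkContext

sc = SparkContext(appName="hashing-sketch")

keys = sc.parallelize(["apple", "banana", "cherry", "date"])
num_features = 1024  # hypothetical feature-vector length

# zlib.crc32 is deterministic across worker processes, unlike Python's
# built-in hash(), which may be randomized per interpreter.
key_to_bucket = keys.map(lambda k: (k, zlib.crc32(k.encode("utf-8")) % num_features))
print(key_to_bucket.collect())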
zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
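Here is a minimal sketch, assuming Spark 1.1 or later, of how these methods could be used to build the key-to-index lookup table the question asks for; the app name and sample keys are made up for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="zip-with-index-sketch")

keys = sc.parallelize(["apple", "banana", "cherry", "date"], 2)

# zipWithIndex assigns consecutive indices 0..N-1 following the RDD order.
# It has to compute each partition's size first, so it triggers an extra
# Spark job when the RDD has more than one partition.
indexed = keys.zipWithIndex()        # e.g. [('apple', 0), ('banana', 1), ...]

# zipWithUniqueId avoids that extra job: the i-th item of partition k gets
# id i * numPartitions + k, so the ids are unique but not consecutive.
unique_ids = keys.zipWithUniqueId()

# Collect into a driver-side dict for a small vocabulary, or keep it as an
# RDD and join against it if the key set is too large to collect.
lookup = dict(indexed.collect())
print(lookup["cherry"])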
If you're using an older version of Spark, you should be able to cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.