This article explains how to resolve the PySpark error "Cannot deserialize RDD with different number of items in pair", which may be a useful reference for anyone hitting the same problem.
Problem description
I have two RDDs which have key-value pairs. I want to join them by key (and, for each key, get the cartesian product of all values), which I assume can be done with the zip() function of pyspark. However, when I apply this,
from operator import add

elemPairs = elems1.zip(elems2).reduceByKey(add)
it gives me the error:
Cannot deserialize RDD with different number of items in pair: (40, 10)
And here are the 2 RDDs which I try to zip:
elems1 => [((0, 0), ('A', 0, 90)), ((0, 1), ('A', 0, 90)), ((0, 2), ('A', 0, 90)), ((0, 3), ('A', 0, 90)), ((0, 4), ('A', 0, 90)), ((0, 0), ('A', 1, 401)), ((0, 1), ('A', 1, 401)), ((0, 2), ('A', 1, 401)), ((0, 3), ('A', 1, 401)), ((0, 4), ('A', 1, 401)), ((1, 0), ('A', 0, 305)), ((1, 1), ('A', 0, 305)), ((1, 2), ('A', 0, 305)), ((1, 3), ('A', 0, 305)), ((1, 4), ('A', 0, 305)), ((1, 0), ('A', 1, 351)), ((1, 1), ('A', 1, 351)), ((1, 2), ('A', 1, 351)), ((1, 3), ('A', 1, 351)), ((1, 4), ('A', 1, 351)), ((2, 0), ('A', 0, 178)), ((2, 1), ('A', 0, 178)), ((2, 2), ('A', 0, 178)), ((2, 3), ('A', 0, 178)), ((2, 4), ('A', 0, 178)), ((2, 0), ('A', 1, 692)), ((2, 1), ('A', 1, 692)), ((2, 2), ('A', 1, 692)), ((2, 3), ('A', 1, 692)), ((2, 4), ('A', 1, 692)), ((3, 0), ('A', 0, 936)), ((3, 1), ('A', 0, 936)), ((3, 2), ('A', 0, 936)), ((3, 3), ('A', 0, 936)), ((3, 4), ('A', 0, 936)), ((3, 0), ('A', 1, 149)), ((3, 1), ('A', 1, 149)), ((3, 2), ('A', 1, 149)), ((3, 3), ('A', 1, 149)), ((3, 4), ('A', 1, 149))]
elems2 => [((0, 0), ('B', 0, 573)), ((1, 0), ('B', 0, 573)), ((2, 0), ('B', 0, 573)), ((3, 0), ('B', 0, 573)), ((4, 0), ('B', 0, 573)), ((0, 0), ('B', 1, 324)), ((1, 0), ('B', 1, 324)), ((2, 0), ('B', 1, 324)), ((3, 0), ('B', 1, 324)), ((4, 0), ('B', 1, 324))]
where, in ((0, 0), ('B', 0, 573)), (0, 0) is the key and ('B', 0, 573) is the value.
After a quick Google search, I found that this is a problem which only occurs in Spark 1.2; however, I am using Spark 1.5.
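The underlying cause is not version-specific: in PySpark, rdd1.zip(rdd2) pairs elements positionally and requires both RDDs to have the same number of partitions and the same number of elements in each partition. Here elems1 has 40 records while elems2 has 10, which matches the (40, 10) in the error message. Below is a minimal sketch reproducing the failure, assuming a hypothetical local SparkContext (the exact exception text varies slightly across Spark versions):

from pyspark import SparkContext

# Hypothetical local context, purely for illustration
sc = SparkContext("local", "zip-demo")

# 40 elements vs 10 elements, each RDD in a single partition
a = sc.parallelize(range(40), 1)
b = sc.parallelize(range(10), 1)

# zip() pairs elements by position, so it demands equal partition counts
# and equal element counts per partition; this size mismatch raises an
# error like "Cannot deserialize RDD with different number of items in
# pair: (40, 10)" when the result is materialized.
try:
    a.zip(b).collect()
except Exception as e:
    print(type(e).__name__)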
Recommended answer
Why not just use elems1.join(elems2)?
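join() matches records by key and, for each key, emits every combination of values from the two RDDs, which is exactly the per-key cartesian product the question asks for, and it places no requirement on partition sizes. A minimal sketch using a two-record slice of the data above (again assuming a hypothetical local SparkContext):

from pyspark import SparkContext

# Hypothetical local context, purely for illustration
sc = SparkContext("local", "join-demo")

elems1 = sc.parallelize([((0, 0), ('A', 0, 90)), ((0, 0), ('A', 1, 401))])
elems2 = sc.parallelize([((0, 0), ('B', 0, 573)), ((0, 0), ('B', 1, 324))])

# join() shuffles both RDDs by key, then pairs every value of a key in
# elems1 with every value of the same key in elems2 (a per-key cartesian
# product), regardless of how many records each side has.
joined = elems1.join(elems2)
print(sorted(joined.collect()))
# [((0, 0), (('A', 0, 90), ('B', 0, 573))),
#  ((0, 0), (('A', 0, 90), ('B', 1, 324))),
#  ((0, 0), (('A', 1, 401), ('B', 0, 573))),
#  ((0, 0), (('A', 1, 401), ('B', 1, 324)))]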
This concludes the article on the "Cannot deserialize RDD with different number of items in pair" error; hopefully the recommended answer is helpful.