How to determine whether an object is a valid key-value pair in PySpark


Question

If I have an RDD, how can I tell whether its data is in key:value format? Is there a way to find out, something like how type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
Suppose I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values, where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), it becomes much easier to treat this data as key-value pairs.

Recommended answer

Python is a dynamically typed language, and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:

k, v = kv
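The unpacking requirement above can be turned into a small check. This is a minimal sketch in plain Python, not part of PySpark; the helper name is_key_value is hypothetical and only illustrates the rule.

```python
def is_key_value(obj):
    """Return True if `k, v = obj` succeeds, i.e. obj unpacks into exactly two items.

    Hypothetical helper for illustration; PySpark itself performs no such check.
    Note: calling this on a one-shot iterator consumes it.
    """
    try:
        k, v = obj
    except (TypeError, ValueError):
        # TypeError: obj is not iterable; ValueError: wrong number of items.
        return False
    return True


pair_ok = is_key_value(("x", 1))         # two-element tuple
list_ok = is_key_value(["x", 1])         # a two-element list unpacks as well
triple_ok = is_key_value(("x", 1, 2))    # too many items to unpack
scalar_ok = is_key_value(42)             # not iterable at all
string_ok = is_key_value("ab")           # caveat: any two-item iterable passes
```

Because the test is purely structural, a two-character string or a two-key dict also passes, so the check says only that an object can be unpacked as a pair, not that it was meant to be one.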

Typically you would use a two-element tuple because of its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:

key_value.py

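class KeyValue(object):
    """Minimal pair-like class: iterating yields the key, then the value."""

    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Makes `k, v = kv` work, which is all PairRDD operations need.
        yield self.k
        yield self.v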

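from operator import add  # binary addition, used as the reduce function

from key_value import KeyValue

# Assumes an existing SparkContext named sc (e.g. in the pyspark shell).
rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]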

This makes an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
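The two alternatives mentioned here can be sketched in plain Python. The names Pair and PairLike are hypothetical and chosen only for this example; the snippet shows only that both objects unpack as a pair, which is the property the answer says PairRDD operations rely on.

```python
from collections import namedtuple

# A namedtuple is a real tuple subclass, so it unpacks like any two-element
# tuple while also giving the fields readable names.
Pair = namedtuple("Pair", ["key", "value"])


class PairLike(object):
    """Pair-like class built on the sequence protocol instead of __iter__."""

    def __init__(self, k, v):
        self._items = (k, v)

    def __len__(self):
        return 2

    def __getitem__(self, index):
        # Raises IndexError for index >= 2, which is what ends the unpacking.
        return self._items[index]


k1, v1 = Pair("foo", 1)
k2, v2 = PairLike("bar", 0)
named_key = Pair("foo", 1).key  # fields stay accessible by name
```

With a namedtuple, something like sc.parallelize([Pair("foo", 1), Pair("foo", 2)]) should behave just like an RDD of plain tuples, since each element is a tuple.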

Also, rdd.take(n) returns a list of up to n elements, so type(rdd.take(1)) will always be list, regardless of what the RDD contains.
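To look at the type of the data rather than the type of the container, inspect an element of the returned list. In this sketch a plain Python list stands in for the result of rdd.take(1), which is an assumption made so that the snippet runs without a Spark installation.

```python
# Stand-in for rdd.take(1); with a real RDD this would be `sample = rdd.take(1)`.
sample = [("x", 1)]

container_type = type(sample)    # the list that take() always returns
element_type = type(sample[0])   # the type of the record itself

# The same unpacking that PairRDD operations rely on.
key, value = sample[0]
```

With a real RDD, rdd.first() returns the first element directly, so type(rdd.first()) gives the element type without indexing into a list.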
