Problem description
The documentation for PySpark shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods (see https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html).
Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.
Recommended answer
It really depends on your goals.
Technically speaking it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to (a monkey-patching sketch follows the example output below).
from pyspark.sql import DataFrame

class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Re-wrap the JVM DataFrame and SQLContext of an existing DataFrame
        super(self.__class__, self).__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Prepend a sequential index column by going through the underlying RDD
        return (self.rdd
                .zipWithIndex()
                .map(lambda row: (row[1], ) + row[0])
                .toDF(["_idx"] + self.columns))
Example usage:
df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])
with_zipwithindex = DataFrameWithZipWithIndex(df)
isinstance(with_zipwithindex, DataFrame)
True
with_zipwithindex.zipWithIndex().show()
+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+
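As an alternative to subclassing, the monkey-patching option mentioned above could look something like the following minimal sketch. Note that the method name zip_with_index here is just an illustrative choice, not an existing PySpark API.

from pyspark.sql import DataFrame

# Attach the same functionality to every DataFrame by patching the class itself.
def zip_with_index(self):
    return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))

DataFrame.zip_with_index = zip_with_index

# Afterwards any existing DataFrame can call it directly:
# df.zip_with_index().show()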
Practically speaking you won't be able to do much here. DataFrame is a thin wrapper around a JVM object and doesn't do much beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results with Python adapters where necessary.
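To illustrate the wrapper nature, the following is a simplified sketch (not the literal PySpark source) of the general shape of a typical DataFrame method: the Python side just forwards the call to the Java object held in _jdf and converts the result.

# Simplified sketch of how a typical pyspark.sql.DataFrame method delegates to the JVM.
def count(self):
    # self._jdf is the py4j proxy for the underlying Java DataFrame/Dataset;
    # the Python layer only converts the returned value to a Python int.
    return int(self._jdf.count())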
With plain Python code you won't be able to even get near DataFrame / Dataset internals or modify their core behavior. If you're looking for a standalone, Python-only Spark DataFrame implementation, it is not possible.