Problem Description
PySpark does not allow me to create buckets:
(
df
.write
.partitionBy('Source')
.bucketBy(8, 'destination')
.saveAsTable('flightdata')
)
AttributeError                            Traceback (most recent call last)
in ()
----> 1 df.write.bucketBy(2,"Source").saveAsTable("table")

AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'
Recommended Answer
It looks like bucketBy is only supported in Spark 2.3.0 and later:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.bucketBy
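If upgrading to Spark 2.3.0 or later is an option, the original call should work as written. A minimal sketch of a version check plus the bucketed write, assuming the df from the question (note that bucketed writes must go through saveAsTable; a plain save() to a path does not support bucketBy):

import pyspark

# bucketBy was added to the Python DataFrameWriter in Spark 2.3.0
print(pyspark.__version__)

# Bucketed writes are only supported via saveAsTable, not save()
(
df
.write
.partitionBy('Source')
.bucketBy(8, 'destination')
.saveAsTable('flightdata')
)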
On an older version, you could try creating a new bucket column instead:
from pyspark.ml.feature import Bucketizer

# Bucketizer maps a numeric input column into discrete buckets defined by
# splits; here a single bucket covers [0, Inf). setHandleInvalid("keep")
# routes NaN/invalid values into an extra bucket instead of failing.
bucketizer = Bucketizer(splits=[0, float('Inf')], inputCol="destination", outputCol="buckets")
df_with_buckets = bucketizer.setHandleInvalid("keep").transform(df)
Then use partitionBy(*cols):
df_with_buckets.write.partitionBy('buckets').saveAsTable("table")
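To sanity-check the workaround, you can read the saved table back and confirm the bucket column was applied; a minimal sketch, assuming a SparkSession named spark:

# Read the table back and inspect the generated bucket values
spark.table("table").select("buckets").distinct().show()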