This article walks through a PySpark INSERT OVERWRITE problem and its resolution; it should be a useful reference for anyone hitting the same issue.

Problem Description

Below are the last two lines of the PySpark ETL code:

df_writer = DataFrameWriter(usage_fact)
df_writer.partitionBy("data_date", "data_product").saveAsTable(usageWideFactTable, format=fileFormat, mode=writeMode, path=usageWideFactpath)

where writeMode = append and fileFormat = orc.

I wanted to use INSERT OVERWRITE instead, so that my data does not get appended when I re-run the code. Hence I used this:

usage_fact.createOrReplaceTempView("usage_fact")
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact")

But this gives me the error below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 545, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Cannot overwrite a path that is also being read from.;'

It looks like I cannot overwrite a path that I am also reading from, but since I am new to PySpark, I don't know how to rectify it. What exact code should I use to fix this issue?
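
The error arises because usage_fact presumably traces back, through its lineage, to the same S3 location that the INSERT OVERWRITE writes to; since Spark evaluates lazily, the overwrite would destroy its own input mid-read. A commonly suggested workaround (not part of the original question or the accepted answer below) is to break the lineage by materializing the data somewhere else first. A minimal sketch, assuming a SparkSession named spark and a hypothetical staging path:

# Hedged sketch of the lineage-breaking workaround (not the accepted answer):
# write the DataFrame to a staging location, then overwrite the target from
# there, so the write no longer reads from the path it is overwriting.
# staging_path is a hypothetical placeholder.
staging_path = "s3://saasdata/tmp/usage_fact_staging/"

# 1) Materialize the data away from the target path.
usage_fact.write.mode("overwrite").format("orc").save(staging_path)

# 2) Read it back and overwrite the original table safely.
staged = spark.read.format("orc").load(staging_path)
staged.write.mode("overwrite") \
    .partitionBy("data_date", "data_product") \
    .format("orc") \
    .saveAsTable(usageWideFactTable)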

Recommended Answer

The same code above worked for me. I just made a change in the DDL and recreated the table with the details below (removing the table properties, if any were used):

PARTITIONED BY (
  `data_date` string,
  `data_product` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://saasdata/datawarehouse/fact/UsageFact/'
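
With the table recreated this way, the INSERT OVERWRITE statement from the question runs as-is. Note that writing into dynamic partitions under Hive semantics typically also requires nonstrict dynamic-partition mode; a minimal sketch of the full rerun follows, where the two SET statements are standard Hive settings added as an assumption, not part of the original answer:

# Standard Hive settings for dynamic-partition inserts (an assumption here,
# not from the original answer; often required for this statement to succeed).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Re-run the overwrite from the question against the recreated table.
usage_fact.createOrReplaceTempView("usage_fact")
spark.sql(
    "INSERT OVERWRITE TABLE " + usageWideFactTable +
    " PARTITION (data_date, data_product) SELECT * FROM usage_fact"
)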

This concludes this article on the PySpark INSERT OVERWRITE issue. We hope the recommended answer above is helpful.
