本文介绍了如何使用 Spark 执行插入覆盖?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将我们的一个 ETL Hive 脚本转换为 Spark,其中 Hive ETL 脚本维护一个表,其中每晚需要在新同步之前删除部分数据.Hive ETL 使用插入覆盖将主表删除超过 3 天的数据.基本上创建一个临时表,其中的数据不超过三天,然后覆盖主表.

I'm trying to transition one of our ETL Hive script to Spark where the Hive ETL script maintains a table where part of data needs to be deleted every night before the new sync. The Hive ETL takes the main table deletes data that in greater than 3 days using insert overwrite. Basically creates a temp table with data that doesn't surpass greater than three days and then overwrites the main table.

使用 Spark(使用 Scala)时,我不断收到此错误,无法写入相同的源.这是我的代码:

With Spark (using Scala) I keep getting this error where I cannot write to the same source. Here's my code:

spark.sql ("Select * from mytbl_hive where dt > date_sub(current_date, 3)").registerTempTable("tmp_mytbl")

val mytbl = sqlContext.table("tmp_mytbl")
mytbl.write.mode("overwrite").saveTableAs("tmp_mytbl")

//writing back to Hive ...

mytbl.write.mode("overwrite").insertInto("mytbl_hive")

我收到了无法写入正在读取的表的错误.

I get the error that I cannot write to the table I'm reading from.

有没有人知道这样做的更好方法?

Does anyone know of a better way of doing this?

推荐答案

你不能.正如您所了解的,Spark 明确禁止覆盖用作查询源的表.虽然根据技术细节存在一些变通方法,但并不可靠,应避免使用.

You cannot. As you've learned Spark explicitly prohibits overwriting table, which is used as a source for the query. While there exist some workarounds depending on the technicalities, there are not reliable and should be avoided.

相反:

  • 将数据写入临时表.
  • 放下旧桌子.
  • 重命名临时表.

Hive ETL 使用插入覆盖将主表删除超过 3 天的数据.

最好按日期对数据进行分区,然后删除分区,甚至不查看数据.

It might a better idea to partition data by date, and just drop partitions, without even looking at the data.

这篇关于如何使用 Spark 执行插入覆盖?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-01 04:41