Problem Description
When I write data from a DataFrame into a partitioned Parquet table, all of the tasks complete successfully but the process then gets stuck updating partition stats.
16/10/05 03:46:13 WARN log: Updating partition stats fast for:
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for:
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for:
df.write.format("parquet").mode("overwrite").partitionBy(part1).insertInto("db.tbl")
My table has > 400 columns and > 1000 partitions. Please let me know if there is a way to optimize and speed up the partition stats update.
Recommended Answer
I think the problem here is that there are too many partitions for a table with > 400 columns. Every time you overwrite a table in Hive, the statistics are updated. In your case it will try to update the statistics for 1000 partitions, and each partition holds data with > 400 columns.
Try reducing the number of partitions (use another partition column, or if it is a date column, consider partitioning by month) and you should see a significant improvement in performance.
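As a rough illustration of that suggestion, here is a minimal PySpark sketch that derives a coarser month column from a date column and partitions on that instead. The column and table names (event_date, db.tbl_source, db.tbl_by_month) are assumptions for the example, not taken from the question, and the original insertInto call is replaced with saveAsTable just to keep the sketch self-contained:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("partition-by-month")
         .enableHiveSupport()
         .getOrCreate())

# Assumed source table holding the > 400-column data with a date column.
df = spark.table("db.tbl_source")

# date_format collapses each date into a single month string, e.g. "2016-10",
# so one partition covers a whole month instead of one partition per day.
df_month = df.withColumn("part_month", F.date_format(F.col("event_date"), "yyyy-MM"))

(df_month.write
    .mode("overwrite")
    .partitionBy("part_month")   # far fewer partitions for Hive to update stats on
    .saveAsTable("db.tbl_by_month"))

With daily data this replaces roughly 30 date partitions with a single month partition, so each overwrite triggers about 1/30th as many partition-stats updates in the metastore.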