Problem Description
Just wondering if anyone is aware of this warning:
18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance
I've seen this a lot when loading large DataFrames with many partitions from S3 into Spark.
It never actually causes any problems for the job; I just wonder what that config property is for and how to tune it properly.
Thanks
Recommended Answer
To answer your question: this is a Spark-Hive-specific config property which, when nonzero, enables caching of partition file metadata in memory. All tables share a single cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
In the Spark source code it is defined as shown below. The default size is 250 * 1024 * 1024 bytes (250 MB), which you can override via a SparkConf object in your code or on the spark-submit command line.
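As a rough sketch, overriding the default on the spark-submit command line might look like the following; the application script name is a placeholder, and 1073741824 (1 GB) is just an illustrative value:

```shell
# Hedged sketch: raise the partition-file metadata cache from the
# default 250 MB to 1 GB (1073741824 bytes). "my_app.py" is a
# placeholder for your actual application.
spark-submit \
  --conf spark.sql.hive.filesourcePartitionFileCacheSize=1073741824 \
  my_app.py
```

A larger cache lets Spark keep metadata for more partitions in memory, avoiding the eviction the warning describes, at the cost of driver memory.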
Spark source code
val HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE =
  buildConf("spark.sql.hive.filesourcePartitionFileCacheSize")
    .doc("When nonzero, enable caching of partition file metadata in memory. All tables share " +
      "a cache that can use up to specified num bytes for file metadata. This conf only " +
      "has an effect when hive filesource partition management is enabled.")
    .longConf
    .createWithDefault(250 * 1024 * 1024)
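Note that the byte count quoted in the warning message is exactly this default; a quick arithmetic check confirms it:

```shell
# 250 * 1024 * 1024 equals the 262144000 bytes reported in the warning
echo $((250 * 1024 * 1024))
```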