


Whether the source is a Hive table or an HDFS file, I assumed that when Spark reads the data and creates a dataframe, the number of partitions in the RDD/dataframe would equal the number of part-files in HDFS. But when I tested this with a Hive external table, the number differed from the number of part-files: the dataframe had 119 partitions, while the table was a Hive partitioned table with 150 part-files, the smallest 30 MB and the largest 118 MB. So what decides the number of partitions?



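You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB; see the Spark tuning guide. Because several small files can be packed into one partition, the partition count can end up lower than the number of part-files.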
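To make the packing behaviour concrete, here is a rough sketch in plain Python of how Spark's file-based reader is understood to size and fill partitions. It is a simplified model, not Spark's actual code: the function names and the example file sizes are made up for illustration, the 4 MB open cost mirrors the default of spark.sql.files.openCostInBytes, and the parallelism value is an assumption that depends on your cluster. It also assumes the table is read through Spark's native file source, which may not hold for every Hive table format.

```python
MB = 1024 * 1024

def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, default_parallelism=8):
    """Target split size: capped by maxPartitionBytes, but smaller when
    the data is small relative to the available parallelism."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def plan_partitions(file_sizes, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, default_parallelism=8):
    """Split files into chunks, then pack the chunks into partitions
    (largest first), closing a partition when the next chunk would overflow."""
    split_size = max_split_bytes(file_sizes, max_partition_bytes,
                                 open_cost, default_parallelism)
    # Cut every file into chunks of at most split_size bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split_size, size - offset))
            offset += split_size
    chunks.sort(reverse=True)

    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > split_size:
            partitions.append(current)
            current, current_size = [], 0
        current.append(chunk)
        # Each chunk also "costs" open_cost bytes toward the partition budget.
        current_size += chunk + open_cost
    if current:
        partitions.append(current)
    return partitions

if __name__ == "__main__":
    # Hypothetical sizes: two small files fit together in one partition.
    sizes = [30 * MB, 30 * MB]
    print(len(plan_partitions(sizes, default_parallelism=1)))
```

Under this model, files well below the split size get combined, while a file close to the limit tends to occupy a partition on its own, so a mix of 30 MB and 118 MB part-files can plausibly yield fewer partitions than files.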


07-29 15:47