Question
I have a question about the number of partitions of a Spark DataFrame.
Suppose I have a Hive table (employee) with columns (name, age, id, location):
CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);
If the employee table has 10 different locations, the data will be split into 10 partitions in HDFS.
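As an illustration, Hive lays each value of the partition column out as its own directory under the table's location; the warehouse path and location values below are hypothetical:

/user/hive/warehouse/employee/location=NewYork/
/user/hive/warehouse/employee/location=London/
...
(one directory per distinct location, 10 in total)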
If I create a Spark DataFrame (df) by reading all the data of the Hive table (employee):
How many partitions will Spark create for the DataFrame (df)?
df.rdd.partitions.size = ??
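For reference, a minimal sketch of how one would check this, assuming a Hive metastore is configured for the SparkSession; only the table name employee comes from the question:

import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming Hive support is available.
val spark = SparkSession.builder()
  .appName("partition-count")
  .enableHiveSupport()
  .getOrCreate()

// Read the whole Hive table into a DataFrame.
val df = spark.table("employee")

// Number of partitions of the underlying RDD.
println(df.rdd.partitions.size)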
Answer
Partitions are created based on the HDFS block size.
Imagine you read the 10 Hive partitions as a single RDD. The data is stored on HDFS in blocks, and Spark creates one input partition per HDFS block by default, so if the block size is 128 MB:

number of partitions = (total size of the 10 partitions in MB) / 128 MB
Please refer to the following link:
http://www.bigsynapse.com/spark-input-output