Question
I have a question about the number of partitions of a Spark DataFrame.
Suppose I have a Hive table (employee) with columns (name, age, id, location):
CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);
If the employee table has 10 different locations, the data will be split into 10 partitions in HDFS.
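As an illustration, Hive lays each value of the partition column out as its own directory under the table's location; the warehouse path and location values below are hypothetical:

/user/hive/warehouse/employee/location=NewYork/
/user/hive/warehouse/employee/location=London/
...
(one directory per distinct location, 10 in total)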
If I create a Spark DataFrame (df) by reading all the data of the Hive table (employee):
How many partitions will Spark create for the DataFrame (df)?
df.rdd.partitions.size = ??
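For reference, a minimal sketch of how one would check this, assuming a Hive metastore is configured for the SparkSession; only the table name employee comes from the question:

import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming Hive support is available.
val spark = SparkSession.builder()
  .appName("partition-count")
  .enableHiveSupport()
  .getOrCreate()

// Read the whole Hive table into a DataFrame.
val df = spark.table("employee")

// Number of partitions of the underlying RDD.
println(df.rdd.partitions.size)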
Answer
Partitions are created based on the HDFS block size.
Imagine you read the 10 Hive partitions as a single RDD. The data is stored on HDFS in blocks, and Spark creates one input partition per HDFS block by default, so if the block size is 128 MB:

number of partitions = (total size of the 10 partitions in MB) / 128 MB
Please refer to the following link:
http://www.bigsynapse.com/spark-input-output