Problem Description
What is the relationship between split size and block size in Hadoop? As I read in this, the split size must be n times the block size (where n is an integer and n > 0). Is this correct? Is there a mandatory relationship between split size and block size?
Recommended Answer
In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS; then there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes. Depending on your cluster configuration, these blocks/chunks will reside on different DataNodes.
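The block size is a per-file property. As a hedged illustration, here is a minimal Java sketch of writing a file with an explicit 64 MB block size through the HDFS client API; the path, buffer size, and replication factor are assumptions for the example, not values from the answer above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize sets the default block size for new files
        // (64 MB in older releases, 128 MB in Hadoop 2+).
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // The block size can also be set per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/data/big-file.dat"),   // hypothetical path
                true,                             // overwrite
                4096,                             // buffer size
                (short) 3,                        // replication factor
                64L * 1024 * 1024);               // 64 MB block size
        out.close();
    }
}
```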
Data splitting happens based on file offsets. The goal of splitting a file and storing it in different blocks is parallel processing and data failover.
Difference between block size and split size:
A split is a logical division of the data, used mainly during data processing with a Map/Reduce program or other data-processing techniques in the Hadoop ecosystem. The split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing.
The split is primarily used to control the number of Mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, the default HDFS block size is used as the input split.
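To make the "user-defined split size" concrete, below is a minimal sketch of a job setup using the new MapReduce API (org.apache.hadoop.mapreduce). The input path and job name are placeholders, and the 25 MB value is only an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");   // hypothetical job name

        FileInputFormat.addInputPath(job, new Path("/data/input"));  // hypothetical path

        // Bound the logical split size; without these settings, the
        // HDFS block size is used as the input split size.
        FileInputFormat.setMinInputSplitSize(job, 25L * 1024 * 1024); // 25 MB
        FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024);

        // Equivalent configuration properties:
        //   mapreduce.input.fileinputformat.split.minsize
        //   mapreduce.input.fileinputformat.split.maxsize
    }
}
```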
Example:
Suppose you have a 100 MB file and the HDFS default block size is 64 MB; the file will then be chopped into 2 splits and occupy 2 blocks. Now you have a Map/Reduce program to process this data, but you have not specified an input split size; then, based on the number of blocks (2 blocks), 2 input splits will be used for the Map/Reduce processing, and 2 mappers will be assigned to this job.
But suppose you specify a split size of, say, 100 MB in your Map/Reduce program; then both blocks (2 blocks) will be treated as a single split for the Map/Reduce processing, and 1 Mapper will be assigned to this job.
Suppose you specify a split size of, say, 25 MB in your Map/Reduce program; then there will be 4 input splits for the Map/Reduce program, and 4 Mappers will be assigned to the job.
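The mapper counts in these examples follow from how FileInputFormat computes the split size, essentially max(minSplitSize, min(maxSplitSize, blockSize)), with the number of map tasks roughly equal to the file size divided by that split size. Below is a rough sketch of this arithmetic, not the actual Hadoop source (which also handles remainders and unsplittable files):

```java
/** Rough sketch of how FileInputFormat derives split size and mapper count. */
public class SplitMath {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // FileInputFormat uses essentially this expression.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileSize = 100 * mb;   // the 100 MB file from the example
        long blockSize = 64 * mb;   // default 64 MB block

        // No user-defined split size: split == block size -> 2 mappers.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Minimum split size forced to 100 MB -> 1 mapper.
        long bigSplit = computeSplitSize(blockSize, 100 * mb, Long.MAX_VALUE);
        // Maximum split size capped at 25 MB -> 4 mappers.
        long smallSplit = computeSplitSize(blockSize, 1, 25 * mb);

        System.out.println(Math.ceil((double) fileSize / defaultSplit)); // 2.0
        System.out.println(Math.ceil((double) fileSize / bigSplit));     // 1.0
        System.out.println(Math.ceil((double) fileSize / smallSplit));   // 4.0
    }
}
```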
Conclusion:
- A split is a logical division of the input data, while a block is a physical division of the data.
- If no input split size is specified, the HDFS default block size is the default split size.
- The split size is user-defined, and the user can control it in his Map/Reduce program.
- One split can be mapped to multiple blocks, and one block can have multiple splits.
- The number of map tasks (Mappers) is equal to the number of splits.