This article explains the relationship between split size and block size in Hadoop; hopefully it serves as a useful reference for anyone facing the same question.

Problem Description

What is the relationship between split size and block size in Hadoop? As I read in this, the split size must be n times the block size (n is an integer and n > 0). Is this correct? Is there any required relationship between split size and block size?

Recommended Answer

HDFS architecture has the concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: there will be 1 GB / 64 MB = 16 blocks, and these blocks will be distributed across the DataNodes. Which DataNodes the blocks end up on depends on your cluster configuration.
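As a minimal sketch of where the block size comes from (assuming a Java HDFS client; the path /data/input.bin is hypothetical), the default block size can be inspected, and a per-file block size can be supplied when a file is created:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/input.bin"); // hypothetical path

        // Default block size the cluster would use for this path (e.g. 64 MB or 128 MB)
        long defaultBlockSize = fs.getDefaultBlockSize(path);
        System.out.println("Default block size: " + defaultBlockSize);

        // Create a file with an explicit 64 MB block size instead of the default
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, fs.getDefaultReplication(path), blockSize)) {
            out.writeBytes("example payload");
        }
    }
}
```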

Data splitting happens based on file offsets. The goal of splitting a file and storing it in different blocks is parallel processing and failover of the data.
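To make the offset-based view concrete: from a map task's perspective, a split is just a (path, offset, length) description of part of a file, not a physical copy of the data. A hedged sketch of a Mapper that logs which byte range of which file it was assigned (assumes a FileInputFormat-based input, where splits are FileSplits):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The split this mapper processes: a logical byte range of one file.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("File:   " + split.getPath());
        System.out.println("Offset: " + split.getStart());
        System.out.println("Length: " + split.getLength());
    }
}
```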

The difference between block size and split size:

A split is a logical division of the data, mainly used during data processing with a Map/Reduce program or other data-processing techniques in the Hadoop ecosystem. The split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing.

The split is essentially what controls the number of Mappers in a Map/Reduce program. If you have not defined any input split size in your Map/Reduce program, the default HDFS block size is used as the input split size.
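This default is not arbitrary: Hadoop's FileInputFormat derives the split size from the configured minimum and maximum split sizes and the block size. Below is a simplified rendering of that rule (a sketch, not the library source), with the three cases used in the examples that follow:

```java
public class SplitSizeFormula {
    // Simplified rendering of FileInputFormat's split-size rule:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size
        System.out.println(computeSplitSize(64 * mb, 1, Long.MAX_VALUE) / mb); // 64
        // minSize raised to 100 MB: splits grow beyond one block
        System.out.println(computeSplitSize(64 * mb, 100 * mb, Long.MAX_VALUE) / mb); // 100
        // maxSize lowered to 25 MB: splits shrink below one block
        System.out.println(computeSplitSize(64 * mb, 1, 25 * mb) / mb); // 25
    }
}
```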

Example:

Suppose you have a 100 MB file and the HDFS default block size is 64 MB. The file will be chopped up and will occupy 2 blocks. Now you have a Map/Reduce program to process this data, but you have not specified any input split size; then, based on the number of blocks (2), 2 input splits will be used for the Map/Reduce processing and 2 mappers will be assigned to the job.

But suppose you specify a split size of, say, 100 MB in your Map/Reduce program. Then both blocks (2 blocks) will be treated as a single split for the Map/Reduce processing, and 1 mapper will be assigned to the job.

Suppose instead you specify a split size of, say, 25 MB in your Map/Reduce program. Then there will be 4 input splits for the Map/Reduce program, and 4 mappers will be assigned to the job. A configuration sketch for both cases follows below.
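One hedged way to set these split sizes in practice (assuming the new mapreduce API; the driver class, input path, and output path here are placeholders) is through FileInputFormat's minimum and maximum split-size settings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        job.setJarByClass(SplitSizeDriver.class);
        // Mapper/Reducer classes omitted; this sketch only shows split configuration.

        long mb = 1024L * 1024;
        // Case 1: ~100 MB splits (one mapper for a 100 MB file) by raising the minimum.
        FileInputFormat.setMinInputSplitSize(job, 100 * mb);
        // Case 2: ~25 MB splits (four mappers for a 100 MB file) by lowering the maximum.
        // FileInputFormat.setMaxInputSplitSize(job, 25 * mb);

        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same values can also be supplied as the configuration properties mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.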

Conclusion:

  1. A split is a logical division of the input data, while a block is a physical division of the data.
  2. If no input split size is specified, the HDFS default block size is used as the default split size.
  3. The split size is user-defined; the user can control the split size in their Map/Reduce program.
  4. One split can map to multiple blocks, and one block can belong to multiple splits.
  5. The number of map tasks (Mappers) equals the number of splits.

This concludes the article on split size versus block size in Hadoop. Hopefully the answer above is helpful.
