

Does Hadoop split the data based on the number of mappers set in the program? That is, given a data set of size 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?


Besides, do all the mappers run simultaneously, or might some of them run serially?


I just ran a sample MR program based on your question, and here is what I found:


Input: a file smaller than the block size.

Case 1: Number of mappers = 1 Result: 1 map task launched. InputSplit size for each mapper (only one in this case) is the same as the input file size.

Case 2: Number of mappers = 5 Result: 5 map tasks launched. InputSplit size for each mapper is one fifth of the input file size.

Case 3: Number of mappers = 10 Result: 10 map tasks launched. InputSplit size for each mapper is one tenth of the input file size.


So, based on the above, for a file smaller than the block size,

split size = total input file size / number of map tasks launched.
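To make the arithmetic concrete, here is a rough sketch in Python (not Hadoop code). It assumes the rule used by the old `org.apache.hadoop.mapred` `FileInputFormat`, where the requested number of map tasks is only a hint: goal size = total size / requested maps, and split size = max(min size, min(goal size, block size)). The function name `estimated_split_size`, the 64 MB block size, and the 50 MB test file are made up for illustration. Other input formats and the newer `mapreduce` API may behave differently.

```python
MB = 1024 * 1024


def estimated_split_size(total_bytes, requested_maps, block_bytes, min_bytes=1):
    """Estimate the split size for one input file.

    Assumption: mirrors the old mapred FileInputFormat rule
    max(min_size, min(goal_size, block_size)), where
    goal_size = total_bytes / requested_maps (integer division).
    """
    if requested_maps <= 0:
        requested_maps = 1  # treat a missing hint as a single map task
    goal_bytes = total_bytes // requested_maps
    return max(min_bytes, min(goal_bytes, block_bytes))


if __name__ == "__main__":
    block = 64 * MB  # assumed block size, not taken from the post

    # The three cases above, with a hypothetical 50 MB file (smaller than a block).
    for maps in (1, 5, 10):
        size = estimated_split_size(50 * MB, maps, block)
        print(f"{maps} mappers requested: split size = {size / MB:.1f} MB")

    # The case from the question: 500 MB of input, 200 mappers requested.
    size = estimated_split_size(500 * MB, 200, block)
    print(f"200 mappers requested: split size = {size / MB:.1f} MB")
```

Under these assumptions the 500 MB / 200 mapper case does come out to 2.5 MB per split, because the goal size is below the block size. If fewer mappers were requested, so that the goal size exceeded the block size, the split size would be capped at the block size and more map tasks than requested would be launched.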



05-29 03:26