hadoop - How do I determine the right number of mappers in Hadoop?

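I gave my Hadoop program a 4 MB input file (100,000 records). Since each HDFS block is 64 MB and the file fits in a single block, I set the number of mappers to 1. However, when I increase the number of mappers (say, to 24), the running time improves. I don't understand why, because the whole file can be read by just one mapper.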

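A brief description of the algorithm: the configure function reads the clusters from the DistributedCache and stores them in a global variable called clusters. The mapper reads each chunk line by line and finds the cluster each line belongs to. Here is some of the code: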

public void configure(JobConf job) {
        // Retrieve the clusters from the DistributedCache.
        try {
            Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));

            String line;
            while ((line = reader.readLine()) != null) {
                // Construct the cluster represented by `line` and add it to a global variable called `clusters`.
            }

            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

And the mapper:

 public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
         // Assign each record to one of the existing clusters in `clusters`.

        String record = value.toString();
        EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
        outputValue.addRecord(record);
        int eqID = MondrianTree.findCluster(record, clusters);
        IntWritable outputKey = new IntWritable(eqID);
        output.collect(outputKey,outputValue);
    }

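I have input files of different sizes (from 4 MB to 4 GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.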

Best answer

It is not actually the case that the whole file can only be read by one mapper. A few points to keep in mind:

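A single block is replicated 3 times (by default), which means three separate nodes can access the same block without going over the network.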

There is no reason a single block can't be copied to multiple machines, with each machine then seeking to the split it has been allocated.
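
The point about splits can be made concrete. As a rough sketch (not Hadoop's actual code), the old-API `org.apache.hadoop.mapred.FileInputFormat` treats the requested number of map tasks as a hint: it divides the total input size by that number to get a goal size, then uses max(minSize, min(goalSize, blockSize)) as the split size. The class and method names below are made up for illustration, and the model assumes a single splittable, uncompressed file and a minimum split size of 1 byte.

```java
// Simplified, standalone model of how the old mapred API sizes input splits.
// SplitEstimate and estimateSplits are hypothetical names, not Hadoop classes.
class SplitEstimate {

    // Split-size rule: max(minSize, min(goalSize, blockSize)).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Estimates how many splits (and therefore map tasks) one file produces.
    static int estimateSplits(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        // The last split may be up to 10% larger than the others.
        final double splitSlop = 1.1;
        long remaining = fileSize;
        int splits = 0;
        while ((double) remaining / splitSize > splitSlop) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(estimateSplits(4 * mb, 1, 1, 64 * mb));   // prints 1
        System.out.println(estimateSplits(4 * mb, 24, 1, 64 * mb));  // prints 24
    }
}
```

Under this model, asking for 24 map tasks on the 4 MB file yields 24 splits of roughly 170 KB each, so the per-record work in map can run in parallel on several nodes rather than on one core. That would be consistent with the shorter running time reported in the question.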

A similar question about determining the right number of mappers in Hadoop can be found on Stack Overflow: https://stackoverflow.com/questions/16972589/