I'm fairly new to map-reduce, so it would be great if someone could guide me on the following questions.

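  • I am using MultipleOutputs to write separate output files in MapReduce. Suppose my input file contains fruits and vegetables, so the output is split into two files, fruit and vegetable, as shown below.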

    fruit-r-00000, vegetable-r-00000, part-r-00000

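    I'm confused about how many reducers will run. I know that the number of reducers defaults to 1, and since the numeric part of the file names is the same, I believe only one reducer runs. Is my understanding correct?
    Also, why is the part-r-00000 file created at all? I write all of my output to either the fruit file or the vegetable file.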
  • If I have 1 GB of data to process, how do I decide on the optimal number of reducers?
  • Best answer

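    One reducer will run. The reducer count is not derived from the file names; the 00000 suffix only identifies the reduce task that wrote the file. The number of reducers is either specified by the user or left at the default, which is 1 in plain MapReduce. Higher-level frameworks may instead estimate it from the size of the input and the amount of work the reducers have to do.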
    
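    part-r-00000: this is the job's default output file, and its name comes from partitioning. Since there is only one reducer, every partition maps to this single file, and the file is created even when all records are written through the named outputs.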
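
To make the point about part-r-00000 concrete, here is a minimal sketch of a job that writes to two named outputs and wraps the default output in `LazyOutputFormat`, so the default part file is only created if something is actually written to it. The class names `SplitByCategory`, `CategoryMapper` and `SplitReducer`, and the assumed input layout (`fruit<TAB>apple`, `vegetable<TAB>carrot`), are hypothetical and not taken from the question. It needs the Hadoop MapReduce client libraries on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitByCategory {

  // Assumption: each input line looks like "fruit<TAB>apple" or "vegetable<TAB>carrot".
  public static class CategoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  public static class SplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text category, Iterable<Text> items, Context context)
        throws IOException, InterruptedException {
      // Route every record to a named output; nothing goes to context.write().
      String name = "fruit".equals(category.toString()) ? "fruit" : "vegetable";
      for (Text item : items) {
        mos.write(name, category, item);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      mos.close(); // flushes fruit-r-00000 and vegetable-r-00000
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split by category");
    job.setJarByClass(SplitByCategory.class);
    job.setMapperClass(CategoryMapper.class);
    job.setReducerClass(SplitReducer.class);
    job.setNumReduceTasks(1); // one reduce task, hence the single 00000 suffix
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    MultipleOutputs.addNamedOutput(job, "fruit", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "vegetable", TextOutputFormat.class, Text.class, Text.class);

    // Create the default part-r-00000 lazily, i.e. only on the first context.write().
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Raising the value passed to `setNumReduceTasks` would produce one file per named output per reduce task that writes to it, with suffixes 00000, 00001, and so on.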
    
    In most cases the number of reducers is specified by the user. It depends mostly on the amount of work that has to be done in the reducers. The number should not be very large, though, because of the algorithm the mappers use to distribute data among reducers. Some frameworks, such as Hive, can calculate the number of reducers using an empirical rule of 1 GB of output per reducer.
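
The "1 GB per reducer" rule of thumb above can be sketched as a small calculation. This is only an illustration of the heuristic, not how Hive or Hadoop implement it; the class `ReducerEstimate`, the method `estimateReducers` and the cap `maxReducers` are hypothetical names introduced here.

```java
final class ReducerEstimate {

    // Rounds totalBytes / bytesPerReducer up, then clamps the result to [1, maxReducers].
    static int estimateReducers(long totalBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers < 1) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        long needed = totalBytes / bytesPerReducer
                + (totalBytes % bytesPerReducer == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(needed, (long) maxReducers));
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // 1 GB of data at 1 GB per reducer needs a single reducer.
        System.out.println(estimateReducers(oneGb, oneGb, 100)); // prints 1
    }
}
```

Under this rule the 1 GB input from the question ends up with one reducer. The result could then be passed to `job.setNumReduceTasks(...)` in the driver.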
    

    Regarding hadoop - MultitpleOutputFormat-Hadoop, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26232216/

    10-12 20:32