Running a standalone Hadoop application on multiple CPU cores

Problem description

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) have the need to go for a multiserver Hadoop cluster, yet we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans), I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Given that the tool is very CPU intensive, this "single-threadedness" is my current bottleneck.

When running it in the NetBeans profiler I do see that the app starts several threads for various purposes, but only a single map/reduce is running at the same moment.

The input data consists of several input files, so Hadoop should at least be able to run one map thread per input file at the same time during the map phase.

What do I do to have at least 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?

I'm expecting this to be something very silly that I've overlooked.

I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. It implements the feature I was looking for in Hadoop 0.21: it introduces the flag mapreduce.local.map.tasks.maximum to control it.
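A minimal driver sketch of how that flag might be set (assuming Hadoop 0.21 or later; the class name LocalParallelDriver is a placeholder, and no Mapper/Reducer is configured here, so it just runs the identity map and reduce):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalParallelDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask the LocalJobRunner for up to 4 concurrent map tasks
        // (the property introduced by MAPREDUCE-1367, available from Hadoop 0.21).
        conf.setInt("mapreduce.local.map.tasks.maximum", 4);

        Job job = new Job(conf, "local parallel example");
        job.setJarByClass(LocalParallelDriver.class);
        // The real application would configure its own Mapper/Reducer classes here;
        // left unset, this sketch runs the identity map and reduce.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}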

For now I've also found the solution described here in this question.

Recommended answer

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.
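For context, with the old mapred API local mode is selected when mapred.job.tracker is left at its default value of "local", in which case the job runs inside a single-JVM LocalJobRunner; a small sketch (the class name CheckRunnerMode is a placeholder) to print the value in effect:

import org.apache.hadoop.conf.Configuration;

public class CheckRunnerMode {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "local" (the default) selects the LocalJobRunner inside the client JVM;
        // any other value is treated as a JobTracker host:port to submit to.
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
    }
}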

Anyway, to set the maximum number of running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default those options are set to 2, so I might be right.
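Those two properties are read by each TaskTracker when it starts, so on a real cluster they normally belong in mapred-site.xml on the worker nodes rather than in per-job code; a small sketch (the class name ShowTaskSlots is a placeholder) that prints the values a node would pick up from its configuration:

import org.apache.hadoop.conf.Configuration;

public class ShowTaskSlots {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Read by the TaskTracker at startup (normally set in mapred-site.xml);
        // the Hadoop default for both is 2.
        System.out.println("mapred.tasktracker.map.tasks.maximum    = "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
    }
}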

Finally, if you want to be prepared for a multinode cluster, go straight to running this in a fully-distributed way, but have all the servers (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.
