So far I have tried the solutions from here (1) and here (2) for this problem. However, while these solutions do result in the MapReduce job being executed, they appear to run only on the name node, as I get output similar to (3).

Basically, I am running a 2-node cluster with a MapReduce algorithm of my own design. The MapReduce jar executes perfectly on a single-node cluster, which leads me to believe that something is wrong with my Hadoop multi-node configuration. To set up the multi-node cluster, I followed the tutorial here.

To report what is going wrong: when I execute my program (after checking that the namenode, tasktrackers, jobtrackers and datanodes are running on their respective nodes), it stalls in the terminal at the following line:
INFO mapred.JobClient: map 100% reduce 0%
If I look at the task logs, I see:
copy failed: attempt... from slave-node (followed by a SocketTimeoutException)
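
In Hadoop 1.x the reduce-side copy phase fetches map output over HTTP from each TaskTracker (port 50060 by default), so a SocketTimeoutException at this point usually indicates a hostname-resolution or connectivity problem between the nodes rather than a problem in the job itself. A minimal check, run on both master and slave, might look like the following (the host names master and slave and port 50060 are the defaults assumed here, and netcat must be installed for the last step):

    # Show this machine's hostname and how it resolves through /etc/hosts
    hostname
    getent hosts $(hostname)

    # Both cluster names should resolve to the real LAN addresses, not to 127.x
    getent hosts master
    getent hosts slave

    # Each node must be able to reach the other TaskTracker's HTTP (shuffle) port
    nc -zv master 50060
    nc -zv slave 50060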

Looking at the logs on my slave node (DataNode), execution stops at the following line:
TaskTracker: attempt... 0.0% reduce > copy >
As the solutions in links 1 and 2 suggest, removing various IP addresses from the /etc/hosts file does lead to successful execution, but then I end up with the entry from link 4 in the slave node (DataNode) log:
INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381
WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.
This seems strange to me as a new Hadoop user, although it may be perfectly normal to see this. To me it looks as though the wrong IP address was being pointed to in the hosts file, and by removing that IP address I merely halt execution on the slave node, so the processing simply continues on the namenode instead (which is of no advantage at all).
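
One way to check where each task actually ran on a Hadoop 1.x cluster is the JobTracker web UI, which lists every task attempt together with the machine it executed on; the port and commands below are the stock defaults and are only a suggested sketch, not something taken from the original post:

    # JobTracker web UI: drill into the job -> map/reduce tasks -> attempt -> "Machine" column
    # http://master:50030/

    # From the command line, list jobs and query one of them
    hadoop job -list
    hadoop job -status job_201201301055_0381   # job id taken from the log above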

To summarize:

  • Is this output to be expected?
  • Is there any way I can see what was executed on which node after the run?
  • Can anybody spot anything that I might have done wrong?

  • Edit: hosts and configuration files added for each node

    Master: /etc/hosts
    127.0.0.1       localhost
    127.0.1.1       joseph-Dell-System-XPS-L702X
    
    #The following lines are for hadoop master/slave setup
    192.168.1.87    master
    192.168.1.74    slave
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    Slave: /etc/hosts
    127.0.0.1       localhost
    127.0.1.1       joseph-Home # this line was incorrect, it was set as 7.0.1.1
    
    #the following lines are for hadoop mutli-node cluster setup
    192.168.1.87    master
    192.168.1.74    slave
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    Master: core-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:54310</value>
            <description>The name of the default file system. A URI whose
            scheme and authority determine the FileSystem implementation. The
            uri’s scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class. The uri’s authority is used to
            determine the host, port, etc. for a filesystem.</description>
        </property>
    </configuration>
    

    Slave: core-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/hduser/tmp</value>
            <description>A base for other temporary directories.</description>
        </property>
    
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:54310</value>
            <description>The name of the default file system. A URI whose
            scheme and authority determine the FileSystem implementation. The
            uri’s scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class. The uri’s authority is used to
            determine the host, port, etc. for a filesystem.</description>
        </property>
    
    </configuration>
    

    Master: hdfs-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is created.
            The default is used if replication is not specified in create time.
            </description>
        </property>
    </configuration>
    

    Slave: hdfs-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is created.
            The default is used if replication is not specified in create time.
            </description>
        </property>
    </configuration>
    

    Master: mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>master:54311</value>
            <description>The host and port that the MapReduce job tracker runs
            at. If “local”, then jobs are run in-process as a single map
            and reduce task.
            </description>
        </property>
    </configuration>
    

    Slave: mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    
        <property>
            <name>mapred.job.tracker</name>
            <value>master:54311</value>
            <description>The host and port that the MapReduce job tracker runs
            at. If “local”, then jobs are run in-process as a single map
            and reduce task.
            </description>
        </property>
    
    </configuration>
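
    Since both nodes point fs.default.name at hdfs://master:54310 and mapred.job.tracker at master:54311, the slave must be able to resolve master and reach those two ports. A quick sanity check from the slave might look like this (assuming netcat is available; this is a suggested check, not part of the original post):

        # Run on the slave: "master" should resolve to 192.168.1.87, not a loopback address
        getent hosts master

        # The NameNode (54310) and JobTracker (54311) ports from the configs must be reachable
        nc -zv master 54310
        nc -zv master 54311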
    

    Best answer

    The error was in /etc/hosts:

    During the faulty runs, the slave's /etc/hosts file looked like this:

    127.0.0.1       localhost
    7.0.1.1       joseph-Home # THIS LINE IS INCORRECT, IT SHOULD BE 127.0.1.1
    
    #the following lines are for hadoop mutli-node cluster setup
    192.168.1.87    master
    192.168.1.74    slave
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    As you may have spotted, the IP address of this machine, joseph-Home, was configured incorrectly: it was set to 7.0.1.1 when it should have been 127.0.1.1. Changing line 2 of the slave's /etc/hosts file to 127.0.1.1 joseph-Home fixed the problem, and my logs now show up normally on the slave node.

    The new /etc/hosts file:
    127.0.0.1       localhost
    127.0.1.1       joseph-Home # corrected from 7.0.1.1
    
    #the following lines are for hadoop mutli-node cluster setup
    192.168.1.87    master
    192.168.1.74    slave
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
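
    After correcting /etc/hosts like this, the daemons need to be restarted for the change to take effect. A typical Hadoop 1.x sequence, run from $HADOOP_HOME on the master as the hduser account used in the configs above (a sketch of the standard scripts, not something stated in the answer):

        # Restart the whole cluster from the master
        bin/stop-all.sh
        bin/start-all.sh

        # On each node, jps should now list the expected daemons, e.g.
        # master: NameNode, SecondaryNameNode, JobTracker (plus DataNode/TaskTracker if it also runs workers)
        # slave:  DataNode, TaskTracker
        jps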
    

    Regarding java - Hadoop cluster stuck at Reduce > copy >, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18634825/
