This article describes how to deal with an HDFS DataNode that disconnects from its NameNode. It may be a useful reference for anyone hitting the same problem; the question and answer follow below.

Problem Description



From time to time I get the following errors in Cloudera Manager:

This DataNode is not connected to one or more of its NameNode(s).

and

The Cloudera Manager agent got an unexpected response from this role's web server.

(usually together, sometimes only one of them)
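
The second error usually means the Cloudera Manager agent could not reach the DataNode's embedded web server. As a rough sanity check (a sketch only; 50075 is the default DataNode infoPort and matches the infoPort shown in the logs below), something like this can be run on the affected host:

  # Check that the DataNode web server answers locally (50075 = default infoPort)
  curl -sf http://localhost:50075/jmx > /dev/null && echo "DataNode web server is responding"

  # Check that the Cloudera Manager agent itself is running on this host
  sudo service cloudera-scm-agent status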

In most references to these errors on SO and Google, the issue is a configuration problem (and the data node never connects to the name node).

In my case the data nodes usually connect at startup, but lose the connection after some time - so it doesn't appear to be a bad configuration.

  • Any other options?
  • Is it possible to force the data node to reconnect to the name node? (See the sketch after this list.)
  • Is it possible to "ping" the name node from the data node (simulate the connection attempt of the data node)?
  • Could it be some kind of resource problem (too many open files / connections)?
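
A rough sketch of the last three checks, run on a DataNode host. namenode-host and port 8020 are placeholders - take the real values from fs.defaultFS / dfs.namenode.rpc-address in your cluster configuration:

  # "Ping" the NameNode RPC port from this DataNode host
  nc -vz namenode-host 8020

  # Ask the NameNode which DataNodes it currently considers live or dead
  hdfs dfsadmin -report | grep -A 1 "Name:"

  # Force a reconnect by restarting the DataNode daemon on this host
  # (on a Cloudera Manager managed cluster it is cleaner to restart the role from the CM UI)
  sudo service hadoop-hdfs-datanode restart

  # Rough check for a "too many open files / connections" resource problem
  ulimit -n
  lsof -p "$(pgrep -f proc_datanode | head -1)" | wc -l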

Sample logs (the errors vary from time to time):

2014-02-25 06:39:49,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,180 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.144.18:50010, dest: /10.56.144.28:48089, bytes: 132096, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1315770947_27, offset: 0, srvID: DS-990970275-10.56.144.18-50010-1384349167420, blockid: BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440, duration: 480291679056
2014-02-25 06:39:49,180 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.144.18, storageID=DS-990970275-10.56.144.18-50010-1384349167420, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster16;nsid=7043943;c=0):Got exception while serving BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440 to /10.56.144.28:48089
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: host.com:50010:DataXceiver error processing READ_BLOCK operation  src: /10.56.144.28:48089 dest: /10.56.144.18:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
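
For what it's worth, the 480000 ms in these exceptions matches the default of dfs.datanode.socket.write.timeout (8 minutes): the DataNode gave up on a read request because the remote client stopped draining the socket, which by itself does not break the DataNode-NameNode heartbeat. A small sketch for inspecting the relevant timeouts (whether raising them is the right fix depends on your CDH version and workload):

  # Print the effective socket timeouts as the local configuration sees them
  # (on some versions getconf only reports keys explicitly set in hdfs-site.xml)
  hdfs getconf -confKey dfs.datanode.socket.write.timeout   # default 480000 ms
  hdfs getconf -confKey dfs.client.socket-timeout           # default 60000 ms
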
Solution

If you're using Linux, then please make sure that you have configured these properties correctly:

  1. Disable SELinux.

Type the command getenforce on the CLI; if it shows Enforcing, SELinux is enabled. Change it in the /etc/selinux/config file, as sketched below.
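
A minimal sketch of that check and change, assuming a RHEL/CentOS-style host (a reboot is needed for the config-file change to fully take effect):

  # Show the current SELinux mode (Enforcing / Permissive / Disabled)
  getenforce

  # Turn enforcement off immediately on the running system
  sudo setenforce 0

  # Make it permanent: set SELINUX=disabled in /etc/selinux/config
  sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config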

  2. Disable the firewall (see the combined sketch after this list).

  3. Make sure you have the NTP service installed.

  4. Make sure your server can SSH to all client nodes.

  5. Make sure all the nodes have an FQDN (Fully Qualified Domain Name) and have an entry in /etc/hosts with name and IP.
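
A combined sketch of checks 2 through 5, assuming RHEL/CentOS-era hosts with iptables and ntpd (newer systems use firewalld and chronyd instead); node1 and node2 are placeholder hostnames:

  # 2. Check and stop the firewall
  sudo service iptables status
  sudo service iptables stop && sudo chkconfig iptables off

  # 3. Check that NTP is installed and the clock is in sync
  sudo service ntpd status
  ntpq -p

  # 4. Check passwordless SSH from this server to the other nodes
  ssh node1 hostname
  ssh node2 hostname

  # 5. Check that the local FQDN and /etc/hosts entries resolve consistently
  hostname -f
  getent hosts node1 node2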

If these settings are already in place, then please attach the log of any of your DataNodes that got disconnected.

That concludes this article on an HDFS DataNode disconnecting from its NameNode. We hope the suggested answer helps, and thanks for your continued support!
