Today the datanode service on one of our servers stopped on its own. The datanode log showed the following error:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 1, volumes configured: 2, volumes failed: 1, volume failures tolerated: 0
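
When this has to be checked on many nodes, the four counters in that message can be pulled out with a regular expression. This is only a rough sketch: the function name `parse_volume_failure` and the dict keys are my own, and the pattern is matched against the exact wording of the log line above, which may differ in other Hadoop versions.

```python
import re

# Matches the counters in the "Too many failed volumes" message quoted above.
PATTERN = re.compile(
    r"current valid volumes: (\d+), "
    r"volumes configured: (\d+), "
    r"volumes failed: (\d+), "
    r"volume failures tolerated: (\d+)"
)


def parse_volume_failure(line):
    """Return the four counters as a dict, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    valid, configured, failed, tolerated = map(int, m.groups())
    return {
        "valid": valid,
        "configured": configured,
        "failed": failed,
        "tolerated": tolerated,
    }


LOG_LINE = (
    "org.apache.hadoop.util.DiskChecker$DiskErrorException: "
    "Too many failed volumes - current valid volumes: 1, "
    "volumes configured: 2, volumes failed: 1, volume failures tolerated: 0"
)

info = parse_volume_failure(LOG_LINE)
print(info)
```

Here `failed` (1) is larger than `tolerated` (0), which is why the datanode gave up.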

This means a volume has failed. The relevant settings are in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>/diskb/hadoop/hdfs/data,/diskc/hadoop/hdfs/data,/diskd/hadoop/hdfs/data</value>
</property>

<property>
        <name>dfs.datanode.failed.volumes.tolerated</name>
        <value>0</value>
</property>


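With dfs.datanode.failed.volumes.tolerated set to 0, the datanode service stops as soon as any one of the data disks listed in dfs.datanode.data.dir fails.
If it is set to 1, the datanode keeps running with one failed disk.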
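
To make the rule concrete, here is a small sketch that reads the two properties from an hdfs-site.xml fragment and applies a simplified version of the check. The `<configuration>` wrapper, the helper names `read_properties` and `datanode_keeps_running`, and the exact form of the condition are assumptions for illustration. The real check lives inside the datanode and may differ between versions.

```python
import xml.etree.ElementTree as ET

# The two properties from above, wrapped in a <configuration> root element.
HDFS_SITE = """<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/diskb/hadoop/hdfs/data,/diskc/hadoop/hdfs/data,/diskd/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>0</value>
</property>
</configuration>"""


def read_properties(xml_text):
    """Collect <name>/<value> pairs from a Hadoop-style config into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def datanode_keeps_running(failed, configured, tolerated):
    """Simplified model (an assumption): the datanode stops when failures
    exceed the tolerated count or when no working volume is left."""
    return failed <= tolerated and failed < configured


props = read_properties(HDFS_SITE)
data_dirs = [d for d in props["dfs.datanode.data.dir"].split(",") if d]
tolerated = int(props["dfs.datanode.failed.volumes.tolerated"])

print(len(data_dirs), tolerated)
print(datanode_keeps_running(1, len(data_dirs), tolerated))  # tolerated = 0
print(datanode_keeps_running(1, len(data_dirs), 1))          # tolerated = 1
```

With the value at 0 a single failed disk stops the datanode, while with 1 the same failure is survived.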


#dmesg
The output contains a large number of I/O errors:
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
__ratelimit: 8 callbacks suppressed
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 32 00 00 00 28 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760773, block=706740260
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 32 00 00 00 28 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760802, block=706740262
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760734, block=706740257
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 2a 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 3a 00 00 10 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 52 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 62 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 42 00 00 08 00
end_request: I/O error, dev sdd, sector 293601570
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=143361, block=36700192
end_request: I/O error, dev sdd, sector 5750391074
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2807809, block=718798880
end_request: I/O error, dev sdd, sector 5653922090
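
With this much output it helps to count the errors per device instead of reading them line by line. The sketch below does that for the two message formats seen above. The function name `summarize_dmesg` is my own, and the patterns only cover the wording in this particular kernel log, since newer kernels phrase these messages differently.

```python
import re
from collections import Counter

# Patterns for the two kinds of lines that appear in the dmesg output above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
EXT4_ERROR = re.compile(r"EXT4-fs error \(device (\w+)\)")


def summarize_dmesg(text):
    """Count block-layer I/O errors per disk and EXT4 errors per partition."""
    io_errors = Counter()
    fs_errors = Counter()
    for line in text.splitlines():
        m = IO_ERROR.search(line)
        if m:
            io_errors[m.group(1)] += 1
            continue
        m = EXT4_ERROR.search(line)
        if m:
            fs_errors[m.group(1)] += 1
    return io_errors, fs_errors


# A few lines copied from the output above.
SAMPLE = """\
sd 3:0:0:0: [sdd] Unhandled error code
end_request: I/O error, dev sdd, sector 293601570
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=143361, block=36700192
end_request: I/O error, dev sdd, sector 5750391074
"""

io_errors, fs_errors = summarize_dmesg(SAMPLE)
print(dict(io_errors))
print(dict(fs_errors))
```

If every count lands on the same device, as it does here with sdd, that is a strong hint about which disk to look at.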

Trying to create a new file fails with:
#touch 111
touch: cannot touch `111': Read-only file system
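
The same condition can be confirmed without writing anything, by looking at the mount options in /proc/mounts. The sketch below parses text in that format. The sample entries, including the assumption that /dev/sdd1 is mounted at /diskd, are hypothetical and not taken from the server.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    If the mount point appears more than once, the last entry wins.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mount_point):
    """True if the mount point exists and carries the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical sample, not copied from the real machine.
SAMPLE_MOUNTS = """\
/dev/sdc1 /diskc ext4 rw,relatime 0 0
/dev/sdd1 /diskd ext4 ro,relatime 0 0
"""

print(is_read_only(SAMPLE_MOUNTS, "/diskc"))
print(is_read_only(SAMPLE_MOUNTS, "/diskd"))
```

On a live system the first argument would be the contents of /proc/mounts, for example `open("/proc/mounts").read()`.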

Check the drive's health:
smartctl -H /dev/sdd

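Note
the value after "result": PASSED means the SMART self-assessment considers the drive healthy.
If it shows FAILED instead, it is best to replace the drive in the server right away.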
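
For monitoring scripts, the result line can be extracted from the smartctl output automatically. This is a sketch that handles the two result formats I am aware of (the "test result:" line for ATA drives and the "SMART Health Status:" line for SCSI/SAS drives). The function names and the sample text are my own, not output captured from this server.

```python
import re

# Result line printed for ATA/SATA drives.
ATA_RESULT = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)
# Result line printed for SCSI/SAS drives.
SCSI_RESULT = re.compile(r"SMART Health Status:\s*(\S+)")


def smart_health(output):
    """Return the health word from `smartctl -H` output, or None if absent."""
    for pattern in (ATA_RESULT, SCSI_RESULT):
        m = pattern.search(output)
        if m:
            # smartctl prints "FAILED!" with a trailing exclamation mark.
            return m.group(1).rstrip("!")
    return None


def is_healthy(output):
    """True only when the drive reports PASSED (ATA) or OK (SCSI)."""
    return smart_health(output) in ("PASSED", "OK")


# Illustrative sample text, not taken from the failing drive.
SAMPLE = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

print(smart_health(SAMPLE))
print(is_healthy(SAMPLE))
```

The real output could be obtained with `subprocess.run(["smartctl", "-H", "/dev/sdd"], capture_output=True, text=True).stdout`, run as root. Keep in mind that a drive can report PASSED and still throw I/O errors, so the dmesg output should be weighed as well.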


The sdd drive is clearly the one at fault. Remove this node from the Hadoop cluster,
umount the drive, swap in a new one, format and mount it, then add the server back to the Hadoop cluster.
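
The disk-side part of that procedure can be written down as a dry-run command list before touching the machine. Everything specific in the sketch is an assumption: that /dev/sdd1 is mounted at /diskd, that the new drive already has a partition and is formatted as ext4 again, and the helper name `replacement_plan`. Removing the node from the cluster and adding it back are not covered. The script only prints the commands and does not run them.

```python
def replacement_plan(device, mount_point, data_dir):
    """Build the list of commands for swapping a failed data disk (dry run)."""
    return [
        ["umount", mount_point],
        # ... physically replace the drive and partition it here ...
        ["mkfs.ext4", device],
        ["mount", device, mount_point],
        ["mkdir", "-p", data_dir],
    ]


# Assumed values: device and mount point are guesses based on the logs above.
plan = replacement_plan("/dev/sdd1", "/diskd", "/diskd/hadoop/hdfs/data")
for command in plan:
    print(" ".join(command))
```

Printing the plan first makes it easy to double-check the device name, which matters because mkfs destroys whatever is on the target.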

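Some people online suggest booting into Linux rescue mode and running fsck on the drive, but to avoid the problem coming back it is better to simply replace it.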



12-18 07:09