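I am trying to run a MapReduce job over a ~10 TB HBase table using a subclass of TableMapper. It essentially rewrites the entire table. The output is configured as follows: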
FileOutputFormat.setOutputPath(job, tablePath);
TableMapReduceUtil.initTableMapperJob(
    inputTableName,
    tblScanner,
    ResaltMapper.class,
    ImmutableBytesWritable.class, // outputKeyClass
    KeyValue.class,               // outputValueClass
    job);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
I have tried running this job several times, and each time it dies after several hours. I see the following messages in the application log:
{"timeStamp":"18/02/17 14:48:26,375","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000063_0 "}
{"timeStamp":"18/02/17 14:48:26,376","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000101_0 "}
{"timeStamp":"18/02/17 14:48:26,377","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000099_0 "}
{"timeStamp":"18/02/17 14:48:26,377","level":"WARN","category":"output.FileOutputCommitter","message":"Could not delete hdfs://trinity/data/trinity/hfiles/TABLE/_temporary/1/_temporary/attempt_1518830631967_0004_m_000112_0 "}
{"timeStamp":"18/02/17 14:48:26,381","level":"WARN","category":"hdfs.DFSClient","message":"Slow ReadProcessor read fields took 152920ms (threshold=30000ms); ack: seqno: 1 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 20402922, targets: [DatanodeInfoWithStorage[10.40.177.236:50010,DS-4d0bd79b-eaf3-4ec0-93f1-203b74bdf87b,DISK], DatanodeInfoWithStorage[10.40.176.118:50010,DS-8506c9ff-206d-48c5-b476-04b8dc396a1c,DISK], DatanodeInfoWithStorage[10.40.186.216:50010,DS-36dece52-50c7-47b0-a202-2ee595fabbcc,DISK]] "}
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I also see this message in the application report:
NodeHealthReport 1/1 local-dirs are bad: /mnt/yarn/local; 1/1 log-dirs are bad: /mnt/yarn/logs
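I am not sure whether these messages are related to the failure. The cluster has plenty of free space overall: it consists of 4 d2.8xlarge instances (96 2 TB HDDs across the 4 machines). However, individual drives are filling up. For example, during the current job one drive has only about 9 GB free, even though the other drives are nearly half empty: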
$ df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/xvda1                    99G  5.0G   90G   6% /
none                         4.0K     0  4.0K   0% /sys/fs/cgroup
udev                         121G   12K  121G   1% /dev
tmpfs                         25G  672K   25G   1% /run
none                         5.0M     0  5.0M   0% /run/lock
none                         121G   32K  121G   1% /run/shm
none                         100M     0  100M   0% /run/user
/dev/mapper/ephemeral_luks0  1.8T  1.7T  9.0G 100% /mnt
/dev/mapper/ephemeral_luks1  1.8T  974G  767G  56% /mnt1
/dev/mapper/ephemeral_luks2  1.8T  982G  760G  57% /mnt2
/dev/mapper/ephemeral_luks3  1.8T  997G  745G  58% /mnt3
/dev/mapper/ephemeral_luks4  1.8T  982G  760G  57% /mnt4
...snip...
Does anyone know what is causing this? How can I fix it?
Best answer
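I figured it out. The cause was that yarn.nodemanager.local-dirs was set to only a single HDD on each node in the cluster, so all of YARN's local intermediate data landed on that one drive until it filled up. Listing every HDD on each node in that property fixed the problem.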
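For illustration, here is a minimal sketch of how the comma-separated value for that property might be assembled. It assumes the mount points visible in the df output above (/mnt through /mnt4; the snipped ones would need to be added) and reuses the yarn/local and yarn/logs sub-paths from the NodeHealthReport. The class and the helper name buildDirList are hypothetical, not part of Hadoop; the resulting strings would go into yarn-site.xml on each node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LocalDirsSketch {

    // Hypothetical helper: appends subPath to every mount point and joins
    // the results with commas, the list format YARN expects for
    // yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
    static String buildDirList(List<String> mountPoints, String subPath) {
        List<String> dirs = new ArrayList<>();
        for (String mount : mountPoints) {
            dirs.add(mount + "/" + subPath);
        }
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        // Assumption: only the mounts shown in the df output; the real
        // nodes have more drives that were snipped from the listing.
        List<String> mounts = Arrays.asList("/mnt", "/mnt1", "/mnt2", "/mnt3", "/mnt4");

        System.out.println("yarn.nodemanager.local-dirs="
                + buildDirList(mounts, "yarn/local"));
        System.out.println("yarn.nodemanager.log-dirs="
                + buildDirList(mounts, "yarn/logs"));
    }
}
```

With more than one directory listed, the NodeManager can spread container-local data across the drives instead of filling a single one, which is the situation the df output shows for /mnt.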
Regarding "hadoop - Large MapReduce job keeps dying", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48843620/