Problem description
I know that the "getmerge" command in the shell can do this.
But what should I do if I want to merge these outputs after the job using the HDFS API for Java?
What I actually want is a single merged file on HDFS.
The only thing I can think of is to start an additional job after that.
Thanks!
I'm guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments. FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead call Path.getFileSystem on the destination path, so that the output stream is opened on HDFS.
That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.
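For illustration, here is a rough sketch of doing the same merge by hand with the FileSystem API. As far as I know, FileUtil.copyMerge was removed in Hadoop 3.0, so a manual loop like this may be needed on newer versions anyway. The class name HdfsMerge and the method merge are made up for this example, and skipping files whose names start with "_" or "." (such as _SUCCESS) is my own assumption about what the job's output directory contains. The destination is assumed to lie outside the source directory.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper, not part of Hadoop: concatenates the files in a
// directory into one destination file, streaming through the local JVM.
public class HdfsMerge {

    public static void merge(Path srcDir, Path dstFile, Configuration conf)
            throws IOException {
        // Resolve each path against its own file system (both may be HDFS).
        FileSystem srcFs = srcDir.getFileSystem(conf);
        FileSystem dstFs = dstFile.getFileSystem(conf);

        // Sort by file name so part-00000, part-00001, ... keep their order.
        FileStatus[] parts = srcFs.listStatus(srcDir);
        Arrays.sort(parts,
                Comparator.comparing((FileStatus s) -> s.getPath().getName()));

        // Overwrite the destination if it already exists.
        try (FSDataOutputStream out = dstFs.create(dstFile, true)) {
            for (FileStatus part : parts) {
                String name = part.getPath().getName();
                // Assumption: skip subdirectories and marker/hidden files.
                if (!part.isFile() || name.startsWith("_") || name.startsWith(".")) {
                    continue;
                }
                try (FSDataInputStream in = srcFs.open(part.getPath())) {
                    // close=false: the try-with-resources blocks close the streams.
                    IOUtils.copyBytes(in, out, conf, false);
                }
            }
        }
    }
}
```

You would call it after the job finishes with something like HdfsMerge.merge(new Path(outputDir), new Path(mergedFile), job.getConfiguration()), where outputDir and mergedFile are placeholders. As noted above, every byte still passes through the client JVM, so this is not faster than -getmerge followed by -put; it only avoids the temporary local copy.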