Problem description
How can I compare two HDFS files, given that Hadoop has no diff command?
I was thinking of using Hive tables and loading data from HDFS and then using join statements on 2 tables. Is there any better approach?
Recommended answer
Hadoop does not ship a diff command, but you can feed the output of hadoop fs -cat to the ordinary diff command using process substitution in your shell (a bash feature, not available in plain sh):
diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
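If a yes/no answer is enough, the same process-substitution trick works with cmp instead of diff. The following is only a sketch: hdfs_same is a hypothetical helper name of my own, not a Hadoop command, and it assumes bash and a hadoop binary on the PATH.

```shell
#!/usr/bin/env bash
# hdfs_same is a hypothetical helper name, not part of Hadoop.
# Exit status: 0 when both byte streams are identical, 1 when they differ.
# Caveat: it does not check whether "hadoop fs -cat" itself succeeded, so two
# unreadable paths would both yield empty streams and compare as equal.
hdfs_same() {
  # cmp -s prints nothing and stops reading at the first differing byte.
  cmp -s <(hadoop fs -cat "$1") <(hadoop fs -cat "$2")
}
```

It could be used as: hdfs_same /path/to/file /path/to/file2 && echo identical || echo different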
If you only want to know whether the two files are identical, without seeing the differences, I would suggest a checksum-based approach instead: get the checksum of each file and compare the two. I think Hadoop does not need to compute the checksums on the fly because they are already stored, so this should be fast, but I may be wrong. Recent Hadoop releases expose this through the hadoop fs -checksum shell command, and you can also do it with the Java API in a small app:
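// Needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, FileChecksum, Path}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// Compare by value: == would only compare object references.
// getFileChecksum may return null on file systems without checksum support.
return chksum1 != null && chksum1.equals(chksum2);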
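The same comparison can be sketched from the shell with hadoop fs -checksum, which exists in the Hadoop releases I am aware of. The helper names hdfs_checksum_of and hdfs_checksum_equal are hypothetical. The output layout (path, algorithm, hex checksum on one line) is an assumption that you should check against your own version. As far as I know, the default HDFS checksum also depends on block size and bytes-per-checksum settings, so a mismatch does not by itself prove that the contents differ.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Hadoop.
hdfs_checksum_of() {
  # Assumed output layout: <path> <algorithm> <hex checksum>.
  # Keep only the last two fields; lines with fewer than three fields are ignored.
  hadoop fs -checksum "$1" | awk 'NF >= 3 { print $(NF-1), $NF }'
}

hdfs_checksum_equal() {
  local a b
  a=$(hdfs_checksum_of "$1")
  b=$(hdfs_checksum_of "$2")
  # An empty result (missing file, no checksum) is treated as "not equal".
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example: hdfs_checksum_equal /path/to/file /path/to/file2 && echo "checksums match"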