HDFS File Comparison

Problem description

How can I compare two HDFS files, given that Hadoop has no diff command?

I was thinking of creating Hive tables, loading the data from HDFS, and then joining the two tables. Is there a better approach?

Recommended answer

Hadoop does not provide a diff command, but you can use the regular diff with process substitution in your shell (bash or zsh), feeding it the output of hadoop fs -cat for each file:

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)

If you only want to know whether the two files are identical, without caring what the differences are, I would suggest a checksum-based approach instead: get the checksum of each file and compare them. I think Hadoop doesn't need to compute the checksums from scratch because they are already stored, so it should be fast, but I may be wrong. Recent Hadoop releases expose this on the command line as hadoop fs -checksum; with the Java API you can do the same thing in a small app:

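// Classes are from org.apache.hadoop.fs; conf is an org.apache.hadoop.conf.Configuration.
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum may return null on some file systems; compare with equals(), not ==.
return chksum1 != null && chksum1.equals(chksum2);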
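
One caveat with the snippet above: as far as I know, the default HDFS file checksum depends on how the file was written (block size and bytes per checksum), so two files with identical bytes can still report different checksums. In that case a fallback is to hash the file contents yourself. The sketch below is only an illustration under assumptions: ContentDigest and its methods are hypothetical names, not part of the Hadoop API, and the in-memory streams in main stand in for what fs.open(new Path(...)) would return on a real cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Hadoop class): compares two streams by content digest.
class ContentDigest {

    // Reads the stream to the end and returns the MD5 digest of its bytes.
    static byte[] md5(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            return md.digest();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when both streams yield the same digest, i.e. (almost certainly) the same bytes.
    static boolean sameContent(InputStream a, InputStream b) {
        return MessageDigest.isEqual(md5(a), md5(b));
    }

    public static void main(String[] args) {
        // Stand-ins for fs.open(new Path("/path/to/file")) and fs.open(new Path("/path/to/file2")).
        InputStream a = new ByteArrayInputStream("row1\nrow2\n".getBytes(StandardCharsets.UTF_8));
        InputStream b = new ByteArrayInputStream("row1\nrow2\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(sameContent(a, b));
    }
}
```

Unlike getFileChecksum, this reads every byte of both files through the client, so it is slower on large files; it trades speed for independence from how the files were stored.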
