Problem description
Are there any well-known algorithms for efficiently finding duplicates?
For example, suppose I have thousands of photos, each with a unique name, but duplicates could still exist in different sub-folders. Is using std::map or some other hash map a good idea?
Recommended answer
If you're dealing with files, one idea is to first check the files' lengths, and then generate a hash only for the files that have the same size.
Then just compare those files' hashes. If they're the same, you've got a duplicate file.
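A minimal sketch of that first pass, assuming C++17's <filesystem> is available; it uses std::unordered_map (which also answers the question about hash maps), and "photos" is just a placeholder directory name:

#include <cstdint>
#include <filesystem>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

int main() {
    // Bucket file paths by size: duplicates must have identical sizes,
    // so only buckets with two or more entries need hashing at all.
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> by_size;

    for (const auto& entry : fs::recursive_directory_iterator("photos")) {
        if (entry.is_regular_file())
            by_size[entry.file_size()].push_back(entry.path());
    }

    for (const auto& [size, paths] : by_size) {
        if (paths.size() > 1)
            std::cout << paths.size() << " candidate files of "
                      << size << " bytes\n";   // these are the ones worth hashing
    }
}

A std::map would work too, but since only exact size matches matter and no ordering is needed, a hash map avoids the ordering overhead.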
There's a trade-off here between speed and certainty: it can happen, however unlikely, that two different files have the same hash. So you can improve the solution: generate a simple, fast hash to find candidate duplicates. When the hashes differ, you have different files. When they're equal, generate a second hash. If the second hash differs, you just had a false positive. If it matches again, you most likely have a real duplicate.
In other words (a code sketch of these steps follows the list):
Get each file's size.
For each file, check whether any other file has the same size.
If there are any, generate a fast hash for them.
Compare the hashes.
If they differ, ignore.
If they're equal, generate a second hash.
Compare again.
If they differ, ignore.
If they're equal, you have two identical files.
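A sketch of those steps for one group of same-sized files, continuing the grouping shown above. The hash here is a plain FNV-1a written inline purely for illustration; a real tool would more likely use an established algorithm such as MD5 or SHA-256 for the second pass, and the paths in main are hypothetical:

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Illustrative FNV-1a hash over at most max_bytes of the file's contents.
std::uint64_t hash_file(const fs::path& p, std::uint64_t max_bytes) {
    std::ifstream in(p, std::ios::binary);
    std::uint64_t h = 1469598103934665603ull;   // FNV offset basis
    char buf[4096];
    std::uint64_t total = 0;
    while (in && total < max_bytes) {
        in.read(buf, sizeof buf);
        for (std::streamsize i = 0; i < in.gcount() && total < max_bytes; ++i, ++total) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;              // FNV prime
        }
    }
    return h;
}

// Report probable duplicates among files that already share the same size.
void check_group(const std::vector<fs::path>& same_size) {
    // Fast hash: only the first 64 KiB of each file.
    std::unordered_map<std::uint64_t, std::vector<fs::path>> by_fast;
    for (const auto& p : same_size)
        by_fast[hash_file(p, 64 * 1024)].push_back(p);

    for (const auto& [fast, group] : by_fast) {
        if (group.size() < 2) continue;         // fast hashes differ: not duplicates
        // Second hash: the whole file, to weed out false positives.
        std::unordered_map<std::uint64_t, std::vector<fs::path>> by_full;
        for (const auto& p : group)
            by_full[hash_file(p, UINT64_MAX)].push_back(p);
        for (const auto& [full, dups] : by_full)
            if (dups.size() > 1) {
                std::cout << "Probable duplicates:\n";
                for (const auto& p : dups) std::cout << "  " << p << '\n';
            }
    }
}

int main() {
    // Hypothetical same-sized files, e.g. taken from one bucket of the size map above.
    check_group({"photos/a/img001.jpg", "photos/b/img001.jpg"});
}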
Hashing every file up front would take too much time and would be wasted work if most of your files are different.