Problem description
Are there any well-known algorithms for efficiently finding duplicates?
For example, suppose I have thousands of photos, each with a unique name, but duplicates could still exist in different sub-folders. Is using std::map or some other hash map a good idea?
Recommended answer
If you're dealing with files, one idea is to first check the files' lengths, and then generate a hash only for the files that have the same size.
Then just compare those files' hashes. If they're the same, you've got a duplicate file.
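A minimal sketch of that first pass, assuming C++17's <filesystem> is available; it uses std::unordered_map (which also answers the question about hash maps), and "photos" is just a placeholder directory name:

#include <cstdint>
#include <filesystem>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

int main() {
    // Bucket file paths by size: duplicates must have identical sizes,
    // so only buckets with two or more entries need hashing at all.
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> by_size;

    for (const auto& entry : fs::recursive_directory_iterator("photos")) {
        if (entry.is_regular_file())
            by_size[entry.file_size()].push_back(entry.path());
    }

    for (const auto& [size, paths] : by_size) {
        if (paths.size() > 1)
            std::cout << paths.size() << " candidate files of "
                      << size << " bytes\n";   // these are the ones worth hashing
    }
}

A std::map would work too, but since only exact size matches matter and no ordering is needed, a hash map avoids the ordering overhead.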
There's a trade-off here between speed and certainty: it can happen, however unlikely, that two different files have the same hash. So you can improve the solution: generate a simple, fast hash to find candidate duplicates. When the hashes differ, you have different files. When they're equal, generate a second hash. If the second hash differs, you just had a false positive. If it matches again, you most likely have a real duplicate.
In other words (a code sketch of these steps follows the list):
Get each file's size.
For each file, check whether any other file has the same size.
If there are any, generate a fast hash for them.
Compare the hashes.
If they differ, ignore.
If they're equal, generate a second hash.
Compare again.
If they differ, ignore.
If they're equal, you have two identical files.
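A sketch of those steps for one group of same-sized files, continuing the grouping shown above. The hash here is a plain FNV-1a written inline purely for illustration; a real tool would more likely use an established algorithm such as MD5 or SHA-256 for the second pass, and the paths in main are hypothetical:

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Illustrative FNV-1a hash over at most max_bytes of the file's contents.
std::uint64_t hash_file(const fs::path& p, std::uint64_t max_bytes) {
    std::ifstream in(p, std::ios::binary);
    std::uint64_t h = 1469598103934665603ull;   // FNV offset basis
    char buf[4096];
    std::uint64_t total = 0;
    while (in && total < max_bytes) {
        in.read(buf, sizeof buf);
        for (std::streamsize i = 0; i < in.gcount() && total < max_bytes; ++i, ++total) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;              // FNV prime
        }
    }
    return h;
}

// Report probable duplicates among files that already share the same size.
void check_group(const std::vector<fs::path>& same_size) {
    // Fast hash: only the first 64 KiB of each file.
    std::unordered_map<std::uint64_t, std::vector<fs::path>> by_fast;
    for (const auto& p : same_size)
        by_fast[hash_file(p, 64 * 1024)].push_back(p);

    for (const auto& [fast, group] : by_fast) {
        if (group.size() < 2) continue;         // fast hashes differ: not duplicates
        // Second hash: the whole file, to weed out false positives.
        std::unordered_map<std::uint64_t, std::vector<fs::path>> by_full;
        for (const auto& p : group)
            by_full[hash_file(p, UINT64_MAX)].push_back(p);
        for (const auto& [full, dups] : by_full)
            if (dups.size() > 1) {
                std::cout << "Probable duplicates:\n";
                for (const auto& p : dups) std::cout << "  " << p << '\n';
            }
    }
}

int main() {
    // Hypothetical same-sized files, e.g. taken from one bucket of the size map above.
    check_group({"photos/a/img001.jpg", "photos/b/img001.jpg"});
}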
Hashing every file up front would take too much time and would be wasted work if most of your files are different.