Question
I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums. The whole process is acceptably fast (14 s for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(@filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Answer
Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. (It is valid for some other thoughts, however.)
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
- Do not read the entire file into memory for the calculation; it is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and compute the hash in chunks.
- The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; find the optimal value by experimenting). Use the TransformBlock and TransformFinalBlock functions to calculate the hash value. You can pass null for the outputBuffer parameter.
- Reuse that buffer for the hash calculations of subsequent files, so no additional allocations are needed.
- Additionally, you can reuse the MD5CryptoServiceProvider, but the benefits are questionable.
- Finally, you can apply an async pattern for reading chunks from the stream, so the OS reads the next chunk from disk at the same time you are calculating the partial hash of the previous chunk. Such code is harder to write, and you will need at least two buffers (reuse them as well), but it can have a great impact on speed.
- As a minor improvement, do not check for file existence. Your function is presumably called from some enumeration, and there is very little chance that the file was deleted in the meantime.
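The first few points above (streaming in chunks, a large reused buffer, TransformBlock/TransformFinalBlock with a null outputBuffer) can be sketched together as follows. The class and method names are mine, and the 256 KB buffer size is just the suggested starting point, to be tuned by measurement; note the shared static buffer means this sketch is not thread-safe.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class StreamedMd5
{
    // Reused across all files, so it is allocated exactly once.
    // Not thread-safe: one shared buffer serves a single-threaded scan.
    private const int BufferSize = 256 * 1024;
    private static readonly byte[] buffer = new byte[BufferSize];

    public static string GetMd5(string filename)
    {
        using (var md5 = new MD5CryptoServiceProvider())
        using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, BufferSize))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // outputBuffer may be null: we only need the running hash state.
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            md5.TransformFinalBlock(buffer, 0, 0); // finalize with an empty block

            var sb = new StringBuilder(32);
            foreach (var b in md5.Hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }
}
```

The FileStream buffer size is passed explicitly so the reads from disk match the hashing chunk size instead of the small default.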
All of the above applies to medium-to-large files. If, instead, you have a lot of very small files, you can speed up the calculation by processing files in parallel. Parallelization can actually also help with large files, but that needs to be measured.
And lastly, if collisions do not bother you too much, you can choose a less expensive hash algorithm, CRC for example.
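The .NET Framework BCL does not ship a CRC-32 implementation, so as an illustration here is a minimal table-driven CRC-32 (the common IEEE/zlib polynomial, reflected form) that works on a stream with a caller-supplied reusable buffer; the class name is mine:

```csharp
using System;
using System.IO;

static class Crc32
{
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        // Standard reflected polynomial 0xEDB88320 (CRC-32/ISO-HDLC, as in zlib).
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    public static uint Compute(Stream stream, byte[] buffer)
    {
        uint crc = 0xFFFFFFFFu;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            for (int i = 0; i < read; i++)
                crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu; // final XOR per the CRC-32 definition
    }
}
```

Keep in mind that CRC-32 only detects accidental changes; unlike MD5, deliberate collisions are trivial to construct, which is fine for a changed-file scan but not for anything security related.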