Lazily reading a file in D

Question

I'm writing a directory tree scanning function in D that tries to combine tools such as grep and file. It should grep for things in a file only if the file does not match a set of magic bytes indicating file types such as ELF, images, etc.

What is the best approach to making such exclusion logic run as fast as possible with regard to minimizing file I/O? I typically don't want to read in the whole file if I only need some magic bytes at the beginning. However, to keep the code general for the future (some magics may lie at the end of the file, or somewhere other than the beginning), it would be nice if I could use an mmap-like interface that lazily fetches data from disk only when it is read. The array interface also simplifies my algorithms.

Is D's std.mmfile the best option in this case?

Update: According to this post, I guess mmap is advised: http://forum.dlang.org/thread/[email protected]

If I only need read access as an array (opIndex), are there any cons to using std.mmfile over std.stdio.File or std.file?

Recommended answer

If you want to lazily read a file with Phobos, you pretty much have three options:

  1. Use std.stdio.File's byLine and read a line at a time.

  2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.

  3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap under the hood to avoid reading in the whole file.
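To make the byChunk option concrete, here is a minimal sketch that reads only the first few bytes of a file and compares them against the ELF magic. The helper name startsWithElfMagic is hypothetical and is not part of Phobos; only File and byChunk come from std.stdio.

```d
import std.stdio : File;

/// Hypothetical helper (not a Phobos function): returns true if the file
/// at `path` starts with the four-byte ELF magic.
bool startsWithElfMagic(string path)
{
    // 0x7F followed by the ASCII codes for 'E', 'L', 'F'.
    static immutable ubyte[4] elfMagic = [0x7F, 0x45, 0x4C, 0x46];

    auto f = File(path, "rb");

    // byChunk yields ubyte[] slices of at most the requested size.
    // We return from inside the loop, so only the first chunk is read.
    foreach (ubyte[] chunk; f.byChunk(elfMagic.length))
    {
        return chunk.length == elfMagic.length && chunk[] == elfMagic[];
    }

    return false; // the file was empty, so there was no chunk at all
}
```

This works well when the magic is at the start of the file. For magics at the end or in the middle you would need to seek first, which is where the array-like interface of option 3 becomes more convenient.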

I fully expect that #3 is going to be the fastest (profiling could prove otherwise, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct, so that it would clean itself up when you're done with it. Right now, if you don't want to wait for the GC to clean it up, you have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally be a downside to using MmFile as well), but I'm not aware of any.
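As a rough sketch of the MmFile approach with deterministic cleanup, something like the following should work. The helper name hasMagicPrefix is hypothetical, and the early size checks are my own assumption: they are there so that an empty file is never mapped, since a zero-length mapping can fail on some platforms.

```d
import std.file : getSize;
import std.mmfile : MmFile;

/// Hypothetical helper (not a Phobos function): returns true if the file
/// at `path` begins with the bytes in `magic`.
bool hasMagicPrefix(string path, const(ubyte)[] magic)
{
    // An empty magic trivially matches; this also means that any file we
    // map below has at least one byte.
    if (magic.length == 0)
        return true;

    // A file shorter than the magic cannot match.
    if (getSize(path) < magic.length)
        return false;

    // The single-argument constructor maps the whole file read-only.
    // Pages are only fetched from disk when they are actually touched.
    auto mmf = new MmFile(path);

    // MmFile is a class, so destroy it explicitly at scope exit rather
    // than waiting for the GC to unmap the file.
    scope (exit) destroy(mmf);

    // Slicing returns void[]; cast it to bytes for the comparison.
    auto bytes = cast(const(ubyte)[]) mmf[0 .. magic.length];
    return bytes == magic;
}
```

Because the comparison result is computed before the scope (exit) block runs, nothing refers to the mapped memory after it has been destroyed. The same care is needed in real code: do not let slices of the mapping escape past the destroy call.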

In the future, we're going to end up with some range-based streaming I/O, which might be closer to what you need without actually using mmap. That hasn't been completed yet, though, and mmap is so incredibly cool that there's a good chance it would still be better to use MmFile.

