Question
I'm writing an algorithm in C++ that scans a file with a "sliding window," meaning it will scan bytes 0 to n, do something, then scan bytes 1 to n+1, do something, and so forth, until the end is reached.
My first algorithm was to read the first n bytes, do something, dump one byte, read a new byte, and repeat. This was very slow, because calling ReadFile() to fetch one byte at a time from the HDD is inefficient (about 100 kB/s).
My second algorithm involves reading a chunk of the file (perhaps n*1000 bytes, meaning the whole file if it's not too large) into a buffer and reading individual bytes off the buffer. Now I get about 10MB/s (decent SSD + Core i5, 1.6GHz laptop).
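For reference, the chunk-buffered approach can be sketched in portable C++ along these lines (`CountWindows` and the byte-sum "work" are hypothetical stand-ins for the real per-window computation):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Count how many n-byte windows in the file have a byte sum equal to
// `target` (a stand-in for the real per-window work). The file is read
// in large chunks into one buffer, and the overlapping windows are then
// scanned entirely in memory, so no per-byte I/O calls are ever made.
size_t CountWindows(FILE* f, size_t n, uint32_t target)
{
    std::vector<uint8_t> data;
    uint8_t chunk[4096];
    size_t got;
    while ((got = fread(chunk, 1, sizeof(chunk), f)) > 0)
        data.insert(data.end(), chunk, chunk + got);   // buffer the file
    if (data.size() < n)
        return 0;

    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)     // sum of the first window
        sum += data[i];

    size_t hits = 0;
    for (size_t i = 0; ; ++i) {
        if (sum == target)
            ++hits;
        if (i + n == data.size())
            break;
        sum += data[i + n] - data[i];  // slide the window by one byte
    }
    return hits;
}
```

Note that when the per-window work is incremental (like the sum here), sliding updates it in O(1) instead of rescanning all n bytes.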
My question: do you have suggestions for an even faster approach?
edit: My big buffer (relative to the window size) is implemented as follows:
- for a rolling window of 5kB, the buffer is initialized to 5MB
- read the first 5MB of the file into the buffer
- the window pointer starts at the beginning of the buffer
- upon shifting, the window pointer is incremented
- when the window pointer nears the end of the 5MB buffer (say at 4.99MB), copy the remaining 0.01MB to the beginning of the buffer, reset the window pointer to the beginning, and read an additional 4.99MB into the buffer
- repeat
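A minimal sketch of that refill scheme, using portable stdio in place of ReadFile() (the helper name and parameters are hypothetical):

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <vector>

// Keep an n-byte window valid at index `pos` inside `buf`, where `valid`
// counts the buffered bytes. When fewer than n bytes remain past `pos`,
// move the unread tail to the front of the buffer and top it back up
// from the file. Returns false once fewer than n bytes are left in total.
bool EnsureWindow(FILE* f, std::vector<unsigned char>& buf,
                  size_t& valid, size_t& pos, size_t n)
{
    if (pos + n <= valid)
        return true;                    // window already buffered
    size_t tail = valid - pos;          // bytes not yet consumed
    std::memmove(buf.data(), buf.data() + pos, tail);   // recycle the tail
    pos = 0;
    valid = tail + std::fread(buf.data() + tail, 1,
                              buf.size() - tail, f);    // refill the rest
    return pos + n <= valid;
}
```

A caller would then loop `while (EnsureWindow(f, buf, valid, pos, n)) { /* use buf.data() + pos */ ++pos; }`, which touches the file only once per buffer's worth of windows.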
edit 2 - the actual implementation (removed)
Thank you all for the many insightful responses. It was hard to select a "best answer"; they were all excellent and helped with my coding.
Answer
I use a sliding window in one of my apps (actually, several layers of sliding windows working on top of each other, but that is outside the scope of this discussion). The window uses a memory-mapped file view via CreateFileMapping()
and MapViewOfFile()
, then I have an abstraction layer on top of that. I ask the abstraction layer for any range of bytes I need, and it ensures that the file mapping and file view are adjusted accordingly so those bytes are in memory. Every time a new range of bytes is requested, the file view is adjusted only if needed.
The file view is positioned and sized on page boundaries that are even multiples of the system granularity as reported by GetSystemInfo()
. Just because a scan reaches the end of a given byte range does not necessarily mean it has reached the end of a page boundary yet, so the next scan may not need to alter the file view at all, the next bytes are already in memory. If the first requested byte of a range exceeds the right-hand boundary of a mapped page, the left edge of the file view is adjusted to the left-hand boundary of the requested page and any pages to the left are unmapped. If the last requested byte in the range exceeds the right-hand boundary of the right-most mapped page, a new page is mapped and added to the file view.
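The boundary rules above reduce to two rounding operations. Here is a sketch with hard-coded constants (64 KiB allocation granularity and 4 KiB pages are the typical x86 Windows values; the real code queries them with GetSystemInfo()):

```cpp
#include <cassert>
#include <cstdint>

// Typical x86 Windows values; a real implementation reads these from
// SYSTEM_INFO::dwAllocationGranularity and dwPageSize.
const uint64_t kAllocGran = 64 * 1024;
const uint64_t kPageSize  = 4 * 1024;

// The view offset is rounded DOWN to an allocation-granularity multiple
// (a MapViewOfFile() requirement)...
uint64_t ViewOffset(uint64_t offset)
{
    return (offset / kAllocGran) * kAllocGran;
}

// ...and the view size is rounded UP to a page multiple, after adding
// the slack between the rounded-down view start and the requested byte.
uint64_t ViewSize(uint64_t offset, uint64_t size)
{
    uint64_t slack = offset - ViewOffset(offset);
    return ((slack + size) + (kPageSize - 1)) & ~(kPageSize - 1);
}
```

Because of the round-up, a request that lands inside an already-mapped page needs no remapping at all, which is exactly why sequential scans rarely touch the view.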
It sounds more complex to implement than it really is, once you get into the coding of it:
It sounds like you are scanning bytes in fixed-size blocks, so this approach is very fast and very efficient for that. Based on this technique, I can sequentially scan multi-gigabyte files from start to end fairly quickly, usually in a minute or less on my slowest machine. If your files are smaller than the system granularity, or even just a few megabytes, you will hardly notice any time elapsed at all (unless your scans themselves are slow).
Update: here is a simplified variation of what I use:
class FileView
{
private:
DWORD m_AllocGran;
DWORD m_PageSize;
HANDLE m_File;
unsigned __int64 m_FileSize;
HANDLE m_Map;
unsigned __int64 m_MapSize;
LPBYTE m_View;
unsigned __int64 m_ViewOffset;
DWORD m_ViewSize;
void CloseMap()
{
CloseView();
if (m_Map != NULL)
{
CloseHandle(m_Map);
m_Map = NULL;
}
m_MapSize = 0;
}
void CloseView()
{
if (m_View != NULL)
{
UnmapViewOfFile(m_View);
m_View = NULL;
}
m_ViewOffset = 0;
m_ViewSize = 0;
}
bool EnsureMap(unsigned __int64 Size)
{
// do not exceed EOF or else the file on disk will grow!
Size = min(Size, m_FileSize);
if ((m_Map == NULL) ||
(m_MapSize != Size))
{
// a new map is needed...
CloseMap();
ULARGE_INTEGER ul;
ul.QuadPart = Size;
m_Map = CreateFileMapping(m_File, NULL, PAGE_READONLY, ul.HighPart, ul.LowPart, NULL);
if (m_Map == NULL)
return false;
m_MapSize = Size;
}
return true;
}
bool EnsureView(unsigned __int64 Offset, DWORD Size)
{
if ((m_View == NULL) ||
(Offset < m_ViewOffset) ||
((Offset + Size) > (m_ViewOffset + m_ViewSize)))
{
// the requested range is not already in view...
// round down the offset to the nearest allocation boundary
unsigned __int64 ulNewOffset = ((Offset / m_AllocGran) * m_AllocGran);
// round up the size to the next page boundary
DWORD dwNewSize = ((((Offset - ulNewOffset) + Size) + (m_PageSize-1)) & ~(m_PageSize-1));
// if the new view will exceed EOF, truncate it
unsigned __int64 ulOffsetInFile = (ulNewOffset + dwNewSize);
if (ulOffsetInFile > m_FileSize)
dwNewSize -= (ulOffsetInFile - m_FileSize);
if ((m_View == NULL) ||
(m_ViewOffset != ulNewOffset) ||
(m_ViewSize != dwNewSize))
{
// a new view is needed...
CloseView();
// make sure the memory map is large enough to contain the entire view
if (!EnsureMap(ulNewOffset + dwNewSize))
return false;
ULARGE_INTEGER ul;
ul.QuadPart = ulNewOffset;
m_View = (LPBYTE) MapViewOfFile(m_Map, FILE_MAP_READ, ul.HighPart, ul.LowPart, dwNewSize);
if (m_View == NULL)
return false;
m_ViewOffset = ulNewOffset;
m_ViewSize = dwNewSize;
}
}
return true;
}
public:
FileView() :
m_AllocGran(0),
m_PageSize(0),
m_File(INVALID_HANDLE_VALUE),
m_FileSize(0),
m_Map(NULL),
m_MapSize(0),
m_View(NULL),
m_ViewOffset(0),
m_ViewSize(0)
{
// map views need to be positioned on even multiples
// of the system allocation granularity. let's size
// them on even multiples of the system page size...
SYSTEM_INFO si = {0};
GetSystemInfo(&si); // returns void, cannot fail
m_AllocGran = si.dwAllocationGranularity;
m_PageSize = si.dwPageSize;
}
~FileView()
{
CloseFile();
}
bool OpenFile(LPTSTR FileName)
{
CloseFile();
if ((m_AllocGran == 0) || (m_PageSize == 0))
return false;
HANDLE hFile = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
if (hFile == INVALID_HANDLE_VALUE)
return false;
ULARGE_INTEGER ul;
ul.LowPart = GetFileSize(hFile, &ul.HighPart);
if ((ul.LowPart == INVALID_FILE_SIZE) && (GetLastError() != 0))
{
CloseHandle(hFile);
return false;
}
m_File = hFile;
m_FileSize = ul.QuadPart;
return true;
}
void CloseFile()
{
CloseMap();
if (m_File != INVALID_HANDLE_VALUE)
{
CloseHandle(m_File);
m_File = INVALID_HANDLE_VALUE;
}
m_FileSize = 0;
}
unsigned __int64 FileSize() const
{
return m_FileSize;
}
bool AccessBytes(unsigned __int64 Offset, DWORD Size, LPBYTE *Bytes, DWORD *Available)
{
if (Bytes) *Bytes = NULL;
if (Available) *Available = 0;
if ((m_FileSize != 0) && (Offset < m_FileSize))
{
// make sure the requested range is in view
if (!EnsureView(Offset, Size))
return false;
// near EOF, the available bytes may be less than requested
DWORD dwOffsetInView = (Offset - m_ViewOffset);
if (Bytes) *Bytes = &m_View[dwOffsetInView];
if (Available) *Available = min(m_ViewSize - dwOffsetInView, Size);
}
return true;
}
};
FileView fv;
if (fv.OpenFile(TEXT("C:\\path\\file.ext")))
{
LPBYTE data;
DWORD len;
unsigned __int64 offset = 0, filesize = fv.FileSize();
while (offset < filesize)
{
if (!fv.AccessBytes(offset, some size here, &data, &len))
break; // error
if (len == 0)
break; // unexpected EOF
// use data up to len bytes as needed...
offset += len;
}
fv.CloseFile();
}
这code被设计为允许随机跳跃在任何大小的数据文件中的任何地方。既然你是按顺序读取的字节,一些逻辑可以根据需要简化。
This code is designed to allow random jumping anywhere in the file at any data size. Since you are reading bytes sequentially, some of the logic can be simplified as needed.