Problem description
I'm processing huge TIFF images (grayscale, 8 or 16 bit, up to 4 GB) to be used as high resolution input data for a machine. Each image needs to be rotated by 90 degrees (clockwise). The input TIFF can be LZW or uncompressed, the output may be uncompressed.
So far I have implemented my own TIFF reader class in Objective C (including LZW decompression) which is able to handle huge files and does some caching in memory as well. At the moment the TIFF reader class is used for visualization and measurement inside the image, and it performs quite well.
For my latest challenge, rotating a TIFF, I need a new approach, because the current implementation is VERY slow. Even for a "medium" sized TIFF (30,000 x 4,000) it takes approx. 30 minutes to rotate the image. At the moment I loop through all pixels and pick the one with reversed x and y coordinates, put all of them into a buffer and write the buffer to disk as soon as one line is complete. The main problem is reading from the TIFF, since data is organized in strips and not guaranteed to be linearly distributed inside the file (and in the case of LZW compressed strips, nothing is linear either).
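In rough pseudo code, the current loop looks like the sketch below (TiffReader / ReadPixel() are just stand-ins for my reader class; ReadPixel() hides the strip lookup and, for LZW files, the decompression, and I'm ignoring the mirroring that distinguishes a clockwise rotate from a plain transpose):

#include <cstdio>
#include <vector>

void RotateNaive(TiffReader &src, std::FILE *dst, int srcWidth, int srcHeight, int bytesPerPixel)
{
    std::vector<unsigned char> destRow((size_t)srcHeight * bytesPerPixel);
    for (int x = 0; x < srcWidth; x++) {              // one destination row per source column
        for (int y = 0; y < srcHeight; y++) {
            // every single read may land in a different strip somewhere else in the file
            src.ReadPixel(x, y, &destRow[(size_t)y * bytesPerPixel]);
        }
        std::fwrite(destRow.data(), 1, destRow.size(), dst);   // flush the finished row
    }
}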
I profiled my software and found out that most of the time is spent in copying memory blocks (memmove) and decided to bypass the caching inside my reader class for the rotation. Now the whole process is about 5% faster, which isn't too much, and all of the time is now spent inside fread(). I assume that at least my cache performs almost as well as the system's fread() cache.
Another test using Image Magick with the same 30,000 x 4,000 file took only around 10 seconds to complete. AFAIK Image Magick reads the whole file into memory, processes it in memory and then writes back to disk. This works well up to a few hundred megabytes of image data.
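(If anyone wants to reproduce that comparison, a plain command line rotate along the lines of the following is enough; the file names here are placeholders.)

convert input.tif -rotate 90 output.tif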
What I'm looking for is some kind of "meta optimization", like another approach for handling the pixel data. Is there another strategy than swapping pixels one by one (and needing to read from file locations far away from each other)? Should I create some intermediate file to speed up the process? Any suggestions welcome.
OK, given that you have to do pixel munging, let's look at your overall problem. A medium image that is 30000x4000 pixels is 120M of image data for 8 bit gray and 240M of image data for 16 bit. So if you're looking at the data this way, you need to ask "is 30 minutes reasonable?" In order to do a 90 degree rotate, you are inducing a worst-case problem, memory-wise. You are touching every pixel in a single column in order to fill one row. If you work row-wise, at least you're not going to double the memory footprint.
So - 120M of pixels means that you're doing 120M reads and 120M writes, or 240M data accesses. This means that you are processing roughly 66,667 pixels per second, which I think is too slow. I think you should be processing at least half a million pixels per second, probably way more.
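Spelled out, the arithmetic behind those numbers is just:

// Back-of-the-envelope numbers for the 30,000 x 4,000 case:
const long long kPixels          = 30000LL * 4000;          // 120,000,000 pixels
const long long kBytes8Bit       = kPixels * 1;             // ~120 MB of 8-bit gray data
const long long kBytes16Bit      = kPixels * 2;             // ~240 MB of 16-bit gray data
const double    kPixelsPerSecond = kPixels / (30.0 * 60.0); // 30 minutes -> ~66,667 pixels/second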
If this were me, I'd run my profiling tools and see where the bottlenecks are and cut them out.
Without knowing your exact structure and having to guess, I would do the following:
Attempt to use one contiguous block of memory for the source image
I'd prefer to see a rotate function like this:
void RotateColumn(int column, char *sourceImage, int bytesPerRow, int bytesPerPixel, int height, char *destRow)
{
    // Walk down one source column and emit it as one contiguous destination row.
    char *src = sourceImage + (bytesPerPixel * column);
    if (bytesPerPixel == 1) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;
            src += bytesPerRow;          // step to the same column in the next source row
        }
    }
    else if (bytesPerPixel == 2) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;           // copy both bytes of the 16-bit pixel
            *destRow++ = *(src + 1);
            src += bytesPerRow;
            // although I doubt it would be faster, you could try this:
            // *destRow++ = *src++;
            // *destRow++ = *src;
            // src += bytesPerRow - 1;
        }
    }
    else { /* error out */ }
}
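A driver for that function could then be as simple as the sketch below; RotateImage(), the FILE* output, and the use of std::vector are illustrative assumptions on my part. The point is only that each source column lands in the destination as one sequentially written row:

#include <cstdio>
#include <vector>

void RotateImage(char *sourceImage, int width, int height, int bytesPerPixel, std::FILE *dst)
{
    int bytesPerRow = width * bytesPerPixel;
    std::vector<char> destRow((size_t)height * bytesPerPixel);   // one dest row = one source column
    for (int column = 0; column < width; column++) {
        RotateColumn(column, sourceImage, bytesPerRow, bytesPerPixel, height, destRow.data());
        // Depending on the rotation direction you'd also reverse either the column order here
        // or the pixel order inside destRow; the copy pattern stays the same.
        std::fwrite(destRow.data(), 1, destRow.size(), dst);
    }
}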
I'm guessing that the inside of the loop will turn into maybe 8 instructions. On a 2GHz processor (let's say nominally 4 cycles per instruction, which is just a guess), that's around 32 cycles per pixel, so you should be able to rotate on the order of 62.5 million pixels in a second. Roughly.
If you can't do contiguous, work on multiple dest scanlines at once.
If the source image is broken into blocks or you have a scanline abstraction of memory, what you do is get a scanline from the source image and rotate, say, a few dozen columns at once into a buffer of dest scanlines.
Let's assume that you have a mechanism for accessing scanlines abstractly, wherein you can acquire and release and write to scanlines.
Then what you're going to do is figure out how many source columns you're willing to process at once, because your code will look something like this:
void RotateNColumns(Pixels &source, Pixels &dest, int startColumn, int nCols)
{
    // Acquire one destination row for each source column in this band.
    std::vector<PixelRow *> rows(nCols);
    for (int i = 0; i < nCols; i++)
        rows[i] = &dest.AcquireRow(i + startColumn);

    // Stream through the source once, scattering each source row into the band of dest rows.
    for (int y = 0; y < source.Height(); y++) {
        PixelRow &srcRow = source.AcquireRow(y);
        for (int i = 0; i < nCols; i++) {
            // CopyPixel(int srcX, PixelRow &destRow, int dstX, int nPixels);
            srcRow.CopyPixel(startColumn + i, *rows[i], y, 1);
        }
        source.ReleaseRow(srcRow);
    }

    for (int i = 0; i < nCols; i++)
        dest.ReleaseAndWrite(*rows[i]);
}
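The outer loop that sweeps the whole image in bands of columns could then look roughly like this (RotateInBands, kColsPerPass and sourceWidth are illustrative; tune the band width against your memory budget):

void RotateInBands(Pixels &source, Pixels &dest, int sourceWidth)
{
    const int kColsPerPass = 64;   // how many destination rows to keep in memory at once
    for (int start = 0; start < sourceWidth; start += kColsPerPass) {
        int n = sourceWidth - start;
        if (n > kColsPerPass) n = kColsPerPass;      // the last band may be narrower
        RotateNColumns(source, dest, start, n);
    }
}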
In this case, if you buffer up your source pixels in large-ish blocks of scanlines, you're not necessarily fragmenting your heap and you have the choice of possibly flushing decoded rows out to disk. You process n columns at a time and your memory locality should improve by a factor of n. Then it becomes a question of how expensive your caching is.
Can the problem be solved with parallel processing?
Honestly, I think your problem should be IO bound, not CPU bound. I'd think that your decoding time will dominate, but let's pretend it isn't, for grins.
Think about it this way - if you read the source image a whole row at a time, you could toss that decoded row to a thread that will write it into the appropriate column of the destination image. So write your decoder so that it has a method like OnRowDecoded(byte *row, int y, int width, int bytesPerPixel). Then you're rotating while you're decoding: OnRowDecoded() packs up the information and hands it to a thread that owns the dest image and writes the entire decoded row into the correct dest column. That thread does all the writing to the dest while the main thread is busy decoding the next row. Likely the worker thread will finish first, but maybe not.
You will need to make your SetPixel() calls to the dest image thread safe, but other than that, there's no reason this should be a serial task. In fact, if your source images use the TIFF feature of being divided up into strips or tiles, you can and should be decoding them in parallel.
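To make the shape of that concrete, here is a minimal sketch of such a decode/write pipeline. Everything in it (DecodedRow, the queue, DestImage, the exact parameter types of OnRowDecoded()) is an illustrative assumption layered on the idea above, not a reference implementation:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct DecodedRow {
    std::vector<unsigned char> pixels;   // one fully decoded source row
    int y;                               // its row index in the source image
};

struct DestImage {
    void SetPixel(int x, int y, const unsigned char *pixel);   // must be thread safe, per the note above
};

static std::queue<DecodedRow> writeQueue;
static std::mutex queueMutex;
static std::condition_variable queueCv;
static bool decodingDone = false;   // decoder sets this (under the mutex) and notifies when it's finished

// Called on the decoding thread each time a source row is ready.
void OnRowDecoded(const unsigned char *row, int y, int width, int bytesPerPixel)
{
    DecodedRow r{std::vector<unsigned char>(row, row + width * bytesPerPixel), y};
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        writeQueue.push(std::move(r));
    }
    queueCv.notify_one();
}

// Runs on the writer thread; turns each decoded source row into a destination column.
void WriterThread(DestImage &dest, int width, int bytesPerPixel)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !writeQueue.empty() || decodingDone; });
        if (writeQueue.empty())
            return;                                   // decoder finished and the queue is drained
        DecodedRow r = std::move(writeQueue.front());
        writeQueue.pop();
        lock.unlock();
        for (int x = 0; x < width; x++)               // source row y becomes destination column y
            dest.SetPixel(r.y, x, &r.pixels[(size_t)x * bytesPerPixel]);
    }
}

The decoder just calls OnRowDecoded() for every row it finishes, then sets decodingDone and notifies once more; a single std::thread running WriterThread() does all the writes into the destination while the decoder keeps going.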