Problem description
I have a component that converts PDF documents to images, one image per page. Since the component uses converters producing in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.
I'm trying to improve the overall performance of the conversion process, and I found a native library with a JNI binding that converts PDFs to TIFFs. The library can only convert a PDF to a single TIFF file (it requires intermediate file system storage and does not even consume streams), so the resulting TIFF file has all converted pages embedded instead of per-page images on the file system. The native library makes the conversion itself drastically faster, but there is a real bottleneck: since I need a source-page to destination-page conversion, I now have to extract every page from the result file and write each of them elsewhere. A simple and naive approach with RenderedImages:
final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
// V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...
What I would really like is to extract a concrete page input stream from the TIFF container file (tempFile) and redirect it elsewhere without storing it as an in-memory image. I imagine an approach similar to container processing, where I seek to a specific entry and extract its data (something like ZIP file processing). But I couldn't find anything like that in ImageDecoder, or perhaps my expectations are wrong and I'm missing something important here...
Is it possible to extract TIFF container page input streams using the JAI API or third-party alternatives? Thanks in advance.
Recommended answer
I could be wrong, but I don't think JAI supports splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs was contributed by a third party).
By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF into multiple single-page TIFFs like this:
TIFFUtilities.split(tempFile, new File("output"));
No decoding of the images is done. Each IFD (image file directory, one per page) is split into a separate file, and the image data streams are written out with corrected offsets and byte counts.
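As a rough illustration of the structure involved (not the library's actual code), here is a minimal sketch that walks the IFD chain of a classic TIFF file and collects the byte offset of each page's IFD without touching any pixel data. The class name TiffIfdWalker and the method ifdOffsets are hypothetical names chosen for this example; the header layout (byte-order mark, magic number 42, first IFD offset) and the IFD layout (2-byte entry count, 12-byte entries, 4-byte next-IFD offset) follow the TIFF 6.0 format. BigTIFF is assumed to be out of scope.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the offset of every IFD (one per page) in a classic TIFF. */
final class TiffIfdWalker {

    static List<Long> ifdOffsets(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] header = new byte[8];
            file.readFully(header);

            // Bytes 0-1: "II" means little-endian, "MM" means big-endian.
            ByteOrder order;
            if (header[0] == 'I' && header[1] == 'I') {
                order = ByteOrder.LITTLE_ENDIAN;
            } else if (header[0] == 'M' && header[1] == 'M') {
                order = ByteOrder.BIG_ENDIAN;
            } else {
                throw new IOException("Not a TIFF file: bad byte-order mark");
            }

            ByteBuffer buf = ByteBuffer.wrap(header).order(order);
            // Bytes 2-3: magic number 42 (BigTIFF uses 43 and is not handled here).
            if (buf.getShort(2) != 42) {
                throw new IOException("Not a classic TIFF file");
            }

            List<Long> offsets = new ArrayList<>();
            // Bytes 4-7: unsigned offset of the first IFD.
            long next = buf.getInt(4) & 0xFFFFFFFFL;
            while (next != 0) {
                if (offsets.contains(next)) {
                    throw new IOException("Corrupt TIFF: IFD chain loops");
                }
                offsets.add(next);

                // Each IFD starts with a 2-byte count of 12-byte entries.
                file.seek(next);
                byte[] countBytes = new byte[2];
                file.readFully(countBytes);
                int entryCount = ByteBuffer.wrap(countBytes).order(order).getShort() & 0xFFFF;

                // Skip the entries; the 4 bytes after them point to the next IFD (0 = last page).
                file.seek(next + 2 + 12L * entryCount);
                byte[] nextBytes = new byte[4];
                file.readFully(nextBytes);
                next = ByteBuffer.wrap(nextBytes).order(order).getInt() & 0xFFFFFFFFL;
            }
            return offsets;
        }
    }
}
```

The size of the returned list is the page count. A real splitter additionally has to copy each page's strip or tile data and rewrite the offsets stored in the IFD entries, which this sketch does not attempt.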
Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.
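If all you need is different file names, an alternative to modifying the library code is to rename the files after the split. The following is a minimal sketch using only java.nio; the class SplitPageRenamer, the method renamePages and the "prefix-page-N.tif" naming scheme are assumptions made up for this example, and it relies on the four-digit naming (0001.tif, 0002.tif, ...) described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: renames 0001.tif, 0002.tif, ... to <prefix>-page-1.tif, ... */
final class SplitPageRenamer {

    static List<Path> renamePages(Path outputDir, String prefix) throws IOException {
        List<Path> pages;
        try (Stream<Path> entries = Files.list(outputDir)) {
            pages = entries
                    // Only touch files that follow the four-digit naming scheme.
                    .filter(p -> p.getFileName().toString().matches("\\d{4}\\.tif"))
                    // Zero-padded names sort lexicographically in page order.
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> renamed = new ArrayList<>();
        int pageNumber = 1;
        for (Path page : pages) {
            Path target = outputDir.resolve(prefix + "-page-" + pageNumber++ + ".tif");
            // Files.move fails if the target already exists, so nothing is overwritten silently.
            renamed.add(Files.move(page, target));
        }
        return renamed;
    }
}
```

For example, SplitPageRenamer.renamePages(Paths.get("output"), "invoice") would turn output/0001.tif into output/invoice-page-1.tif.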