Problem description
I am using wand and pytesseract to get the text of PDFs uploaded to a Django website, like so:
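# Imports assumed here; they were not shown in the original post
import io

import pytesseract
from PIL import Image as PI
from wand.image import Image

# Rasterize the whole PDF at 300 DPI (every page is held in RAM at once)
image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')

req_image = []
final_text = []

# Copy each page into its own PNG blob
for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))

# Run OCR on each PNG blob
for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)

return " ".join(final_text)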
I have it running in Celery on a separate EC2 server. However, because image_pdf grows to approximately 4 GB for even a 13.7 MB PDF file, the process is being stopped by the OOM killer. Instead of paying for more RAM, I want to try to reduce the memory used by wand and ImageMagick. Since the task is already asynchronous, I don't mind increased computation time. I have skimmed http://www.imagemagick.org/Usage/files/#massive, but am not sure whether it can be implemented with wand. Another possible fix would be a way to open a PDF in wand one page at a time rather than putting the full image into RAM at once. Alternatively, how could I interface with ImageMagick directly from Python so that I could use these memory-limiting techniques?
Recommended answer
Remember that the wand library integrates with the MagickWand API, which in turn delegates PDF encoding/decoding work to ghostscript. Both MagickWand and ghostscript allocate additional memory resources, and do their best to deallocate them at the end of each task. However, if routines are initialized by Python and held by a variable, it's more than possible to introduce memory leaks.
Here are some tips to ensure memory is managed correctly.
Use with context management for all Wand assignments. This will ensure all resources pass through the __enter__ and __exit__ management handlers.
Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy and encode the image, and Python will hold the resulting data in addition to the originating wand instance. This is usually fine in a dev environment, but can grow out of hand quickly in a production setting.
Avoid Image.sequence. This is another copy-heavy routine, and results in Python holding a bunch of memory resources. Remember that ImageMagick manages image stacks very well, so if you're not reordering or manipulating individual frames, it's best to use MagickWand methods and not involve Python.
Each task should be an isolated process that can shut down cleanly on completion. This shouldn't be an issue for you with celery as a queue worker, but it's worth double-checking the thread/worker configuration and docs.
Watch out for resolution. A PDF resolution of 300 @ Q16 (16-bit quantum depth) would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra or unneeded colors, channels, data, etc.
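To make the resolution point concrete, here is a rough back-of-the-envelope sketch of how much memory one uncompressed page can occupy. The page size (US Letter), the 4-channel layout, and the 2 bytes per channel for a Q16 build are assumptions for illustration; real usage depends on the ImageMagick build and on ghostscript's own overhead.

```python
def raster_bytes(width_in, height_in, dpi, channels=4, bytes_per_channel=2):
    """Rough size in bytes of one uncompressed page raster.

    Assumes `channels` channels per pixel and `bytes_per_channel` bytes per
    channel (2 bytes approximates a Q16 build). These are illustrative
    defaults, not values reported by ImageMagick.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bytes_per_channel


# Hypothetical US Letter page (8.5 x 11 in) rendered at 300 DPI
per_page_300 = raster_bytes(8.5, 11, 300)

# The same page rendered at 100 DPI
per_page_100 = raster_bytes(8.5, 11, 100)

print(per_page_300, per_page_100)
```

Under these assumptions a single page at 300 DPI is about 67 MB, so a hypothetical 60-page document would already be around 4 GB, while dropping to 100 DPI cuts the pixel count by a factor of nine.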
Here's an example of how I would approach this. Note that I'll leverage ctypes to manage the image stack directly, without additional Python resources.
import ctypes

import pytesseract
from wand.image import Image
from wand.api import library

# Tell wand about C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    library.MagickResetIterator(context.wand)
    # Step through the image stack inside MagickWand, one page at a time
    while library.MagickNextImage(context.wand) != 0:
        data = context.make_blob("RGB")
        # Depending on the pytesseract version, the raw bytes may first need
        # to be wrapped in an image object that image_to_string() accepts
        text = pytesseract.image_to_string(data)
        final_text.append(text)
return " ".join(final_text)
Of course, your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs and tesseract directly, and eliminate all the Python wrappers.
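As a rough sketch of that subprocess route, the snippet below renders one page at a time with gs and hands the resulting PNG to tesseract, so only a single page raster exists at any moment. The helper names, the grayscale output device, and the temporary PNG path are assumptions for illustration; it also assumes both gs and tesseract are installed and on the PATH.

```python
import subprocess


def gs_page_command(pdf_path, page, png_path, dpi=100):
    """Build a Ghostscript command that renders one page (1-based) to a grayscale PNG."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray",
        f"-r{dpi}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def tesseract_command(png_path):
    """Build a tesseract command that writes the recognized text to stdout."""
    return ["tesseract", png_path, "stdout"]


def ocr_page(pdf_path, page, png_path, dpi=100):
    """Render a single page with gs, then OCR it with tesseract (hypothetical helper)."""
    subprocess.run(gs_page_command(pdf_path, page, png_path, dpi), check=True)
    result = subprocess.run(
        tesseract_command(png_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A caller would loop over page numbers and join the returned strings, for example `" ".join(ocr_page("in.pdf", n, "/tmp/page.png") for n in range(1, page_count + 1))`, where `page_count` has to be obtained separately.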