How to reduce wand memory usage?

Question

I am using wand and pytesseract to get the text of pdfs uploaded to a django website like so:

image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')

req_image = []
final_text = []

for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))

for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)

return " ".join(final_text)

I have it running in celery on a separate EC2 server. However, because image_pdf grows to approximately 4 GB for even a 13.7 MB pdf file, it is being stopped by the OOM killer. Instead of paying for more RAM, I want to try to reduce the memory used by wand and ImageMagick. Since it is already async, I don't mind increased computation times. I have skimmed this: http://www.imagemagick.org/Usage/files/#massive, but am not sure if it can be implemented with wand. Another possible fix is a way to open a pdf in wand one page at a time rather than putting the full image into RAM at once. Alternatively, how could I interface with ImageMagick directly using python so that I could use these memory limiting techniques?

Recommended answer

Remember that the wand library integrates with the MagickWand API, which in turn delegates PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocate additional memory resources, and do their best to deallocate at the end of each task. However, if routines are initialized by python, and held by a variable, it's more than possible to introduce memory leaks.

Here are some tips to ensure memory is managed correctly.

  1. Use with context management for all Wand assignments. This will ensure all resources pass through the __enter__ & __exit__ management handlers.

  2. Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy & encode the image, and python will hold the resulting data in addition to the originating wand instance. Usually fine in a dev environment, but it can grow out of hand quickly in a production setting.

  3. Avoid Image.sequence. This is another copy-heavy routine, and results in python holding a bunch of memory resources. Remember ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python.

  4. Each task should be an isolated process that can cleanly shut down on completion. This shouldn't be an issue for you w/ celery as a queue worker, but it's worth double checking the thread/worker configuration + docs.

  5. Watch out for resolution. A pdf resolution of 300 DPI @ Q16 (16-bit quantum depth) would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors / channels / data / etc.
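To put the resolution point in numbers, here is a rough back-of-the-envelope sketch. It assumes US Letter pages (8.5 x 11 in) and 4 channels per pixel stored at the full quantum depth; the page size, the channel count, and the helper name estimate_page_bytes are my assumptions for illustration, not details from the question, and the real pixel cache layout depends on the ImageMagick build.

```python
def estimate_page_bytes(width_in, height_in, dpi, quantum_bits=16, channels=4):
    """Rough size in bytes of one uncompressed raster page held in memory.

    Assumes every pixel stores `channels` values of `quantum_bits` bits each.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_pixel = channels * quantum_bits // 8
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # Compare the resolution from the question with the lower one used in the answer.
    for dpi in (300, 100):
        size = estimate_page_bytes(8.5, 11, dpi)
        print(f"{dpi} DPI: {size / 1024 ** 2:.1f} MiB per page")
```

Pixel count scales with the square of the resolution, so under these assumptions a 300 DPI page costs nine times what a 100 DPI page does. Multiply by the page count (and by any extra copies python holds) to see how a small pdf can balloon into gigabytes.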

Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources.

import ctypes

import pytesseract
from wand.image import Image
from wand.api import library

# Tell wand about C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    # Reduce to 8 bits per channel before extracting pixel data
    context.depth = 8
    # Rewind so the first MagickNextImage call lands on the first page
    library.MagickResetIterator(context.wand)
    while library.MagickNextImage(context.wand) != 0:
        data = context.make_blob("RGB")
        # NOTE: pytesseract may need these raw bytes wrapped in a PIL image first
        text = pytesseract.image_to_string(data)
        final_text.append(text)
return " ".join(final_text)

Of course your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers.
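A minimal sketch of that subprocess route, assuming the gs and tesseract executables are on the PATH and that the pdf has already been saved to disk. The function names, the pnggray device, the 100 DPI default, and the page-%04d.png file pattern are my own choices for illustration, so adjust them to your setup.

```python
import subprocess
import tempfile
from pathlib import Path


def build_gs_command(pdf_path, out_pattern, dpi=100):
    """Ghostscript call that renders every page to a grayscale PNG on disk."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray",               # grayscale output, no color channels
        f"-r{dpi}",                       # render resolution in DPI
        f"-sOutputFile={out_pattern}",    # %04d is replaced by the page number
        str(pdf_path),
    ]


def build_tesseract_command(image_path):
    """Tesseract call that writes the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path, dpi=100):
    """Rasterize to temp files, then OCR one page per tesseract process."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        pattern = str(Path(tmp) / "page-%04d.png")
        subprocess.run(build_gs_command(pdf_path, pattern, dpi), check=True)
        for page in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                build_tesseract_command(page),
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout)
    return " ".join(texts)
```

Calling ocr_pdf("/path/to/upload.pdf") from the celery task would keep the rendered pages on disk rather than in the worker's memory, and each gs or tesseract process releases its memory when it exits.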
