This article shows how to extract images from a PDF using Python.
Problem description
How can we extract the images (and only the images) from a PDF?
I tried many online tools, but none of them works universally. For most PDFs, the tool took a screenshot of the whole page instead of extracting the image itself. PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf
Recommended answer
Here is some code that reads a PDF file using pyPdf, extracts the images, and yields each one as a PIL.Image. You will need to adapt it to your needs; it is only here to demonstrate how to walk the object tree.
import io
import pyPdf
import PIL.Image


def extract_images(infile_name):
    """Yield each FlateDecode or DCTDecode image in the PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data.
                    # The 'RGB' mode assumes 8-bit, three-channel samples.
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed image: the stream is JPEG data,
                    # so PIL can open it directly
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


infile_name = 'my.pdf'
# Do something with each extracted image, for example inspect its size
for img in extract_images(infile_name):
    print(img.size)
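The code above hard-codes the 'RGB' mode for FlateDecode streams, which only holds for 8-bit `/DeviceRGB` images. As a minimal sketch, assuming the image uses one of the three simple device color spaces, the mode could instead be chosen from the object's own metadata. The helper names `pil_mode` and `expected_size` are hypothetical and are not part of pyPdf or PIL. Indexed, ICC-based, and other color spaces are deliberately left unhandled.

```python
# Hypothetical helpers (not part of pyPdf or PIL): pick a PIL mode from the
# /ColorSpace and /BitsPerComponent entries of an image XObject.

# Simple device color spaces mapped to (PIL mode, channels per pixel).
_DEVICE_MODES = {
    '/DeviceRGB': ('RGB', 3),
    '/DeviceGray': ('L', 1),
    '/DeviceCMYK': ('CMYK', 4),
}


def pil_mode(color_space, bits_per_component=8):
    """Return the PIL mode for an 8-bit device color space, else None."""
    if bits_per_component != 8:
        return None  # other bit depths need extra unpacking, not covered here
    entry = _DEVICE_MODES.get(str(color_space))
    return entry[0] if entry else None


def expected_size(color_space, width, height):
    """Byte count of an uncompressed 8-bit image with these dimensions."""
    channels = _DEVICE_MODES[str(color_space)][1]
    return width * height * channels
```

Inside the loop, and assuming both entries are present on the image object, `pil_mode(vobj['/ColorSpace'], vobj['/BitsPerComponent'])` could replace the literal 'RGB', with the image skipped when the result is `None`. Comparing `len(buf)` against `expected_size(...)` is a cheap sanity check before calling `PIL.Image.frombytes`.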