Problem description
I have a PDF full of quotes:
https://www.pdf-archive.com/2017/03/22/test/
I can extract the text in python using the following code:
import PyPDF2
pdfFileObj = open('example.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print (pageObj.extractText())
This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators, so that each quote comes out separately?
Recommended answer
If you just want to extract the quotes from the PDF text, you can use a regular expression to find every passage enclosed in double quotation marks.
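import re
import PyPDF2

# Legacy PyPDF2 (< 3.0) API; newer releases use PdfReader, reader.pages[0], and extract_text().
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    text = str(pageObj.extractText())

# Find every run of text enclosed in straight double quotes.
quotes = re.findall(r'"[^"]*"', text)
for quote in quotes:
    print(quote)
    print()  # blank line between quotes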
Or simply:
quotes = re.findall(r'"[^"]*"', text)
print(quotes)
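For readers on a current Python 3 setup, here is a minimal sketch of the same regex approach, not a definitive implementation. It assumes the newer pypdf package (the successor to PyPDF2) is installed for the PDF step. The names find_quotes and quotes_from_pdf are hypothetical helpers introduced for illustration. Matching curly quotation marks is also an assumption, since the original pattern only handles straight ones.

```python
import re

# Matches text between straight double quotes, or between curly double quotes.
# Curly-quote support is an assumption; the original answer only matches '"'.
QUOTE_PATTERN = re.compile('"[^"]*"|\u201c[^\u201d]*\u201d')


def find_quotes(text):
    """Return every quoted passage in text, with whitespace runs collapsed."""
    return [" ".join(match.split()) for match in QUOTE_PATTERN.findall(text)]


def quotes_from_pdf(path, page_number=0):
    """Extract quoted passages from one page of a PDF (requires pypdf)."""
    # Imported lazily so find_quotes can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    text = reader.pages[page_number].extract_text() or ""
    return find_quotes(text)


if __name__ == "__main__":
    # Stand-in for extracted PDF text, so the regex can be tried without a file.
    sample = '"First quote\nspans two lines." Author A "Second quote." Author B'
    for quote in find_quotes(sample):
        print(quote)
        print()  # blank line between quotes
```

Collapsing whitespace is a design choice: extracted PDF text often contains line breaks inside a sentence, and joining them makes each quote print on one line. With a real file, the call would look like quotes_from_pdf('test.pdf').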