python - Can't get correctly formatted results from a regex on PyPDF2 text

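I want to extract every instance of a specific word from a PDF, for example "math".
So far I am using PyPDF2 to convert the PDF to text and then running a regular expression over it to find what I need. This is the example PDF.

When I run my code, instead of returning the matches for my 'math' regex pattern, it returns the whole page's string. Please help, thanks.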

#First Change Current Working Directory to desktop

import os
os.chdir('/Users/Hussein/Desktop')         #File is located on Desktop


#Second is the PyPDF2

import PyPDF2
pdfFileObj=open('TEST1.pdf','rb')          #Opening the File
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pageObj=pdfReader.getPage(3)               #For the test I only need page 3
TextVersion=pageObj.extractText()
print(TextVersion)



#Third is the Regular Expression

import re
match=re.findall(r'math',TextVersion)
for match in TextVersion:
      print(match)

Instead of getting only the instances of "math", this is what I get:

I
n
t
r
o
d
u
c
t
i
o
n

etc.

Best answer

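The TextVersion variable holds the text as a string. When you iterate over it in a for loop, it yields the text one character at a time, as you have seen. Your loop also reuses the name match as its loop variable, so the result of findall is overwritten and never used. The findall function returns a list of matches, so if you loop over that list instead, you get each matched word (in your test they will all be identical).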

import re

for match in re.findall(r'math',TextVersion):
      print(match)

The result returned by findall will look something like this:

["math", "math", "math"]

So your output will be:

math
math
math
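
If the goal is to catch "Math" and "MATH" as well, while skipping words that merely contain "math", the pattern can be tightened. The following is only a minimal sketch: sample_text is a made-up string standing in for whatever extractText() returns, and the names pattern, matches and first_chars are illustrative, not part of the original code.

```python
import re

# Hypothetical sample text standing in for the output of extractText().
sample_text = "Introduction\nMath is fun. I like math, but not aftermath. MATH!"

# Iterating over a str yields one character at a time,
# which is why the original loop printed single letters.
first_chars = [ch for ch in sample_text][:5]
print(first_chars)

# \b restricts matches to whole words, so "aftermath" is skipped;
# re.IGNORECASE also accepts "Math" and "MATH".
pattern = re.compile(r"\bmath\b", re.IGNORECASE)
matches = pattern.findall(sample_text)

# Loop over the list of matches, not over the text itself.
for word in matches:
    print(word)

# Number of whole-word occurrences found.
print(len(matches))
```

Using len(matches) gives the count directly if only the number of occurrences is needed.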

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32095476/
