Problem description
I have a scanned PDF file and I am trying to extract text from it. I tried to run OCR on it with pypdfocr, but I get an error:
After searching, I found this solution, "Linking Ghostscript to pypdfocr in Windows Platform", and I tried downloading Ghostscript and adding it to my environment variables, but I still get the same error.
How can I search for text in my scanned PDF file using Python?
Thanks.
Edit: here is my code sample:
import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract
from PIL import Image
path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}

def new_init(self, kk):
    self.lang = 'heb'
    self.binary = "tesseract"
    self.msgs = {
        'TS_MISSING': """
            Could not execute %s
            Please make sure you have Tesseract installed correctly
            """ % self.binary,
        'TS_VERSION': 'Tesseract version is too old',
        'TS_img_MISSING': 'Cannot find specified tiff file',
        'TS_FAILED': 'Tesseract-OCR execution failed!',
    }

pypdfocr_tesseract.PyTesseract.__init__ = new_init
wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)

def secFile(filename, oldfilename):
    # Render the PDF pages to images with Ghostscript
    wow.make_img_from_pdf(filename)
    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff")
    # Run Tesseract on each TIFF to produce hOCR (.html) output
    files = glob.glob("PATH" + '*.tiff')
    for file in files:
        tt.make_hocr_from_pnm(file)
    # Concatenate the hOCR output of all pages
    pdftxt = ""
    files = glob.glob("PATH" + '*.html')
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt, oldfilename)
    # Delete the intermediate files
    folder = "PATH"
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception as e:
            print(e)

def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):
    pdffile = filename
    output1 = pdffile.replace(".pdf", "_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)
    input1 = pdffile.replace(".pdf", "_ocr.pdf")
    # The -o flag belongs inside the command string
    os.system("pdf2txt -o " + output1 + " " + input1)
    with open(output1) as myfile:
        pdftxt = "".join(line.rstrip() for line in myfile)
    findNum(pdftxt, filename)

def findNum(pdftxt, pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)
    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

i = 0
files = glob.glob(path + '\\*.pdf')
print(path)
print(files)
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print(file)
            pdf2ocr(file)
            ocr2txt(file)
        else:
            # Non-ASCII file names are copied to a numbered name first
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print(newname)
            secFile(newname, file)
        i = i + 1

files = glob.glob(path + '\\' + '*_ocr.pdf')
for file in files:
    print(file)
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)
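
The code above runs `pypdfocr` and `pdf2txt` through `os.system` with concatenated strings, which tends to break on file names that contain spaces. A minimal sketch of the same two calls with argument lists and `subprocess` might look like this. The helper names `build_commands` and `run_pipeline` are hypothetical, the command names and flags are taken from the code above, and the `_ocr.pdf` output name follows the same assumption that code makes.

```python
import os
import subprocess


def build_commands(pdf_path, out_dir, lang="heb"):
    """Return the argument lists for the OCR step and the text-dump step."""
    base, _ = os.path.splitext(pdf_path)
    # Assumption (same as the code above): pypdfocr writes <name>_ocr.pdf
    # next to the input file.
    ocr_pdf = base + "_ocr.pdf"
    txt_out = os.path.join(out_dir, os.path.basename(base) + "_ocr.txt")
    ocr_cmd = ["pypdfocr", "-l", lang, pdf_path]
    txt_cmd = ["pdf2txt", "-o", txt_out, ocr_pdf]
    return ocr_cmd, txt_cmd


def run_pipeline(pdf_path, out_dir):
    """Run both commands in order; raises CalledProcessError on failure."""
    for cmd in build_commands(pdf_path, out_dir):
        # Each argument is passed separately, so spaces in paths are safe.
        subprocess.check_call(cmd)
```

With argument lists, a failing Ghostscript or Tesseract call also surfaces as a Python exception instead of being silently ignored.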
Recommended answer
Take a look at this library: https://pypi.python.org/pypi/pypdfocr but keep in mind that a PDF file can also have images in it. You may be able to analyse the page content streams. Some scanners break a single scanned page up into several images, so you won't get the text with Ghostscript.
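
One way around pages that are split into several images is to render each whole page to a single image and OCR that. The sketch below is not pypdfocr: it swaps in the `pdf2image` and `pytesseract` packages and assumes they are installed together with Poppler, Tesseract and the Hebrew language data (`heb`, as in the question). The function names `ocr_pdf_pages` and `search_pages` are hypothetical.

```python
import re


def ocr_pdf_pages(pdf_path, lang="heb", dpi=300):
    """Render every page to an image and OCR it; returns one string per page."""
    # Imported lazily so search_pages() works without the OCR stack.
    from pdf2image import convert_from_path  # assumption: pdf2image + Poppler
    import pytesseract                       # assumption: pytesseract + Tesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def search_pages(page_texts, pattern):
    """Return (page_number, matched_text) pairs, with pages numbered from 1."""
    regex = re.compile(pattern)
    hits = []
    for number, text in enumerate(page_texts, start=1):
        for match in regex.finditer(text):
            hits.append((number, match.group(0)))
    return hits


# Example use (the file name is a placeholder):
# texts = ocr_pdf_pages("scanned.pdf")
# print(search_pages(texts, r"\b\d+\b"))
```

The pattern `\b\d+\b` is the one the question uses to pull out numbers. Any other regular expression can be passed instead.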