Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

Question

I am trying to use pytesseract on Jupyter Notebook.

  • Windows 10 x64
  • Running Jupyter Notebook with administrative privileges (Anaconda3, Python 3.6.1)
  • The working directory containing the TIFF file is on a different drive (Z:)

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful, but I am still missing something:
https://github.com/madmaze/pytesseract/issues/50
https://github.com/madmaze/pytesseract/issues/64

Thank you for your time!

Recommended answer

From your post, I see two possible issues.

  1. All trained language data should be stored in the directory given by TESSDATA_PREFIX, a Windows environment variable, which in your case is C:\Program Files (x86)\Tesseract-OCR\tessdata.

  2. The trained English data for Tesseract is named eng.traineddata (i.e. the language code is 'eng', not 'en') unless you renamed the file. Refer to the Tesseract Data Files documentation for more information.
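To check which language codes are actually installed, one option is to list the .traineddata files in the tessdata directory. The following is only a sketch: the helper name available_languages is made up for illustration, and the directory path is the one from the question, so adjust it to your installation.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir.

    Hypothetical helper for illustration; it only inspects file names.
    """
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        # A missing directory means no language data can be found there.
        return []
    # "eng.traineddata" -> "eng"
    return sorted(p.stem for p in directory.glob("*.traineddata"))


if __name__ == "__main__":
    # Path taken from the question; this is an assumption about your setup.
    tessdata = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"
    langs = available_languages(tessdata)
    print("Installed languages:", langs)
    print("'en' available:", "en" in langs)
    print("'eng' available:", "eng" in langs)
```

If 'eng' appears in the list and 'en' does not, passing lang='eng' to image_to_string is the likely fix.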

In addition, if the image file cannot be found, you can pass the full file path (e.g. 'z:\\path\\to\\image') to Image.open().
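Putting both points together, a corrected version of the call might look like the sketch below. The function names ocr_image and tessdata_config are made up for illustration, the installation paths are copied from the question, and the example image path is a placeholder; only pytesseract.image_to_string, pytesseract.pytesseract.tesseract_cmd and Image.open come from the original code.

```python
from pathlib import Path

# Paths copied from the question; adjust them to your own installation.
TESSERACT_EXE = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
TESSDATA_DIR = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"


def tessdata_config(tessdata_dir):
    """Build the --tessdata-dir option, quoting the path because it contains spaces."""
    return f'--tessdata-dir "{tessdata_dir}"'


def ocr_image(image_path, lang="eng"):
    """Run OCR on one image, using 'eng' (not 'en') as the default language code."""
    image_path = Path(image_path)
    if not image_path.is_file():
        # Fail early with a clear message instead of a less obvious error later.
        raise FileNotFoundError(f"Image not found: {image_path}")

    # Imported here so the path check above works even without the OCR packages.
    from PIL import Image
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = TESSERACT_EXE
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=tessdata_config(TESSDATA_DIR)
        )
```

A call would then pass the full path on the other drive, for example ocr_image(r"Z:\path\to\Multi_page24bpp.tif"), where the directory part is a placeholder for wherever the TIFF actually lives.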

Hope this helps.

