Problem description
I am trying to use pytesseract in a Jupyter Notebook.
- Windows 10 x64
- Jupyter Notebook run with administrator privileges (Anaconda3, Python 3.6.1)
- The working directory containing the TIFF file is on a different drive (Z:)
When I run the following code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config=tessdata_dir_config))
I get the following error:
TesseractError Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
12
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
123 if status:
124 errors = get_errors(error_string)
--> 125 raise TesseractError(status, errors)
126 f = open(output_file_name, 'rb')
127 try:
TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')
I found these two references helpful, but I am missing something:
https://github.com/madmaze/pytesseract/issues/50
https://github.com/madmaze/pytesseract/issues/64
Thank you for your time!
Recommended answer
From your post, I can see two possible issues.
All the trained language data should be saved in the directory given by TESSDATA_PREFIX, a Windows environment variable, which in your case is C:\Program Files (x86)\Tesseract-OCR\tessdata.
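One way to check this is to list which trained data files the tessdata directory actually contains. The following is only a sketch using the standard library; the helper name `available_languages` is made up for illustration and is not part of pytesseract, and the path in the usage comment is the install location assumed in the question.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        # A missing directory means no language data can be loaded from it.
        return []
    # "eng.traineddata" has the stem "eng", which is the code passed as lang=.
    return sorted(p.stem for p in directory.glob("*.traineddata"))


# Example usage (assumed install location, adjust to your machine):
# print(available_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata"))
```

If the returned list does not contain the code you pass as lang, Tesseract will not be able to open the matching data file.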
The trained English data for tesseract is named eng.traineddata (i.e. the language code is 'eng', not 'en') unless you renamed the file. Refer to the Tesseract Data Files page for more information.
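Applied to the code in the question, the change amounts to passing lang='eng'. Below is a minimal sketch, assuming the same install paths as in the question; `tessdata_config` and `ocr_english` are hypothetical helper names introduced here, while `pytesseract.image_to_string` and `Image.open` are the calls already used in the question.

```python
def tessdata_config(tessdata_dir):
    """Build the --tessdata-dir option string in the same form used in the question."""
    return '--tessdata-dir "{}"'.format(tessdata_dir)


def ocr_english(image_path, tesseract_cmd, tessdata_dir):
    """Run OCR on image_path with the 'eng' language code instead of 'en'."""
    # Imported lazily so tessdata_config can be used without these packages installed.
    from PIL import Image
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang="eng", config=tessdata_config(tessdata_dir)
        )


# Example usage (paths taken from the question, adjust to your machine):
# install_dir = "C:\\Program Files (x86)\\Tesseract-OCR"
# print(ocr_english("Multi_page24bpp.tif",
#                   install_dir + "\\tesseract.exe",
#                   install_dir + "\\tessdata"))
```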
In addition, if the image file cannot be located, you may pass the full file path (e.g. 'z:\\path\\to\\image') to Image.open() so that pytesseract can read the image.
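To rule out a path problem before OCR is even attempted, the full path can be built and checked first. This is a sketch using only the standard library; `resolve_image` is a hypothetical helper name, and the directory in the usage comment is the placeholder path from the answer, not a real location.

```python
from pathlib import Path


def resolve_image(filename, base_dir):
    """Return the absolute path of filename inside base_dir, or raise if it is missing."""
    path = Path(base_dir) / filename
    if not path.is_file():
        # Fail early with the full path so a wrong drive or folder is obvious.
        raise FileNotFoundError("Image not found: {}".format(path))
    return path.resolve()


# Example usage (placeholder directory, adjust to your machine):
# from PIL import Image
# image = Image.open(resolve_image("Multi_page24bpp.tif", "z:\\path\\to"))
```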
Hope this helps.