Why does pytesseract raise an error for Arabic?

Problem description

I want to use pytesseract with Arabic. I have ara.traineddata in the /usr/share/tesseract/tessdata/ path on my system, and I have already installed the tesseract package.

This is my code:

 import pytesseract
 from PIL import Image
 pytesseract.image_to_string(Image.open('test_arabic.png'), config='', lang="ara")

And I get this error:

TesseractError                            Traceback (most recent call last)

in

----> 1 pytesseract.image_to_string(Image.open('test_persian.png'), config='', lang="ara")

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
    368     args = [image, 'txt', lang, config, nice, timeout]
    369
--> 370     return {
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in <lambda>()
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 373         Output.STRING: lambda: run_and_get_output(*args),
    374     }[output_type]()
    375

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
    280         }
    281
--> 282         run_tesseract(**kwargs)
    283         filename = kwargs['output_filename_base'] + extsep + extension
    284         with open(filename, 'rb') as output_file:

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
    256     with timeout_manager(proc, timeout) as error_string:
    257         if proc.returncode:
--> 258             raise TesseractError(proc.returncode, get_errors(error_string))
    259
    260

TesseractError: (1, 'read_params_file: parameter not found:')


Thanks for your help.

Solution

I suggest using the proper language model and the latest version:

For Windows 10:

Install tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe (64 bit).

To validate the installation, execute the following in PowerShell or a cmd terminal:

tesseract -v

It will output something like this: tesseract v5.0.0-alpha.20200328

For macOS:

brew install tesseract

To validate the installation, execute the following in a terminal:

tesseract -v

It will output something like this, including the installed image libraries:

tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
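The same version check can be scripted from Python, which helps when the interpreter may see a different PATH than the terminal. The following is a minimal sketch using only the standard library. The helper names `parse_tesseract_version` and `installed_tesseract_version` are made up for this example, and it assumes the first line of the output looks like the samples shown in this answer.

```python
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return the version token from the first line of `tesseract -v` output."""
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else ""
    parts = first_line.split()
    if len(parts) < 2 or parts[0] != "tesseract":
        return None
    # Some builds print "v5.0.0-...", others print "4.1.1"; drop a leading "v".
    version = parts[1]
    return version[1:] if version.startswith("v") else version


def installed_tesseract_version():
    """Run `tesseract -v` if the binary is on PATH; return None otherwise."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "-v"], capture_output=True, text=True)
    # Some releases write the version banner to stderr, so check both streams.
    return parse_tesseract_version(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

If this prints None, the Python process probably cannot find the tesseract binary at all, which is a different problem from a missing or mismatched language file.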

If you are not sure about the path, simply copy the ara.traineddata file into the same folder as your Python .py file:

import os

import pytesseract
from PIL import Image

# ara.traineddata has been copied into the same directory as this script,
# so TESSDATA_PREFIX is left empty.
os.environ["TESSDATA_PREFIX"] = ""
print(os.getenv("TESSDATA_PREFIX"))
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
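As an alternative to copying the file next to the script, Tesseract's `--tessdata-dir` command-line option can be passed through the `config` argument of `image_to_string`. This is a minimal sketch rather than part of the original answer. The helper names `tessdata_config` and `ocr_arabic` are invented here, and the directory is the one mentioned in the question, so adjust it to wherever ara.traineddata actually lives.

```python
def tessdata_config(tessdata_dir):
    """Build a config string that points Tesseract at a tessdata directory."""
    # Quote the path so directories containing spaces survive argument splitting.
    return f'--tessdata-dir "{tessdata_dir}"'


def ocr_arabic(image_path, tessdata_dir):
    """OCR an image with the Arabic model found in tessdata_dir."""
    # Imported here so the helper above works without the OCR packages installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang="ara", config=tessdata_config(tessdata_dir)
        )


if __name__ == "__main__":
    # Assumed location, taken from the question; adjust to your system.
    print(tessdata_config("/usr/share/tesseract/tessdata"))
```

A call such as `ocr_arabic('cropped.png', '/usr/share/tesseract/tessdata')` would then read the model from that directory without changing any environment variable.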

For Linux/Ubuntu OS:

sudo apt-get install tesseract-ocr

The validation and run code are the same as for macOS.

Also make sure the path to the tessdata directory is correct.
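One way to check the path is to look for the language file in the places Tesseract is likely to search. The sketch below is an illustration under assumptions: `find_traineddata` and `default_candidates` are hypothetical helpers, and the candidate list only covers the current directory, the directory from the question, and TESSDATA_PREFIX if it is set. Your distribution may use another location.

```python
import os
from pathlib import Path


def find_traineddata(lang, candidate_dirs):
    """Return every existing <lang>.traineddata file among candidate_dirs."""
    found = []
    for directory in candidate_dirs:
        path = Path(directory) / f"{lang}.traineddata"
        if path.is_file():
            found.append(path)
    return found


def default_candidates():
    """Current directory, the path from the question, and TESSDATA_PREFIX if set."""
    candidates = [Path.cwd(), Path("/usr/share/tesseract/tessdata")]
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        # Depending on the Tesseract version, the prefix is either the tessdata
        # directory itself or its parent, so try both.
        candidates += [Path(prefix), Path(prefix) / "tessdata"]
    return candidates


if __name__ == "__main__":
    for path in find_traineddata("ara", default_candidates()):
        print(path)
```

If nothing is printed, the file is not in any of the checked directories. If several paths are printed, the copies may come from different Tesseract releases, and a model that does not match the installed version is worth ruling out.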

This code works fine if the ara.traineddata file is downloaded successfully:

import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
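If the call above still fails, it may help to confirm that Tesseract itself can see the `ara` model by asking the command-line tool for its language list with `tesseract --list-langs`. The following is a rough sketch using only the standard library. The function names are invented for this example, and the parser assumes the first non-empty line of the output is a header such as "List of available languages (3):".

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is assumed to be the header, so skip it.
    return lines[1:]


def available_languages():
    """Return the languages reported by the tesseract binary, or [] if absent."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some releases print the list to stderr, so check both streams.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = available_languages()
    print("ara available:", "ara" in langs)
```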

You can follow this tutorial for details. Here is the demo output of this tutorial, which also uses Arabic.

