Problem Description
I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text including subscript-style numbers - the subscript number is interpreted as a letter instead.
For example, in the basic image:
I want to extract the text as "CH3", i.e. I am not concerned about knowing that the number 3 was a subscript in the image.
My attempt at this using tesseract is:
import cv2
import pytesseract
img = cv2.imread('test.jpeg')
# Note that I have reduced the region of interest to the known
# text portion of the image
text = pytesseract.image_to_string(
    img[200:300, 200:320], config='-l eng --oem 1 --psm 13'
)
print(text)
Unfortunately, this will incorrectly output
'CHs'
It's also possible to get 'CHa', depending on the psm parameter.
I suspect that this issue is related to the "baseline" of the text being inconsistent across the line, but I'm not certain.
How can I accurately extract the text from this type of image?
Update - May 19, 2020
After seeing Achintha Ihalage's answer, which doesn't provide any configuration options to tesseract, I explored the psm options.
Since the region of interest is known (in this case, I am using EAST detection to locate the bounding box of the text), the psm config option for tesseract, which in my original code treats the text as a single line, may not be necessary. Running image_to_string against the region of interest given by the bounding box above gives the output
CH
3
which can, of course, be easily processed to get CH3.
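A minimal sketch of that post-processing step follows (the file name 'roi.jpg' is a stand-in for the EAST-detected crop, which isn't shown in the post): collapsing whitespace in the raw OCR output turns "CH\n3" into "CH3".

import cv2
import pytesseract

roi = cv2.imread('roi.jpg')  # hypothetical: the crop given by the EAST bounding box

raw = pytesseract.image_to_string(roi)
text = ''.join(raw.split())  # drops the newline between "CH" and "3"
print(text)  # expected: CH3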
Recommended Answer
You want to apply pre-processing to your image before feeding it into tesseract to increase the accuracy of the OCR. I use a combination of PIL and cv2 to do this here because cv2 has good filters for blur/noise removal (dilation, erosion, threshold), PIL makes it easy to enhance the contrast (distinguish the text from the background), and I wanted to show how pre-processing could be done using either (using both together is not 100% necessary though, as shown below). You can write this more elegantly - it's just the general idea.
import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance
img = cv2.imread('test.jpg')
def cv2_preprocess(image_path):
    img = cv2.imread(image_path)

    # convert to black and white if not already
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # remove noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # apply a Gaussian blur, then binarize with Otsu's threshold
    img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # this can be used for salt and pepper noise (not necessary here)
    #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    cv2.imwrite('new.jpg', img)
    return 'new.jpg'

def pil_enhance(image_path):
    image = Image.open(image_path)
    contrast = ImageEnhance.Contrast(image)
    contrast.enhance(2).save('new2.jpg')
    return 'new2.jpg'
img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))
text = pytesseract.image_to_string(img)
print(text)
Output:
CH3
The cv2 pre-process produces an image that looks like this:
The PIL enhancement gives you:
In this specific example, you can actually stop after the cv2_preprocess step because that is clear enough for the reader:
img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)
print(text)
Output:
CH3
But if you are working with things that don't necessarily start with a white background (i.e. grey-scaling converts to light grey instead of white) - I have found the PIL step really helps there.
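As a hedged sketch of that case (the file name and the enhancement factor of 3 are assumptions, not values from the answer), a stronger contrast boost can push a light-grey background towards white before the thresholding step:

from PIL import Image, ImageEnhance

# hypothetical input whose background grey-scales to light grey rather than white
image = Image.open('grey_background.jpg')

# a factor above the answer's enhance(2) pushes the background closer to white
ImageEnhance.Contrast(image).enhance(3).save('boosted.jpg')

The boosted image can then be run through cv2_preprocess from the answer above.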
The main point is that the methods to increase the accuracy of tesseract typically are:
- Fix the DPI (rescale)
- Fix the brightness/noise of the image
- Fix the text size/lines (skewed/warped text)
Doing one of these or all three of them will help... but the brightness/noise can be more generalizable than the other two (at least from my experience).
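As an illustrative sketch of the first point only (the scale factor and interpolation choice are assumptions to tune, not values from the answer), upscaling a small crop gives tesseract more pixels per character to work with:

import cv2
import pytesseract

img = cv2.imread('test.jpg', cv2.IMREAD_GRAYSCALE)

# upscale by 3x so small glyphs like the subscript have more pixels
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)

print(pytesseract.image_to_string(img))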