I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text including subscript-style numbers - the subscript number is interpreted as a letter instead.


I want to extract the text as "CH3", i.e. I am not concerned about knowing that the number 3 was a subscript in the image.

My attempt at this using tesseract is:

import cv2
import pytesseract

img = cv2.imread('test.jpeg')

# Note that I have reduced the region of interest to the known
# text portion of the image
text = pytesseract.image_to_string(
    img[200:300, 200:320], config='-l eng --oem 1 --psm 13'
)


Unfortunately, this will incorrectly output


It's also possible to get 'CHa', depending on the psm parameter.


I suspect that this issue is related to the "baseline" of the text being inconsistent across the line, but I'm not certain.


How can I accurately extract the text from this type of image?

Update - May 19, 2020

After seeing Achintha Ihalage's answer, which doesn't provide any configuration options to tesseract, I explored the psm options.
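One way to explore the psm options side by side is sketched below. The helper name `compare_psm_modes` and the particular modes tried are assumptions for illustration, not something from the original code; the OCR function is passed in so the helper works with `pytesseract.image_to_string` or any callable that accepts a `config` keyword.

```python
from typing import Any, Callable, Dict, Iterable


def compare_psm_modes(
    image: Any,
    ocr: Callable[..., str],
    modes: Iterable[int] = (6, 7, 8, 13),  # assumed subset of tesseract's psm values
) -> Dict[int, str]:
    """Run `ocr` once per page segmentation mode and collect the stripped text."""
    results = {}
    for psm in modes:
        # Same engine setting as the original call; only the psm value varies.
        results[psm] = ocr(image, config=f"--oem 1 --psm {psm}").strip()
    return results


# Usage with pytesseract (needs tesseract installed), left as a comment so the
# helper itself has no third-party dependency:
#
#   import cv2, pytesseract
#   roi = cv2.imread('test.jpeg')[200:300, 200:320]
#   for psm, text in compare_psm_modes(roi, pytesseract.image_to_string).items():
#       print(psm, repr(text))
```

Printing with `repr` makes stray line breaks and other whitespace in the OCR output visible, which helps when comparing modes.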

Since the region of interest is known (in this case, I am using EAST detection to locate the bounding box of the text), the psm config option for tesseract, which in my original code treats the text as a single line, may not be necessary. Running image_to_string against the region of interest given by the bounding box above gives the output




which can, of course, be easily processed to get CH3.
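A minimal sketch of that post-processing step follows. It assumes the raw OCR output contains the right characters but separated by whitespace or line breaks, or returned as Unicode subscript digits; the sample strings in the comments are hypothetical, not the actual output from the image above, and `flatten_formula` is a name made up for this example.

```python
import re

# Map Unicode subscript digits to ASCII digits, in case the OCR engine emits them.
_SUBSCRIPT_DIGITS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def flatten_formula(raw: str) -> str:
    """Normalize subscript digits and drop all whitespace from a single OCR token."""
    text = raw.translate(_SUBSCRIPT_DIGITS)
    # \s also matches the form feed that tesseract often appends to its output.
    return re.sub(r"\s+", "", text)


# Hypothetical inputs for illustration:
#   flatten_formula("CH\n3\n\x0c")
#   flatten_formula("CH₃")
```

Removing all whitespace is only appropriate when the region of interest is known to hold a single token such as a chemical formula; for multi-word text the spaces would need to be kept.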


You should pre-process the image before feeding it into tesseract to increase the accuracy of the OCR. I use a combination of PIL and cv2 here: cv2 has good filters for blur and noise removal (dilation, erosion, threshold), while PIL makes it easy to enhance the contrast (distinguishing the text from the background). I wanted to show how pre-processing could be done with either library; using both together is not strictly necessary, as shown below. You could write this more elegantly; it's just the general idea.

import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance

img = cv2.imread('test.jpg')

def cv2_preprocess(image_path):
  img = cv2.imread(image_path)

  # convert to grayscale
  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # remove noise
  kernel = np.ones((1, 1), np.uint8)
  img = cv2.dilate(img, kernel, iterations=1)
  img = cv2.erode(img, kernel, iterations=1)

  # blur to suppress Gaussian noise, then binarize with Otsu's threshold
  img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

  # this can be used for salt and pepper noise (not necessary here)
  #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

  cv2.imwrite('new.jpg', img)
  return 'new.jpg'

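def pil_enhance(image_path):
  image = Image.open(image_path)
  contrast = ImageEnhance.Contrast(image)
  # apply the enhancement and save the result; a factor of 2 is an
  # assumed value (1.0 leaves the image unchanged), tune it per image
  contrast.enhance(2).save('new2.jpg')
  return 'new2.jpg'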

img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))

text = pytesseract.image_to_string(img)



The cv2 pre-process produces an image that looks like this:

In this specific example, you can actually stop after the cv2_preprocess step because that is clear enough for the reader:

img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)



But if you are working with images that don't necessarily start with a white background (i.e. grey-scaling yields light grey instead of white), I have found that the PIL step really helps.


The main point is that the usual methods for increasing tesseract's accuracy are:

Doing any one of these, or all three, will help, but fixing the brightness/noise tends to generalize better than the other two (at least in my experience).

