Image processing to improve tesseract OCR accuracy


I've been using tesseract to convert documents into text. The quality of the documents varies widely, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that highly pixellated text, such as that generated by fax machines, is especially difficult for tesseract to process; presumably the jagged edges of the characters confound the shape-recognition algorithms.


What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixellated images and have seen some small improvement, but I'm hoping there is a more specific technique that would yield better results: say, a filter tuned to black-and-white images that smooths out irregular edges, followed by a filter that increases the contrast to make the characters more distinct.


Any general tips for someone who is a novice at image processing?


  1. Fix the DPI if needed; 300 DPI is the minimum.
  2. Fix the text size (e.g. 12 pt should be OK).
  3. Try to fix the text lines (deskew and dewarp the text).
  4. Try to fix the illumination of the image (e.g. no dark areas in the image).
  5. Binarize and de-noise the image.
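
As a rough illustration of steps 1 and 5, here is a minimal pure-Python sketch. It assumes the image is already a grayscale grid of 0..255 values, and it picks Otsu's method for binarization and a 3x3 median filter for de-noising; those two choices, and the helper names `upscale_factor`, `otsu_threshold`, `binarize` and `median3x3`, are my own illustrative picks rather than anything tesseract requires. In practice you would likely use an image library instead of nested lists.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, each value in 0..255


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Resize factor needed to reach the target DPI (never shrinks the scan)."""
    return max(1.0, target_dpi / current_dpi)


def otsu_threshold(img: Image) -> int:
    """Threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image, threshold: int) -> Image:
    """Map pixels at or below the threshold to 0 (ink), the rest to 255."""
    return [[0 if v <= threshold else 255 for v in row] for row in img]


def median3x3(img: Image) -> Image:
    """3x3 median filter with edge clamping; removes isolated specks."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = sorted(
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            row.append(window[4])  # middle of the 9 sorted values
        out.append(row)
    return out
```

For example, a 150 DPI scan would be enlarged by `upscale_factor(150)`, then passed through `binarize(img, otsu_threshold(img))` and `median3x3` before being handed to tesseract. Deskewing, dewarping and illumination correction are not covered by this sketch.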


There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can try TEXTCLEANER from Fred's ImageMagick Scripts.
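
To make the blur-then-sharpen idea concrete, here is a small pure-Python sketch of one way to smooth jagged, fax-style edges on a black-and-white image: apply a 3x3 box blur, then re-threshold at mid-gray. This is not what TEXTCLEANER does internally; the function name `smooth_edges`, the box-blur kernel and the 128 cutoff are assumptions chosen for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, each value in 0..255


def smooth_edges(img: Image, cutoff: int = 128) -> Image:
    """Box-blur with a 3x3 window (edges clamped), then re-threshold.

    Single-pixel bumps and notches along a character edge get averaged
    away by the blur and vanish when the result is made binary again.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = sum(
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            # Compare the 9-pixel sum against 9 * cutoff to avoid a division.
            row.append(0 if total < cutoff * 9 else 255)
        out.append(row)
    return out
```

On a vertical stroke with a single protruding ink pixel, the protrusion is removed and the straight edge is kept. A larger window smooths more aggressively but can also close up small gaps inside characters, so the window size would need tuning per document.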

If you are not a fan of the command line, you could try the open-source scantailor.sourceforge.net or the commercial bookrestorer.
