Problem description
I've been using tesseract to convert documents into text. The quality of the documents varies widely, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that highly pixelated text, such as that generated by fax machines, is especially difficult for tesseract to process; presumably the jagged edges of the characters confound the shape-recognition algorithms.
What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixelated images and have seen some small improvement, but I'm hoping there is a more specific technique that would yield better results. Say, a filter tuned to black and white images that smooths out irregular edges, followed by a filter that increases the contrast to make the characters more distinct.
Any general tips for someone who is a novice at image processing?
Recommended answer
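- Fix the DPI if needed; 300 DPI is the minimum.
- Fix the text size (e.g. 12 pt should be fine).
- Try to fix the text lines (deskew and dewarp the text).
- Try to fix the illumination of the image (e.g. no dark areas in the image).
- Binarize and denoise the image.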
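As a rough illustration of the first and last items in the list, here is a minimal pure-Python sketch. It is not from the original answer: the function names are made up for this example, Otsu's method is just one common way to pick a binarization threshold, and a 3x3 median filter is just one simple denoiser. Images are assumed to be lists of rows of integer grey levels from 0 (black) to 255 (white). In practice you would more likely use an image library, but the logic is the same.

```python
from statistics import median


def upscale_factor(current_dpi, target_dpi=300):
    """Resize factor needed to reach target_dpi; never shrinks the image."""
    return max(1.0, target_dpi / current_dpi)


def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    `pixels` is a flat iterable of integer grey levels in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]            # pixels with level <= t
        sum_dark += t * hist[t]
        w_light = total - w_dark     # pixels with level > t
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(image, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]


def median_filter_3x3(image):
    """Remove isolated specks by replacing each pixel with its 3x3 median.

    Border pixels reuse the nearest in-bounds neighbours (edge replication).
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


# Example: a 200 DPI fax scan would need to be enlarged by this factor.
scale = upscale_factor(200)

# Example: binarize a tiny grey image, then clean up stray specks.
page = [[230, 225, 235], [220, 30, 232], [226, 35, 229]]
threshold = otsu_threshold([p for row in page for p in row])
clean = median_filter_3x3(binarize(page, threshold))
```

Note that a median filter can also erase very thin strokes, so it is worth checking the result on small print before applying it to every page.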
There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can try TEXTCLEANER from Fred's ImageMagick Scripts.
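To make the "blur and sharpen" remark concrete, here is a small sketch of an unsharp mask, one common sharpening technique: blur the image, then push each pixel away from its blurred value. This is an illustration only, not a description of what TEXTCLEANER does internally. The function names are invented for this example, and a 3x3 box blur stands in for the Gaussian blur mentioned in the question to keep the code short. Images are assumed to be lists of rows of grey levels from 0 to 255.

```python
def box_blur_3x3(image):
    """Average each pixel with its 8 neighbours (edges are replicated)."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ) / 9.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def unsharp_mask(image, amount=1.0):
    """Sharpen by adding back the difference between image and its blur.

    `amount` controls the strength; results are clamped to 0..255.
    """
    blurred = box_blur_3x3(image)
    return [
        [min(255, max(0, round(p + amount * (p - b)))) for p, b in zip(row, brow)]
        for row, brow in zip(image, blurred)
    ]


# Example: a soft vertical edge between dark text and light paper.
edge = [[50, 50, 50, 200, 200, 200] for _ in range(3)]
sharper = unsharp_mask(edge)
```

On the example edge, pixels next to the transition are pushed further apart while flat regions stay unchanged, which is the contrast boost the question asks about. Too large an `amount` will also amplify noise, so it usually pays to denoise first.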
If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.