Tesseract: ambiguity between spaces and tabs

Question

I have a TIFF file that contains text in columns separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. An example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images in the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

Recommended answer

Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J
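
If patching baseapi.cpp is not practical, one possible workaround is to rebuild the spacing after OCR from the word bounding boxes that Tesseract can report (for example in its hOCR or TSV output). Newer Tesseract releases also have a `preserve_interword_spaces` config variable that may be worth trying first. The snippet below is only a minimal sketch of the post-processing idea, not the change described in the linked posts: the `rebuild_line` helper, the `(text, left, width)` word format, and the `char_width` estimate are all hypothetical, and the coordinates are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left edge in pixels, width in pixels).
# In practice these values would come from OCR bounding-box output.
Word = Tuple[str, int, int]


def rebuild_line(words: List[Word], char_width: float) -> str:
    """Join the words of one text line, turning each horizontal gap
    into a run of spaces proportional to the gap width."""
    if char_width <= 0:
        raise ValueError("char_width must be positive")
    pieces = []
    prev_right = None
    # Process words left to right.
    for text, left, width in sorted(words, key=lambda w: w[1]):
        if prev_right is not None:
            gap_px = left - prev_right
            # Always keep at least one space between adjacent words.
            n_spaces = max(1, round(gap_px / char_width))
            pieces.append(" " * n_spaces)
        pieces.append(text)
        prev_right = left + width
    return "".join(pieces)


# Made-up coordinates: characters about 10 px wide, 40 px between columns.
words = [("col-a", 0, 50), ("col-b", 90, 50), ("col-c", 180, 50)]
line = rebuild_line(words, char_width=10.0)
print(line)
```

The `char_width` value is an assumption the caller has to supply. A rough estimate is the average of word width divided by word length over the line, which works best with monospaced or near-monospaced text.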

