Problem description
I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.
Following suggestions on other questions, I've tried playing with --psm and hocr, without great success.
Working with a JPG I've posted on GitHub, I converted it into a text-embedded PDF with this command:
tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf
I get this result:
Clearly the engine is making one block containing the indented lines, and another containing the flush lines.
Confirming this, here is the text output for the flush lines:
Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.
to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.
application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——
Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)
You can use --psm 4 --oem 1, or --psm 4 --oem 3, to get better text and accuracy.
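If page segmentation still bleeds into neighbouring columns, the cutting-up fallback from the question doesn't have to be manual work. Here is a rough Python sketch, not a tested recipe. It assumes the six columns are equal width (real scans will almost certainly need measured boundaries), that Pillow is installed for cropping, and that the tesseract binary is on PATH. The names column_boxes, tesseract_command and ocr_columns are made up for illustration; the flags are the ones already used in this thread.

```python
import subprocess


def column_boxes(width, height, n_columns=6):
    """Return (left, top, right, bottom) boxes for equal-width columns.

    ASSUMPTION: columns are equal width and span the full page height.
    The tuple layout matches what Pillow's Image.crop expects.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Integer edges; the last edge is exactly the page width.
    edges = [i * width // n_columns for i in range(n_columns + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_columns)]


def tesseract_command(image_path, out_base, psm=4, oem=1, langs="eng+fra"):
    """Build the tesseract argument list (plain-text output to out_base.txt)."""
    return ["tesseract", image_path, out_base,
            "-l", langs, "--psm", str(psm), "--oem", str(oem)]


def ocr_columns(image_path, n_columns=6):
    """Crop each column to its own PNG, OCR it, and return the texts in order."""
    from PIL import Image  # imported here so the helpers above work without Pillow

    texts = []
    with Image.open(image_path) as page:
        width, height = page.size
        for i, box in enumerate(column_boxes(width, height, n_columns)):
            base = f"column-{i + 1}"
            page.crop(box).save(base + ".png")
            subprocess.run(tesseract_command(base + ".png", base), check=True)
            with open(base + ".txt", encoding="utf-8") as fh:
                texts.append(fh.read())
    return texts
```

Calling ocr_columns("1906-07-02-p4.jpg") would write column-1.png through column-6.png next to the script and return the six column texts left to right. Since each crop holds only one column, the engine has no adjacent column to read into; whether --psm 4 or another mode works best on the crops is something to try on your own pages.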