Question
I am working on OCR using Tesseract. The application works and produces output. I am trying to extract data from an invoice, and the text is extracted, but the spacing between words in the output file has to be similar to the spacing in the input. I currently get each word together with its coordinates. I need to export the words to a text file, positioned according to those coordinates.
Code example:
using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default))
{
    engine.DefaultPageSegMode = PageSegMode.AutoOsd;
    // Load the Pix via a Bitmap, since Pix doesn't support loading from a stream.
    using (var image = new System.Drawing.Bitmap(imageFile.PostedFile.InputStream))
    {
        // Note: the resized bitmap is not used below; the original image is passed to ToPix.
        Bitmap bmp = Resize(image, 1920, 1080);
        using (var pix = PixConverter.ToPix(image))
        {
            using (var page = engine.Process(pix))
            {
                using (var iter = page.GetIterator())
                {
                    iter.Begin();
                    do
                    {
                        Rect symbolBounds;
                        string path = Server.MapPath("~/Output/data.txt");
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out symbolBounds))
                        {
                            // 'symbolBounds' holds the location of the word, 'curText' holds the word itself.
                            var curText = iter.GetText(PageIteratorLevel.Word);
                            //WriteToTextFile(curText, symbolBounds, path);
                            resultText.InnerText += curText;
                        }
                    } while (iter.Next(PageIteratorLevel.Word));
                }
                meanConfidenceLabel.InnerText = String.Format("{0:P}", page.GetMeanConfidence());
            }
        }
    }
}
Here is an example of input and output showing the wrong spacing.
Recommended answer
You can loop through the items found on the page using page.GetIterator(). For each item you can get a bounding box, which is a Tesseract.Rect (a rectangle struct) containing the X1, Y1, X2 and Y2 coordinates.
Tesseract.PageIteratorLevel myLevel = /*TODO*/;
using (var page = Engine.Process(img))
using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        if (iter.TryGetBoundingBox(myLevel, out var rect))
        {
            var curText = iter.GetText(myLevel);
            // Your code here: 'rect' should contain the location of the text, 'curText' contains the text itself.
        }
    } while (iter.Next(myLevel));
}
There is no clear-cut way to use the positions in the input to space the text in the output. You will have to write some custom logic for that.
You might be able to estimate the number of spaces needed to the left of your text with something like this:
// Cast to double so the division is not truncated when both operands are integers.
var padLeftSpaces = (int)Math.Round(((double)rect.X1 / inputWidth) * outputWidthSpaces);
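To show how that estimate could be turned into a full layout, here is a minimal sketch written as a standalone C# top-level program. It does not call Tesseract: the word list is hypothetical sample data standing in for the text and Tesseract.Rect values collected in the loop above, and LayoutLines, RenderLine, the 1920-pixel input width and the 96-character output width are all assumptions made for illustration. The sketch groups words into lines by their vertical position, then pads each word to a column proportional to its X1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical sample data: (text, X1, Y1, Y2) as they might be read from Tesseract.Rect.
var sampleWords = new List<(string Text, int X1, int Y1, int Y2)>
{
    ("Invoice", 100, 50, 80),
    ("No.", 400, 52, 80),
    ("12345", 520, 51, 80),
    ("Total", 100, 200, 230),
    ("99.00", 1500, 201, 230),
};

// Assumed sizes: input image 1920 px wide, output text 96 characters wide.
foreach (var outputLine in LayoutLines(sampleWords, inputWidth: 1920, outputWidthSpaces: 96))
    Console.WriteLine(outputLine);

// Groups words into text lines by vertical position, then renders each line.
static List<string> LayoutLines(
    IEnumerable<(string Text, int X1, int Y1, int Y2)> words,
    int inputWidth, int outputWidthSpaces)
{
    var lines = new List<List<(string Text, int X1, int Y1, int Y2)>>();
    int lineBottom = int.MinValue;

    foreach (var word in words.OrderBy(item => item.Y1))
    {
        // A word whose vertical centre lies below the current line starts a new line.
        int centerY = (word.Y1 + word.Y2) / 2;
        if (lines.Count == 0 || centerY > lineBottom)
        {
            lines.Add(new List<(string Text, int X1, int Y1, int Y2)>());
            lineBottom = word.Y2;
        }
        lines[lines.Count - 1].Add(word);
        lineBottom = Math.Max(lineBottom, word.Y2);
    }

    return lines
        .Select(lineWords => RenderLine(lineWords, inputWidth, outputWidthSpaces))
        .ToList();
}

// Places each word at the column given by the padLeftSpaces estimate.
static string RenderLine(
    IEnumerable<(string Text, int X1, int Y1, int Y2)> lineWords,
    int inputWidth, int outputWidthSpaces)
{
    var sb = new StringBuilder();
    foreach (var word in lineWords.OrderBy(item => item.X1))
    {
        int column = (int)Math.Round((double)word.X1 / inputWidth * outputWidthSpaces);
        // If earlier text already reaches this column, keep a single separating space.
        if (sb.Length > 0 && column <= sb.Length)
            column = sb.Length + 1;
        sb.Append(' ', column - sb.Length);
        sb.Append(word.Text);
    }
    return sb.ToString();
}
```

With these assumed sizes each output character covers 20 pixels, so "Invoice" starts at column 5 and "99.00" at column 75. The result only lines up in a monospaced font, and the line-grouping rule is a simple heuristic that may need tuning for skewed scans or mixed font sizes.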