Problem description
I am developing a C# application that converts a PDF document to an image and then renders that image in a custom viewer.
I've hit a brick wall trying to search for specific words in the generated image, and I'm wondering what the best way to go about this would be. Should I find the x,y location of the searched word?
Recommended answer
You can run tesseract in console mode to OCR the image and recognize its text.
I don't know of an SDK that does this for PDF.
But if you want the coordinates and text of every word, you can use the following fairly simple code of mine (thanks to nguyenq for the hOCR hint):
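using System.Collections.Generic;
using System.Diagnostics;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public void Recognize(Bitmap bitmap)
{
    bitmap.Save("temp.png", ImageFormat.Png);
    // Run tesseract with the hocr config so it emits word bounding boxes.
    var startInfo = new ProcessStartInfo("tesseract.exe", "temp.png temp hocr");
    startInfo.WindowStyle = ProcessWindowStyle.Hidden;
    using (var process = Process.Start(startInfo))
    {
        process.WaitForExit();
    }
    // Assumes the output file is temp.html and words use class "ocr_word";
    // other tesseract versions may name the file or the class differently.
    var words = GetWords(File.ReadAllText("temp.html"));
    // Further actions with words
}

public Dictionary<Rectangle, string> GetWords(string tesseractHtml)
{
    var xml = XDocument.Parse(tesseractHtml);
    var rectsWords = new Dictionary<Rectangle, string>();
    // Match on the local name so an XHTML default namespace does not hide the spans;
    // the string cast avoids a NullReferenceException on spans without a class.
    var ocrWords = xml.Descendants()
        .Where(element => element.Name.LocalName == "span"
                       && (string)element.Attribute("class") == "ocr_word")
        .ToList();
    foreach (var ocrWord in ocrWords)
    {
        // title looks like "bbox left top right bottom"; drop anything after a ';'.
        var strs = ocrWord.Attribute("title").Value.Split(';')[0].Trim().Split(' ');
        int left = int.Parse(strs[1]);
        int top = int.Parse(strs[2]);
        int width = int.Parse(strs[3]) - left + 1;
        int height = int.Parse(strs[4]) - top + 1;
        rectsWords.Add(new Rectangle(left, top, width, height), ocrWord.Value);
    }
    return rectsWords;
}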
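As for the x,y part of the question: once GetWords has returned the dictionary, locating a search term comes down to filtering it by text. Here is a minimal sketch, assuming you want a case-insensitive match that ignores punctuation stuck to the word; WordSearch and FindWord are hypothetical names for illustration, not part of tesseract or of the code above:

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

public static class WordSearch
{
    // Hypothetical helper: returns the bounding box of every OCR word equal to
    // searchTerm, ignoring case and leading/trailing non-alphanumeric characters.
    // Results are ordered top-to-bottom, then left-to-right.
    public static List<Rectangle> FindWord(
        IDictionary<Rectangle, string> rectsWords, string searchTerm)
    {
        return rectsWords
            .Where(pair => string.Equals(
                Normalize(pair.Value), searchTerm, StringComparison.OrdinalIgnoreCase))
            .Select(pair => pair.Key)
            .OrderBy(rect => rect.Top)
            .ThenBy(rect => rect.Left)
            .ToList();
    }

    // OCR words often carry adjacent punctuation, e.g. "word," or "(word)".
    private static string Normalize(string word)
    {
        int start = 0;
        int end = word.Length;
        while (start < end && !char.IsLetterOrDigit(word[start])) start++;
        while (end > start && !char.IsLetterOrDigit(word[end - 1])) end--;
        return word.Substring(start, end - start);
    }
}
```

Each returned Rectangle is in pixel coordinates of the image that was passed to tesseract, so a viewer that scales or scrolls the image would need to apply the same transform before drawing a highlight.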