I'm trying to create an index for a PDF file that I scanned as images from an old original manuscript and then put through character recognition in Adobe Acrobat Pro. The problem is that some of the words were spaced oddly, so the OCR output has flaws. I used the find and fix suspects tool, but there are still problems.
For instance, the text "FOR EXAMPLE" was spaced oddly in the original document (and therefore in its image), so Adobe reads it as three words, "FOR EX AMPLE". That produces an index entry for the word "ample", which would look perfectly valid if I did not know better. This is one of several similar problems I have identified in the document so far (there are still more pages to proofread).
How can I fix the underlying OCR text so that it contains the correct information both in the created index and when searching the document?
PS: I cannot just switch to a pure OCR text version of the document since the manuscript is technical and has lots of drawings associated with the text. I need to keep the images and alter the "hidden" searchable text underneath.
I found this answer suggesting ABBYY FineReader 14 (commercial; I am not affiliated). It looks like it will handle the editing, after which I presume your existing workflow would take care of the indexing. Here is another answer giving some more workflow details (albeit three years ago).
Separately, this question has answers suggesting Perl's CAM::PDF and pdftk.
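Whichever tool ends up editing the hidden text layer, the split words still have to be located first. As a rough proofreading aid, here is a minimal Python sketch that flags adjacent tokens whose concatenation is a known word, such as "EX" + "AMPLE". It assumes the OCR text has already been exported to a plain string by some other means; the function name `find_split_words` and the tiny vocabulary are hypothetical and not part of Acrobat, FineReader, CAM::PDF, or pdftk.

```python
import re


def find_split_words(text, vocabulary):
    """Return (left, right, joined) for each pair of adjacent tokens
    whose concatenation is a word in the vocabulary."""
    vocab = {word.lower() for word in vocabulary}
    # Keep only alphabetic runs; punctuation and digits act as separators.
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        joined = left + right
        if joined.lower() in vocab:
            hits.append((left, right, joined))
    return hits


# Hypothetical vocabulary; in practice this would be a full word list.
vocabulary = ["for", "example", "ample", "manuscript"]

# Hypothetical OCR output containing the split described in the question.
ocr_text = "FOR EX AMPLE the manuscript shows"

suspects = find_split_words(ocr_text, vocabulary)
for left, right, joined in suspects:
    print(f"'{left} {right}' may be a split of '{joined}'")
```

This only produces candidates for manual review. It cannot tell a genuine two-word phrase from a split word, and it does not modify the PDF itself.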