Problem description
// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. What matters most to me is that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem possible with Adobe Acrobat...).
To that end I've been looking at different ways to OCR my PDF files. I found three candidates that seem to do what they should (more or less). All three have their pros and cons, but they also seem to take different approaches to storing the data in the PDF file. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden.
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suited to the default Adobe preflight script, as it does not display any additional layers.
But as far as I was able to tell, these files seem to have a background text layer containing the OCRed text, which lies beneath the image that is ultimately shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part.
But in all three cases I'm able to search for words in the files, and I can see the text by using "Remove hidden information" and selecting "hidden text".
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Recommended answer
You have correctly found that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
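- ABBYY creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
- Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which paints nothing).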
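To make the two layouts concrete, here is a rough Python sketch. The two content streams are hand-written illustrations, not extracted from the sample files, and the resource names `/F1` and `/Im0` as well as the `classify` helper are assumptions of mine. The check is a simple heuristic on an already decoded content stream, not a real PDF parser.

```python
import re

# Hand-written, simplified content streams for illustration only.
# /F1 (font) and /Im0 (image XObject) are hypothetical resource names.
TEXT_THEN_IMAGE = """\
BT /F1 12 Tf 72 700 Td (Hello) Tj ET
q 595 0 0 842 0 0 cm /Im0 Do Q
"""

IMAGE_THEN_INVISIBLE_TEXT = """\
q 595 0 0 842 0 0 cm /Im0 Do Q
BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET
"""


def classify(stream: str) -> str:
    """Heuristically name the hidden-text layout of a decoded content stream."""
    # "3 Tr" selects text rendering mode 3: neither fill nor stroke.
    if re.search(r"\b3\s+Tr\b", stream):
        return "invisible text (render mode 3)"
    # Otherwise look for a text object that starts before an image is painted.
    text_pos = stream.find("BT")
    image = re.search(r"/\w+\s+Do\b", stream)
    if text_pos != -1 and image and text_pos < image.start():
        return "visible text covered by image"
    return "unknown"


if __name__ == "__main__":
    print(classify(TEXT_THEN_IMAGE))
    print(classify(IMAGE_THEN_INVISIBLE_TEXT))
```

On a real file you would first have to decompress the page's content stream before a check like this makes sense, and a robust tool would tokenize the stream instead of using regular expressions.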
The difference between the latter two results is the choice of font:
- Acrobat uses regular standard 14 fonts, for which a PDF viewer has a font program that renders them as normal glyphs.
- Tesseract uses a font named GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs of this font do not show as normal Latin glyphs but merely as empty space.
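As a small illustration of that distinction, the sketch below decides where a viewer would get the font program from. The `font_source` helper and its two inputs (the `BaseFont` name and a flag saying whether a font program is embedded) are my own simplification; in a real file you would read both from the font dictionary and its font descriptor.

```python
# The 14 standard PDF font names a viewer is expected to provide itself.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


def font_source(base_font: str, has_embedded_program: bool) -> str:
    """Say where the glyph outlines come from (hypothetical helper)."""
    name = base_font.lstrip("/")  # accept "/Helvetica" as well as "Helvetica"
    if has_embedded_program:
        return "embedded"
    if name in STANDARD_14:
        return "viewer-supplied (standard 14)"
    return "viewer substitution"


if __name__ == "__main__":
    print(font_source("/Helvetica", False))
    print(font_source("/GlyphLessFont", True))
```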
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste. They are used only in the invisible rendering mode anyway.