Problem Description
I am writing a Master's thesis NLP system. I have one component, an extractor.
It extracts plain text from PDF files. There are a few PDF files that cannot be extracted correctly: the extractor (the PDFBox library) returns garbage instead, i.e. strings of semantically meaningless characters, or strings of digits and letters.
I checked each file that causes this extraction problem, and the text of all these files also cannot be copy-pasted from PDF readers (Adobe Reader and FoxIt Reader). Viewing them in these readers works, but after selecting their content and copying it to the clipboard I get the same wrong text as described above.
Can anyone help me?
Recommended Answer
Very often in such cases, where you cannot select and copy-paste text from the Acrobat (Reader) window, there is another option which may still work:
- Open the 'File' menu,
- select 'Save as...',
- select 'Text (normal) (*.txt)',
- browse to the target directory,
- type the name you want to use for the text file.
You'll have all the text from all pages of the file and will need to locate the spot you originally wanted to copy, so it is not as convenient as direct copy-paste. But it works more reliably.
It also works with acroread on Linux (but you have to choose 'Save as text...' from the File menu).
You can use the pdffonts command-line utility to get a quick analysis of the fonts used by a PDF.
Here is an example output which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide sample PDF files which are well commented and can easily be opened in a text editor:
$ pdffonts textextract-bad2.pdf
name type encoding emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
How should this table be interpreted?
- The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column): Helvetica and Helvetica-Bold.
- Both fonts are of type TrueType.
- Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn). However, only for font /Helvetica is a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no entries in the uni column.
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
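For illustration, a /ToUnicode table is a small CMap stream embedded in the PDF. A minimal sketch of its core section might look like the following (the glyph codes and mappings here are made-up examples, not taken from the file above); each bfchar line maps one glyph code to a Unicode code point:

```
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0048>   % glyph code 0001 maps to U+0048 ('H')
<0002> <0069>   % glyph code 0002 maps to U+0069 ('i')
endbfchar
endcmap
```

A text extractor walks this table backwards: it sees glyph code 0001 in the content stream and looks up U+0048 to emit 'H'.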
A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copy-pasted from the PDF. (Even if a /ToUnicode table is present, text extraction may still pose a problem, because the table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as also demonstrated by a few companion files in the GitHub repository linked above.)
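The effect of a missing /ToUnicode table can be sketched with a toy example in plain Java (no PDF library involved; the glyph codes and the mapping are invented for illustration): an extractor can only recover readable text where a code-to-Unicode mapping exists; otherwise it can do no better than emit the raw glyph codes, which is exactly the kind of garbage described in the question.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeDemo {
    // Simulated /ToUnicode CMap: glyph code -> Unicode string
    // (like the bfchar entries of a real CMap; values are made up).
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(0x01, "H");
        TO_UNICODE.put(0x02, "i");
    }

    // With a ToUnicode table the extractor can recover real text;
    // without one it falls back to the raw glyph codes.
    static String extract(int[] codes, boolean hasToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            if (hasToUnicode && TO_UNICODE.containsKey(code)) {
                sb.append(TO_UNICODE.get(code));
            } else {
                sb.append((char) code); // raw code, usually not a meaningful character
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x01, 0x02};
        System.out.println(extract(codes, true));  // prints "Hi"
        System.out.println(extract(codes, false)); // prints two unprintable control characters
    }
}
```

This is also why the copy-paste from Adobe Reader fails for the same files: the reader has the glyphs to draw on screen, but without the reverse mapping it cannot reconstruct the underlying characters for the clipboard.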