Problem description
I am converting .pdf files into .xml files using PDFMiner.
For each character in the .pdf file, PDFMiner checks whether it can be mapped to Unicode (among many other things). If it can, it returns the character; if it cannot, it raises an exception and returns the string "(cid:%d)", where %d is the character ID, which I think is the Unicode decimal code point.
This is well explained in the edit part of this question: "What is this (cid:51) in the output of pdf2txt?". I report the code here for convenience:
    def render_char(self, matrix, font, fontsize, scaling, rise, cid):
        try:
            text = font.to_unichr(cid)
            assert isinstance(text, unicode), text
        except PDFUnicodeNotDefined:
            text = self.handle_undefined_char(font, cid)

    def handle_undefined_char(self, font, cid):
        if self.debug:
            print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
        return '(cid:%d)' % cid
I usually get this exception for .pdf files written in Cyrillic. However, there is one file in plain English where I get this exception for non-breaking spaces (which have cid=160). I do not understand why this character is not recognised as Unicode while all the others in the same file are.
If, in the same environment, I run isinstance(u'160', unicode) in the console, I get True, while an (apparently) equivalent command returns False when it is run inside PDFMiner.
If I debug, I see that the font is properly recognised, i.e. I get:
cid = 160
font = <PDFType1Font: basefont='Helvetica'>
PDFMiner accepts the codec as a parameter. I have chosen utf-8, in which 160 is the Unicode decimal value for the non-breaking space (http://dev.networkerror.org/utf8/).
In case it helps, here is the code for to_unichr:
    def to_unichr(self, cid):
        if self.unicode_map:
            try:
                return self.unicode_map.get_unichr(cid)
            except KeyError:
                pass
        try:
            return self.cid2unicode[cid]
        except KeyError:
            raise PDFUnicodeNotDefined(None, cid)
Is there a way to set or change the character map recognised by the code?
What do you think I should change, or where should I investigate, so that cid=160 does not raise the exception?
Recommended answer
The font in question in the sample document is a simple font and uses WinAnsiEncoding. This encoding is defined in the PDF specification ISO 32000-1 as one of four special encodings in a table in Annex D.2, "Latin Character Set and Encodings". This table does not contain an entry for 240 in the WIN column (240 is the octal form of decimal 160; the table entries are given as octal numbers).
This table is extracted as the ENCODING array in latin_enc.py. From this array, maps for those four encodings are generated in encodingdb.py, and these maps are then used for fonts with that encoding, cf. PDFSimpleFont in pdffont.py.
Thus, PdfMiner does not recognize the code 160 as having any associated character in WinAnsiEncoding. This causes your problem.
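The failure can be reproduced in miniature. The following is only a sketch, not PdfMiner's actual code: ENCODING_EXCERPT is a hypothetical three-row excerpt in the (name, std, mac, win, pdf) layout with decimal codes, NAME_TO_CHAR is a made-up stand-in for the glyph name lookup, and UnicodeNotDefined stands in for PDFUnicodeNotDefined. It is written for Python 3, unlike the Python 2 code quoted in the question.

```python
# Hypothetical excerpt in the (name, std, mac, win, pdf) layout, decimal codes.
# The real table in latin_enc.py is much longer.
ENCODING_EXCERPT = [
    ("A", 65, 65, 65, 65),
    ("space", 32, 32, 32, 32),
    ("a", 97, 97, 97, 97),
]

# Made-up glyph name lookup, for illustration only.
NAME_TO_CHAR = {"A": "A", "space": " ", "a": "a"}

WIN_COLUMN = 3  # position of the WinAnsiEncoding code in each row


class UnicodeNotDefined(Exception):
    """Stand-in for PDFUnicodeNotDefined."""


def build_code_map(table, column):
    """Map the codes found in one encoding column to characters."""
    return {
        row[column]: NAME_TO_CHAR[row[0]]
        for row in table
        if row[column] is not None
    }


def to_unichr(code_map, cid):
    """Look up a code; a missing code becomes UnicodeNotDefined."""
    try:
        return code_map[cid]
    except KeyError:
        raise UnicodeNotDefined(cid)


def render(code_map, cid):
    """Return the character, or the '(cid:%d)' placeholder if unmapped."""
    try:
        return to_unichr(code_map, cid)
    except UnicodeNotDefined:
        return "(cid:%d)" % cid


win_map = build_code_map(ENCODING_EXCERPT, WIN_COLUMN)
print(render(win_map, 65))   # mapped code
print(render(win_map, 160))  # unmapped code, falls back to the placeholder
```

The point of the sketch is that the placeholder says nothing about Unicode itself: it only means that the code has no row in the column for the font's encoding.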
Looking only at the table, that seems correct, but if one reads the notes below the table, one finds:
- The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE.
This seems to have been overlooked by PdfMiner development.
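Because the specification gives these codes in octal while the question talks about decimal 160, it is worth checking the conversion. A minimal Python 3 check, using only built-ins (the variable names are just for illustration):

```python
# Octal codes as printed in the note of the specification.
WIN_NBSP_OCTAL = "240"
MAC_NBSP_OCTAL = "312"

# int(text, 8) parses a string as a base-8 number.
win_code = int(WIN_NBSP_OCTAL, 8)
mac_code = int(MAC_NBSP_OCTAL, 8)

print(win_code, mac_code)  # 160 202

# Decimal 160 is also the Unicode code point of NO-BREAK SPACE (U+00A0).
print(chr(win_code) == "\u00a0")  # True
```

So the code 240 in the note and the cid=160 in the question refer to the same byte value.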
You might fix this by adding an entry for nbspace

    ('nbspace', None, 202, 160, None)

to the ENCODING array (which uses decimal numbers); if you prefer, you might want to use space instead of nbspace.
(I say "might" because I'm not into Python programming and therefore cannot check this, in particular not for unwanted side effects.)
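The effect of such an entry can be illustrated on a toy table. This is a sketch under assumptions, not a tested patch for PdfMiner: ENCODING_EXCERPT, NAME_TO_CHAR and column_map are hypothetical stand-ins for the real array, the glyph name lookup and the map generation, and the code is written for Python 3.

```python
# Toy table in the (name, std, mac, win, pdf) layout with decimal codes.
ENCODING_EXCERPT = [("space", 32, 32, 32, 32)]

# Proposed extra row: no code in the std and pdf columns,
# 202 in MacRomanEncoding and 160 in WinAnsiEncoding.
NBSPACE_ENTRY = ("nbspace", None, 202, 160, None)

# Made-up glyph name lookup, for illustration only.
NAME_TO_CHAR = {"space": " ", "nbspace": "\u00a0"}

STD, MAC, WIN = 1, 2, 3  # column positions within a row


def column_map(table, column):
    """Map the codes of one encoding column to characters, skipping None."""
    return {
        row[column]: NAME_TO_CHAR[row[0]]
        for row in table
        if row[column] is not None
    }


patched = ENCODING_EXCERPT + [NBSPACE_ENTRY]

win_map = column_map(patched, WIN)
mac_map = column_map(patched, MAC)
std_map = column_map(patched, STD)

print(repr(win_map[160]))  # the non-breaking space character
print(160 in std_map)      # the std column is left untouched

# Variant: reuse the name 'space', so that 160 yields an ordinary space.
as_space = ENCODING_EXCERPT + [("space", None, 202, 160, None)]
print(repr(column_map(as_space, WIN)[160]))
```

Whether to map the code to a real non-breaking space or to an ordinary space depends on what the downstream XML processing expects; the second variant loses the distinction but avoids a character that some tools treat specially.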