Problem description
I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. The odd thing is that the characters appear to be there and correct once I add 29 to the ord of each character.
Example response debug printout:
/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT
The code uses gzuncompress on the stream section of the PDF. The string $PRXQW is "Amount": adding 29 (decimal) to the ord of each character gives me this. But sometimes a character does not follow this exact translation; for example, what should be a ) in the text appears as the two bytes 5C 66.
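The observation can be checked with a short sketch. It is written in Python purely for illustration, and the helper names are hypothetical. It first resolves the backslash escapes of a PDF literal string and then adds 29 to each byte. Note that the bytes 5C 66 are `\f`, the literal-string escape for form feed (byte 12), and 12 + 29 = 41 is `)`, so the two-byte case appears to follow the same shift once the escape is resolved. The offset 29 is only what this particular file seems to use, not a general rule.

```python
# Hypothetical helpers; the offset 29 is specific to this one file's font.

# Single-character escapes defined for PDF literal strings.
_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def unescape_pdf_literal(raw: bytes) -> bytes:
    """Resolve backslash escapes in a PDF literal string (outer parentheses removed)."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b != 0x5C:                      # not a backslash: copy as is
            out.append(b)
            i += 1
            continue
        i += 1
        if i >= len(raw):                  # trailing backslash: ignore it
            break
        c = raw[i]
        if c in _ESCAPES:
            out.append(_ESCAPES[c])
            i += 1
        elif 0x30 <= c <= 0x37:            # octal escape, one to three digits
            j = i
            value = 0
            while j < len(raw) and j < i + 3 and 0x30 <= raw[j] <= 0x37:
                value = value * 8 + (raw[j] - 0x30)
                j += 1
            out.append(value & 0xFF)
            i = j
        elif c in (0x0A, 0x0D):            # backslash + end-of-line: line continuation
            i += 1
            if c == 0x0D and i < len(raw) and raw[i] == 0x0A:
                i += 1
        else:                              # unknown escape: the backslash is dropped
            out.append(c)
            i += 1
    return bytes(out)


def shift_decode(raw: bytes, offset: int = 29) -> str:
    """Add a fixed offset to every character code (the workaround described above)."""
    return "".join(chr(b + offset) for b in unescape_pdf_literal(raw))


print(shift_decode(b"$PRXQW"))   # Amount
print(shift_decode(b"\\f"))      # )
```

A fixed shift like this only works as long as the font's encoding happens to be a constant offset from ASCII.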
I am just wondering about this decoder-ring style of character encoding now coming out of PDFs. Has anyone seen this kind of thing?
Recommended answer
The encoding of the string argument of the Tj operation depends entirely on the PDF font used (F1 in the case at hand):
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(ISO 32000-1)
The OP's code seems to assume a standard encoding like MacRomanEncoding or WinAnsiEncoding, but these are merely special cases. As indicated in the quote above, the encoding might just as well be some ad-hoc mixed multibyte encoding.
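The difference between the two font kinds can be sketched as follows. This is an illustrative Python fragment, not the OP's code; the function name and the fixed `code_length` parameter are assumptions. A real composite font takes its code lengths from the CMap's codespace ranges, which may mix lengths; a fixed two-byte length is merely what a CMap such as Identity-H uses.

```python
from typing import List


def split_character_codes(data: bytes, code_length: int = 1) -> List[int]:
    """Split string bytes into character codes of one fixed byte length.

    code_length=1 models a simple font; code_length=2 models a composite
    font whose CMap uses two-byte codes only (for example Identity-H).
    """
    if code_length < 1:
        raise ValueError("code_length must be at least 1")
    codes = []
    for start in range(0, len(data), code_length):
        chunk = data[start:start + code_length]
        codes.append(int.from_bytes(chunk, "big"))  # high-order byte first
    return codes


print(split_character_codes(b"$PRXQW"))               # [36, 80, 82, 88, 81, 87]
print(split_character_codes(b"\x00\x24\x00\x50", 2))  # [36, 80]
```

Either way the result is only a list of character codes; turning those codes into text still requires the font's encoding or CMap.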
A later section of the PDF specification describes how to properly extract text:
- If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
- If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
  a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
  b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
- If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
  a) Map the character code to a character identifier (CID) according to the font’s CMap.
  b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
  c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
  d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
  e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
(ISO 32000-1)
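The first method in the quoted list, the ToUnicode CMap, can be sketched roughly as below. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the CMap stream is assumed to be already decompressed and well formed (destinations given as UTF-16BE hex), and the sample CMap is invented so that it reproduces the "Amount" example; it is not taken from the OP's PDF.

```python
import re
from typing import Dict


def _code(token: str) -> int:
    """Turn a hex token such as '<0C>' into an integer character code."""
    return int(token[1:-1], 16)


def _utf16be(hex_digits: str) -> str:
    """Decode hex digits holding UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text: str) -> Dict[int, str]:
    """Build a code -> Unicode string table from bfchar and bfrange sections."""
    table: Dict[int, str] = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            table[int(src, 16)] = _utf16be(dst)
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>|\[|\]", section)
        i = 0
        while i + 2 < len(tokens):
            lo, hi = _code(tokens[i]), _code(tokens[i + 1])
            i += 2
            if tokens[i] == "[":           # array form: one destination per code
                i += 1
                code = lo
                while i < len(tokens) and tokens[i] != "]":
                    table[code] = _utf16be(tokens[i][1:-1])
                    code += 1
                    i += 1
                i += 1                     # skip the closing bracket
            else:                          # start form: last code unit is incremented
                first = _utf16be(tokens[i][1:-1])
                i += 1
                for offset in range(hi - lo + 1):
                    table[lo + offset] = first[:-1] + chr(ord(first[-1]) + offset)
    return table


def decode_simple(data: bytes, table: Dict[int, str]) -> str:
    """Decode one-byte codes; unmapped codes become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in data)


# Invented sample CMap, chosen to match the question's example.
SAMPLE = """
2 beginbfchar
<24> <0041>
<0C> <0029>
endbfchar
1 beginbfrange
<50> <58> <006D>
endbfrange
"""

table = parse_to_unicode(SAMPLE)
print(decode_simple(b"$PRXQW", table))  # Amount
print(decode_simple(b"\x0c", table))    # )
```

With a table like this, the constant offset of 29 is no longer hard-coded; it simply falls out of whatever mapping the font declares.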
Thus:
Yes, it is fairly common for PDFs in the wild to have text drawing operator string arguments in an encoding entirely different from anything ASCII-like. And as the last paragraph of the second quote above hints, there are situations in which text extraction is not possible at all (without OCR, that is), even though there are additional places one can look for the mapping to Unicode.
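The lookup order from the second quote, including the case where nothing can be determined, might be sketched like this for a simple font. It is a Python illustration with hypothetical names: the glyph list is a tiny excerpt, the Differences array is invented to match the "Amount" example, and the base encoding from Table D.1 is left out for brevity.

```python
from typing import Dict, List, Optional, Union

# Tiny excerpt of the Adobe Glyph List (glyph name -> Unicode); the real list is far longer.
AGL_EXCERPT: Dict[str, str] = {
    "A": "A", "m": "m", "n": "n", "o": "o", "t": "t", "u": "u",
    "parenright": ")", "space": " ",
}


def names_from_differences(differences: List[Union[int, str]]) -> Dict[int, str]:
    """Expand a Differences array such as [36, 'A', 80, 'm', 'n'] into code -> glyph name."""
    names: Dict[int, str] = {}
    code: Optional[int] = None
    for item in differences:
        if isinstance(item, int):
            code = item              # an integer sets the code for the next name
        elif code is not None:
            names[code] = item       # each following name takes the next code
            code += 1
    return names


def code_to_unicode(code: int,
                    to_unicode: Dict[int, str],
                    differences: List[Union[int, str]]) -> Optional[str]:
    """Try ToUnicode first, then glyph names; return None when nothing applies."""
    if code in to_unicode:
        return to_unicode[code]
    name = names_from_differences(differences).get(code)
    if name is not None:
        return AGL_EXCERPT.get(name)
    return None                      # no way to tell what the code represents


# Invented Differences array; no ToUnicode table is available in this example.
differences = [12, "parenright", 36, "A", 80, "m", "n", "o", 87, "t", "u"]
text = "".join(code_to_unicode(b, {}, differences) or "\ufffd" for b in b"$PRXQW")
print(text)                                   # Amount
print(code_to_unicode(200, {}, differences))  # None
```

A `None` result corresponds to the situation described above, where only OCR or a guess remains.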