PDFBox 2.0: Overcoming dictionary key encoding


Question

I am using Apache PDFBox 2.0.1 to extract text from PDF forms, including the details of the AcroForm fields. For a radio button field I dig up the appearance dictionary. I'm interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive BeanShell):

field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
  ap = annot.getAppearance();
  keys = ap.getCOSObject().getDictionaryObject("N").keySet();
  keyList = new ArrayList(keys.size());
  for (cosKey : keys) {keyList.add(cosKey.getName());}
  print(String.join("|", keyList));
}

The output is:

Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off

The question mark blotches should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded in ISO-8859-1, while PDFBox apparently assumes they are Unicode.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

This sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf

Answer

"Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?"


Changing the assumed encoding


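PDFBox's interpretation of the encoding of the bytes in names (only names can be used as dictionary keys in PDFs) takes place in BaseParser.parseCOSName() when the name is read from the source PDF: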

/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}

As you can see, after reading the name bytes and resolving the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8. To change this, you therefore have to patch this PDFBox class and replace the charset named at the bottom.
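To make the effect of that final charset choice concrete, here is a minimal standalone sketch. It does not use PDFBox; decodeName is a hypothetical helper that only mimics the escape handling described above, and the name "R#E5cksta" is an assumed spelling of one of the keys (the sample file may store the byte raw rather than as a # escape).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NameDecodingSketch {

    // Hypothetical helper, not part of PDFBox: expands #xx escapes in the raw
    // bytes of a name (without the leading '/') and decodes the result with
    // the charset passed in by the caller.
    public static String decodeName(byte[] raw, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length; i++) {
            int ch = raw[i] & 0xFF;
            if (ch == '#' && i + 2 < raw.length) {
                int hi = Character.digit(raw[i + 1] & 0xFF, 16);
                int lo = Character.digit(raw[i + 2] & 0xFF, 16);
                if (hi >= 0 && lo >= 0) {
                    buffer.write(hi * 16 + lo); // one escaped byte
                    i += 2;                     // skip the two hex digits
                    continue;
                }
            }
            buffer.write(ch); // ordinary byte, or a '#' without two hex digits
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Assumed example: 0xE5 is 'å' in ISO-8859-1 but malformed UTF-8 here.
        byte[] raw = "R#E5cksta".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeName(raw, StandardCharsets.UTF_8));
        System.out.println(decodeName(raw, StandardCharsets.ISO_8859_1));
    }
}
```

With UTF-8 the lone 0xE5 byte cannot be decoded and becomes the replacement character, which matches the output shown in the question; with ISO-8859-1 the same byte decodes to "å".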

According to the specification (section 7.3.5, Name Objects), when a name object is treated as text, its bytes should be interpreted as UTF-8.

BaseParser.parseCOSName() implements just that.

PDFBox's implementation is not completely correct, though, because the very act of interpreting a name as a string without need is already wrong.

Thus, PDF libraries should handle names as byte arrays for as long as possible and derive a string representation only when one is explicitly required; only then should the recommendation above (to assume UTF-8) play a role. The specification even indicates where this may cause trouble.

Another situation becomes apparent in the document at hand: if the sequence of bytes does not constitute valid UTF-8, it is still a valid name. But such names are changed by the method above; any unparsable byte or subsequence is replaced by the Unicode replacement character '�' (U+FFFD). Thus, different names may collapse into a single one.
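The collapse can be reproduced with plain Java. The following sketch uses two made-up names that differ only in one ISO-8859-1 byte; RawName is a hypothetical byte-based key class, not something PDFBox provides, and only illustrates the idea of comparing names as bytes and producing text on request.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NameCollapseDemo {

    // Hypothetical byte-based name key (not a PDFBox class): equality is
    // defined on the raw bytes, text is produced only when asked for.
    public static final class RawName {
        private final byte[] bytes;

        public RawName(byte[] bytes) {
            this.bytes = bytes.clone();
        }

        public String toText(Charset charset) {
            return new String(bytes, charset);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof RawName
                    && Arrays.equals(bytes, ((RawName) other).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    public static void main(String[] args) {
        // Made-up names that differ only in one ISO-8859-1 byte.
        byte[] a = {'R', (byte) 0xE4, 'c', 'k'}; // 0xE4 is 'ä' in ISO-8859-1
        byte[] b = {'R', (byte) 0xE5, 'c', 'k'}; // 0xE5 is 'å' in ISO-8859-1

        // Decoded as UTF-8, both invalid bytes turn into U+FFFD.
        String sa = new String(a, StandardCharsets.UTF_8);
        String sb = new String(b, StandardCharsets.UTF_8);
        System.out.println("strings equal: " + sa.equals(sb));

        // Compared as bytes, the names stay distinct and remain decodable.
        System.out.println("raw names equal: " + new RawName(a).equals(new RawName(b)));
        System.out.println(new RawName(b).toText(StandardCharsets.ISO_8859_1));
    }
}
```

Once the bytes have been turned into a string this way, the original byte is gone, so re-encoding the string afterwards cannot bring back the ISO-8859-1 character.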

Another issue is that when writing a PDF back, PDFBox does not act symmetrically: it encodes the String representation of the name (which was obtained by a UTF-8 interpretation if the name was read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream):

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;

        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '@' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}

Thus, any interesting Unicode character is replaced with the US_ASCII default replacement character, which I assume to be '?'.
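That assumption can be checked with a small standalone sketch. escapeLikeWritePdf is a hypothetical helper that mirrors the escaping rule of the writePDF method quoted above in simplified form and returns a String instead of writing to a stream; the sample names are taken from the question.

```java
import java.nio.charset.StandardCharsets;

public class AsciiWriteDemo {

    // Hypothetical helper mirroring the quoted writePDF logic: encode the
    // name as US_ASCII, then escape every byte outside the allowed set.
    public static String escapeLikeWritePdf(String name) {
        StringBuilder out = new StringBuilder("/");
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            boolean plain = current >= 'A' && current <= 'Z'
                    || current >= 'a' && current <= 'z'
                    || current >= '0' && current <= '9'
                    || "+-_@*$;.".indexOf(current) >= 0;
            if (plain) {
                out.append((char) current);
            } else {
                out.append(String.format("#%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'ä' (U+00E4) cannot be mapped to US_ASCII.
        System.out.println(escapeLikeWritePdf("Nyn\u00E4shamn"));
        // Neither can the replacement character left behind by UTF-8 decoding.
        System.out.println(escapeLikeWritePdf("R\uFFFDcksta"));
        System.out.println(escapeLikeWritePdf("Off"));
    }
}
```

Java's String.getBytes(Charset) substitutes the charset's replacement byte for unmappable characters, which for US_ASCII is '?' (0x3F). Since '?' is not in the allowed set, it is then written as the escape #3F.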

So it is quite fortunate that PDF names most often contain only ASCII characters... ;)

According to the implementation notes in the PDF 1.4 reference, the sample document at hand seems to follow conventions from Acrobat 4, i.e. from the last century.

