PDFBox outputs question marks instead of some Japanese characters

Question

From almost all PDF files written in Japanese, I get correct text with Apache Tika (1.7) and Apache PDFBox (1.8.8). Now I have trouble with one PDF file, which I cannot upload here for business reasons.

All Japanese characters in one paragraph become "?", but in other paragraphs the Japanese characters are correct. ASCII characters are correct in every case.

In Adobe Acrobat on my Windows 7 desktop, all Japanese characters in the PDF document appear to be correct. According to the Adobe Acrobat properties dialog, the PDF document contains several Japanese fonts. I don't know who made this file or how.

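  • MS-Mincho / Type: TrueType (CID) <- several of these
  • HeiseiMin-W3 / Type: Type 1 (CID) / Encoding: UniJIS-UCS2-HW-H / Actual font: KozMinPr6N-Regular / Actual font type: Type 1 (CID)
  • MSMincho / Type: TrueType (CID) / Encoding: UniJIS-UCS2-H / Actual font: MS明朝 / Actual font type: TrueType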

PDF Converter: Acrobat Distiller 7.0 (Windows); PDF Version: 1.6 (Acrobat 7.x)

The "?" characters are produced in PDFStreamEngine (line 492) because of a lookup failure in PDType0Font (line 202). The cmapName of the cmap (a field of the PDFont class) in this situation is "UniJIS-UCS2-HW-H". Looking carefully at the CMap implementation, the isInCodeSpaceRanges method returns true, as it should. In the end, lookupCID fails because char2CIDMappings has no entry and range.map fails in CMap (around line 174). The array argument holds values such as [48, -120, 48, -118, ...] (signed bytes), which look like correct Unicode code points to me.

Is there any workaround? Thanks.
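The byte values quoted in the question can be checked outside PDFBox. The sketch below assumes the codes are big-endian two-byte values, which is what a UCS-2 CMap such as UniJIS-UCS2-HW-H uses; the class and method names are hypothetical and are not part of PDFBox or Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: decodes raw two-byte codes as UTF-16BE.
class Ucs2Decode {
    // Interpret each pair of bytes as one big-endian UCS-2/UTF-16 code unit.
    static String decode(byte[] code) {
        return new String(code, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Values quoted in the question (Java bytes are signed, hence the negatives).
        byte[] code = {48, -120, 48, -118};
        String text = decode(code);
        for (int i = 0; i < text.length(); i++) {
            // Prints U+3088 and U+308A, the hiragana "yo" and "ri".
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }
    }
}
```

Both values fall in the hiragana block, which supports the observation that the input bytes are valid and that the failure happens in the code-to-CID lookup rather than in the data itself.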

Recommended answer

I worked around font issues (Chinese, Japanese, Korean, and any others) in PDFBox by rendering the text to an image, like this:

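// Uses the PDFBox 2.x API (PDImageXObject); place the method inside a class of your own.
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Draws one line of text as a PNG image at (x, y) on the page, scaled to width x height.
void writeLine(String text, int x, int y, int width, int height,
               Font font, Color color, PDPageContentStream contentStream, PDDocument document) throws IOException {

    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        // Render at twice the target size so the text stays sharp when scaled down.
        int scale = 2;
        BufferedImage img = new BufferedImage(width * scale, height * scale, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g2d = img.createGraphics();
        g2d.setRenderingHint(RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING, RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_DITHERING, RenderingHints.VALUE_DITHER_ENABLE);
        g2d.setRenderingHint(RenderingHints.KEY_FRACTIONALMETRICS, RenderingHints.VALUE_FRACTIONALMETRICS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g2d.setRenderingHint(RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_SPEED);
        g2d.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL, RenderingHints.VALUE_STROKE_PURE);
        g2d.setFont(font);
        g2d.setColor(color);
        g2d.scale(scale, scale);
        // The baseline sits one ascent below the top edge of the image.
        g2d.drawString(text, 0, g2d.getFontMetrics().getAscent());
        g2d.dispose();

        // Encode the rendered line as PNG in memory.
        ImageIO.write(img, "png", baos);
        baos.flush();

        // Embed the PNG in the document and draw it at the requested position and size.
        contentStream.drawImage(PDImageXObject.createFromByteArray(
                document, baos.toByteArray(), ""), x, y, width, height);
    }
}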

