Getting the text colour with PDFBox


Question

I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text I am extracting, but I cannot find any way of getting that information.

Is it possible to use PDFBox to get the colour information of a document and, if so, how would I go about it?

Many thanks.

Recommended answer

All color information should be stored in the class PDGraphicsState, and which color is used (stroking, non-stroking, etc.) depends on the text rendering mode in effect (via the PDFBox mailing list).
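To make the link between text rendering mode and color concrete, here is a minimal sketch in plain Java (no PDFBox dependency). The mode numbers 0 to 7 follow the PDF specification's text rendering modes; the class and method names (TextColourUse, paintsFor) are hypothetical and exist only for this illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/** Which colours a PDF text rendering mode (operator Tr) paints glyphs with. */
final class TextColourUse {
    enum Paint { NON_STROKING, STROKING }

    /** Returns the colours used for text rendering modes 0-7. */
    static Set<Paint> paintsFor(int renderingMode) {
        switch (renderingMode) {
            case 0: case 4: // fill (4 also adds to the clipping path)
                return EnumSet.of(Paint.NON_STROKING);
            case 1: case 5: // stroke (5 also adds to the clipping path)
                return EnumSet.of(Paint.STROKING);
            case 2: case 6: // fill, then stroke
                return EnumSet.allOf(Paint.class);
            case 3: case 7: // invisible, or clipping only: nothing is painted
                return EnumSet.noneOf(Paint.class);
            default:
                throw new IllegalArgumentException("unknown mode " + renderingMode);
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println(mode + " -> " + paintsFor(mode));
        }
    }
}
```

Mode 0 (fill) is the default, so ordinary text is usually painted with the non-stroking color; it may be worth checking both colors in the graphics state.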

Here is a small sample I tried:

After creating a PDF with just one line ("Sample" written in RGB = [146, 208, 80]), the following program prints the name of the color space followed by each color component scaled to the 0-255 range:

Here is the code:

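import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFStreamEngine;
import org.apache.pdfbox.util.ResourceLoader;

// Written against the PDFBox 1.x API.
PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    // Reuse the operator-to-class mappings that PageDrawer uses
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    // Process the content stream of the first page
    PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    // The graphics state now reflects the operators processed so far
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255); // scale each component to 0-255
    }
} finally {
    if (doc != null) {
        doc.close();
    }
}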
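The sample multiplies each float component by 255, which can print values such as 145.99998 rather than a whole number. A small helper, sketched below in plain Java, rounds the components instead. It assumes the components lie in the range 0 to 1, as the multiplication in the sample implies; the names ColourComponents, toEightBit and toHex are hypothetical.

```java
import java.util.Arrays;

final class ColourComponents {
    /** Scales components in [0, 1] to integers in 0..255, rounding to nearest. */
    static int[] toEightBit(float[] components) {
        int[] out = new int[components.length];
        for (int i = 0; i < components.length; i++) {
            // Clamp first so slightly out-of-range values cannot overflow 0..255
            float clamped = Math.max(0f, Math.min(1f, components[i]));
            out[i] = Math.round(clamped * 255f);
        }
        return out;
    }

    /** Formats three 8-bit components as a hex string; RGB only. */
    static String toHex(int[] rgb) {
        if (rgb.length != 3) {
            throw new IllegalArgumentException("expected 3 components");
        }
        return String.format("#%02X%02X%02X", rgb[0], rgb[1], rgb[2]);
    }

    public static void main(String[] args) {
        // The sample colour from the answer, expressed as fractions of 255
        float[] sample = {146 / 255f, 208 / 255f, 80 / 255f};
        int[] rgb = toEightBit(sample);
        System.out.println(Arrays.toString(rgb) + " " + toHex(rgb));
    }
}
```

Color spaces other than DeviceRGB (gray, CMYK) have a different number of components, so toHex deliberately rejects anything that is not three values.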

Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.

As I understand it, as PDFStreamEngine processes a page stream, it updates its state according to the operator it is currently processing. So when it reaches the green text, it changes the PDGraphicsState because it encounters the corresponding operators. For CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace, as defined by the mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor, and so on.
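The mapping idea can be sketched without PDFBox. The toy below loads an operator-to-handler table from properties text and applies each operator to a tiny state object. It is only an illustration of the dispatch pattern, not PDFBox's implementation: the State class, the handler names and the one-operator-per-line input format are all assumptions made for the example (the real file maps operators to class names).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.BiConsumer;

final class ToyOperatorDispatch {
    /** Minimal stand-in for a graphics state; not PDFBox's PDGraphicsState. */
    static final class State {
        String strokingColourSpace = "DeviceGray";
        float[] strokingColour = {0f};
    }

    // Hypothetical mapping text in .properties syntax: operator=handler name
    static final String MAPPING = "CS=setStrokingColourSpace\nRG=setStrokingRgb\n";

    static final Map<String, BiConsumer<State, String[]>> HANDLERS = new HashMap<>();
    static {
        HANDLERS.put("setStrokingColourSpace", (s, args) ->
                s.strokingColourSpace = args[0].startsWith("/") ? args[0].substring(1) : args[0]);
        HANDLERS.put("setStrokingRgb", (s, args) -> {
            // RG selects DeviceRGB and sets the stroking colour in one step
            s.strokingColourSpace = "DeviceRGB";
            s.strokingColour = new float[] {
                Float.parseFloat(args[0]), Float.parseFloat(args[1]), Float.parseFloat(args[2])};
        });
    }

    /** Runs lines of the form "operand ... operator" against a fresh state. */
    static State run(String content) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(MAPPING));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        State state = new State();
        for (String line : content.split("\n")) {
            String[] tokens = line.trim().split("\\s+");
            String operator = tokens[tokens.length - 1];
            String handlerName = mapping.getProperty(operator);
            if (handlerName == null) {
                continue; // operators without a mapping are ignored
            }
            HANDLERS.get(handlerName).accept(state, Arrays.copyOf(tokens, tokens.length - 1));
        }
        return state;
    }

    public static void main(String[] args) {
        State s = run("/DeviceRGB CS\n0.5 0.25 1 RG\nBT\n");
        System.out.println(s.strokingColourSpace + " " + Arrays.toString(s.strokingColour));
    }
}
```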

In this case, the PDGraphicsState left at the end of processing still describes the sample text, because the document has only text and that text is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just as PageDrawer, PDFTextStripper and other classes do) to do something when the color changes. You could also write your own mappings in your own .properties file.
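The subclassing approach might look roughly like the following, shown here with a toy engine in plain Java rather than PDFBox itself. ToyEngine, its hooks colourChanged and showText, and the simplified one-operator-per-line input are assumptions for illustration; with PDFBox, the subclass would extend PDFStreamEngine and read the color from its graphics state instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColourTrackingDemo {
    /** Toy stand-in for a stream engine with overridable hooks. */
    static class ToyEngine {
        protected float[] fillColour = {0f, 0f, 0f};

        /** Handles "r g b rg" (set fill colour) and "(text) Tj" (show text). */
        final void process(String content) {
            for (String line : content.split("\n")) {
                String[] t = line.trim().split("\\s+");
                String op = t[t.length - 1];
                if (op.equals("rg") && t.length == 4) {
                    fillColour = new float[] {
                        Float.parseFloat(t[0]), Float.parseFloat(t[1]), Float.parseFloat(t[2])};
                    colourChanged(fillColour);
                } else if (op.equals("Tj")) {
                    showText(line.substring(line.indexOf('(') + 1, line.lastIndexOf(')')));
                }
            }
        }

        protected void colourChanged(float[] rgb) { }

        protected void showText(String text) { }
    }

    /** Records each piece of text with the fill colour in effect when it was shown. */
    static final class Tracker extends ToyEngine {
        final List<String> runs = new ArrayList<>();

        @Override
        protected void showText(String text) {
            runs.add(text + " " + Arrays.toString(fillColour));
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.process("1 0 0 rg\n(Red text) Tj\n0 0 1 rg\n(Blue text) Tj");
        tracker.runs.forEach(System.out::println);
    }
}
```

Capturing the color at the moment the text is shown, rather than reading the state once after the whole page is processed, is what allows a document with several text colors to be handled.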
