Problem description
I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text I am extracting, but I cannot find any way of getting that information.
Is it possible to use PDFBox to get the colour information of a document and, if so, how would I go about it?
Thanks very much.
Recommended answer
All color information should be stored in the class PDGraphicsState, and which color is used (stroking, non-stroking, etc.) depends on the text rendering mode in use (via the pdfbox mailing list).
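To make the dependence on the text rendering mode concrete, here is a minimal sketch of the eight modes defined by the PDF specification (the operand of the Tr operator) and the color each one paints with. TextRenderingModeColors and Paint are hypothetical names chosen for this illustration; they are not part of PDFBox.

```java
// Illustrative helper, not part of PDFBox: maps a PDF text rendering mode
// (the operand of the Tr operator) to the color(s) used to paint glyphs.
class TextRenderingModeColors {

    enum Paint { NON_STROKING, STROKING, BOTH, NONE }

    static Paint paintFor(int renderingMode) {
        switch (renderingMode) {
            case 0: // fill
            case 4: // fill, then add to the clipping path
                return Paint.NON_STROKING;
            case 1: // stroke
            case 5: // stroke, then add to the clipping path
                return Paint.STROKING;
            case 2: // fill, then stroke
            case 6: // fill, stroke, then add to the clipping path
                return Paint.BOTH;
            case 3: // invisible
            case 7: // add to the clipping path only
                return Paint.NONE;
            default:
                throw new IllegalArgumentException(
                        "Unknown text rendering mode: " + renderingMode);
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println("Tr " + mode + " -> " + paintFor(mode));
        }
    }
}
```

The default mode is 0 (fill), so for ordinary text the non-stroking color is usually the one of interest.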
Here is a small sample I tried:
After creating a PDF with just one line ("Sample" written in RGB = [146,208,80]), the following program prints the name of the stroking color space, followed by each of its component values multiplied by 255.
Here is the code:
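    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
    import org.apache.pdfbox.util.PDFStreamEngine;
    import org.apache.pdfbox.util.ResourceLoader;

    // PDFBox 1.x API; the statements below belong inside a method that declares "throws IOException".
    PDDocument doc = null;
    try {
        doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
        // Register the operator handlers listed in PageDrawer.properties.
        PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
        PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        engine.processStream(page, page.findResources(), page.getContents().getStream());
        // The graphics state as it stands after the last operator on the page.
        PDGraphicsState graphicState = engine.getGraphicsState();
        // Stroking color; text painted in fill mode uses getNonStrokingColor() instead.
        System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
        float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
        for (float c : colorSpaceValues) {
            // Components are stored in the range 0..1; scale them to 0..255.
            System.out.println(c * 255);
        }
    } finally {
        if (doc != null) {
            doc.close();
        }
    }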
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.
As I understand it, as PDFStreamEngine processes a page stream, it updates its state according to the operator it is currently processing. So when it reaches the green text, it changes the PDGraphicsState because it encounters the corresponding operators. For CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace, as defined by the mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor, and so on.
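The operator-to-class lookup described above can be illustrated with plain java.util.Properties. This is only a sketch of the mapping idea, not PDFBox's own loading code: it contains just the two entries quoted in this answer, and OperatorMappingDemo and handlerClassFor are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: shows how "operator=handler class" lines in a
// .properties file resolve to a class name. Not PDFBox's actual loader.
class OperatorMappingDemo {

    // The two mappings quoted in the answer, in .properties syntax.
    static final String MAPPINGS =
            "CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace\n"
          + "RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor\n";

    // Returns the handler class name for an operator, or null if unmapped.
    static String handlerClassFor(String operator) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(MAPPINGS));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(operator);
    }

    public static void main(String[] args) {
        System.out.println("CS -> " + handlerClassFor("CS"));
        System.out.println("RG -> " + handlerClassFor("RG"));
    }
}
```

Keys are case-sensitive, which matters because PDF uses upper-case operators (CS, RG) for stroking colors and lower-case ones (cs, rg) for non-stroking colors.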
In this case, the PDGraphicsState hasn't changed by the time the stream has been processed, because the document contains only text and that text is in a single style. For anything more advanced, you would need to extend PDFStreamEngine (just as PageDrawer, PDFTextStripper and other classes do) to do something whenever the color changes. You could also write your own mappings in your own .properties file.
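The idea of reacting to color changes while the stream is processed can be shown without PDFBox at all. The toy below is a sketch under heavy assumptions: tokens are separated by whitespace, text strings contain no spaces, and only the rg (non-stroking RGB) and Tj (show text) operators are understood. ColorTrackingSketch and textColors are hypothetical names; a real implementation would subclass PDFStreamEngine instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy stand-in for a stream engine: tracks the current non-stroking color
// and records it every time a text-showing operator is encountered.
class ColorTrackingSketch {

    // Returns one "r,g,b" entry (0-255) per Tj operator, in stream order.
    static List<String> textColors(String contentStream) {
        float[] nonStroking = {0f, 0f, 0f};           // PDF default color is black
        Deque<Float> operands = new ArrayDeque<>();   // numeric operands seen so far
        List<String> found = new ArrayList<>();

        for (String token : contentStream.trim().split("\\s+")) {
            if (token.equals("rg")) {
                if (operands.size() < 3) {
                    throw new IllegalArgumentException("rg needs 3 operands");
                }
                // Operands were pushed r, g, b, so they pop in reverse order.
                float b = operands.pop(), g = operands.pop(), r = operands.pop();
                nonStroking = new float[] {r, g, b};
                operands.clear();
            } else if (token.equals("Tj")) {
                // Text is being shown: record the color in effect right now.
                found.add(Math.round(nonStroking[0] * 255) + ","
                        + Math.round(nonStroking[1] * 255) + ","
                        + Math.round(nonStroking[2] * 255));
                operands.clear();
            } else {
                try {
                    operands.push(Float.parseFloat(token));
                } catch (NumberFormatException e) {
                    operands.clear();                 // string operand or other operator
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(textColors("BT 0.573 0.816 0.314 rg (Sample) Tj ET"));
    }
}
```

The sketch ignores graphics state save and restore (q/Q), the stroking operators, other color spaces and the text rendering mode, all of which a real PDFStreamEngine subclass gets from the engine's PDGraphicsState.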