Question
I am using PDFBox to validate a PDF document. One requirement is to check whether the following types of text are present in the PDF:
- Artificial bold style text
- Artificial italic style text
- Artificial outline style text
I searched the PDFBox API but was unable to find anything that does this.
Can anyone tell me how to detect the different types of artificial font/text styles present in a PDF using PDFBox?
Recommended answer
The general procedure and a PDFBox issue
In theory one should start by deriving a class from PDFTextStripper and overriding its method:
/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}
Your override should then use the List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with the information on the graphics state that was active when that letter was drawn.
Unfortunately the textPositions list does not contain the correct contents in the current version 1.8.3. For example, for the line "This is normal text." from your PDF, the method writeString is called four times, once each for the strings "This", " is", " normal", and " text." Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, " text."
This turned out to have already been recognized as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.
That being said, as soon as you have a PDFBox version with this fix, you can check for some artificial styles as follows.
Artificial italic text is created like this in the page content:
BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...
The relevant part happens when the text matrix is set with Tm. The 5.10137 is the factor by which the text is sheared.
When you inspect a TextPosition textPosition as indicated above, you can query this value using
textPosition.getTextPos().getValue(1, 0)
If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backwards italics.
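What counts as "significantly" depends on the font size, because the shear entry is scaled together with the rest of the text matrix. One possible way to make the test size-independent is to convert the shear into a slant angle. The following is only a sketch: ShearCheck, slantDegrees, classify and the threshold parameter are hypothetical names that are not part of PDFBox, and the sketch assumes that value (1, 1) of the same matrix holds the vertical scale (24 in the example above) and that this scale is positive.

```java
// Hypothetical helper, not part of PDFBox.
final class ShearCheck {

    /**
     * Slant angle in degrees, measured from the vertical.
     * shear is matrix value (1, 0), verticalScale is matrix value (1, 1).
     * Assumes verticalScale > 0, i.e. the text is not flipped.
     */
    static double slantDegrees(float shear, float verticalScale) {
        return Math.toDegrees(Math.atan2(shear, verticalScale));
    }

    /** minDegrees is an arbitrary threshold chosen by the caller. */
    static String classify(float shear, float verticalScale, double minDegrees) {
        double angle = slantDegrees(shear, verticalScale);
        if (angle > minDegrees) {
            return "artificial italic";
        }
        if (angle < -minDegrees) {
            return "artificial backwards italic";
        }
        return "upright";
    }
}
```

For the matrix 24 0 5.10137 24 from the example, the ratio 5.10137 / 24 is about 0.2126, which corresponds to a slant of roughly 12 degrees.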
Artificial bold and artificial outline text print each letter twice using differing rendering modes; e.g. the capital 'T' in the bold case:
0 0 0 1 k
...
BT
/F0 1 Tf
24 0 0 24 66.36 729.86 Tm
<03>Tj
4 M 0.72 w
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET
(i.e. the letter is first drawn in regular mode, filling the letter area, and then drawn in outline mode, stroking a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)
and in the outline case:
BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET
(i.e. the letter is first drawn in regular mode in white, CMYK 0, 0, 0, 0, filling the letter area, and then drawn in outline mode, stroking a line along the letter border in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined, black on white letter.)
Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore, it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.
If you really need to do so, you would have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.
Above I wrote that PDFTextStripper does not keep track of the text rendering mode. This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.
I also wrote that it explicitly drops duplicate character occurrences in approximately the same position. You can disable this by calling setSuppressDuplicateOverlappingText(false).
With these two changes you should be able to make the required tests for artificial bold and outline, too.
The latter change might not even be necessary if you store and check the styles early in processTextPosition.
As mentioned in the corrections above, it is indeed possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.
To this the OP commented that
This was a bit surprising at first, but after looking at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:
# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k
Thus color setting operators (especially those for CMYK colors, as in the present document) are ignored in this context! Fortunately the implementations of these operators for the PageDrawer can be used in this context, too.
So the following proof-of-concept shows how all the required information can be retrieved.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Proof of concept for PDFBox 1.8.x (1.8.4 or later, see PDFBOX-1804).
public class TextWithStateStripperSimple extends PDFTextStripper
{
    // Graphics state captured for each glyph while it is processed.
    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    public TextWithStateStripperSimple() throws IOException {
        super();
        // Keep the second printing of each glyph used for bold and outline.
        setSuppressDuplicateOverlappingText(false);
        // The text stripper ignores K and k by default; reuse the PageDrawer operators.
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        // Record the state before the superclass gets a chance to drop the glyph.
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        // One line per glyph: character, shear, position, rendering mode, stroking and non-stroking color.
        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    // Formats the color components, each preceded by a space.
    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";

        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }
        return builder.toString();
    }
}
Using this you get the period '.' in normal text as:
. - shear by 0.0 - 256.5701 88.6875 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
in artificial bold text as:
. - shear by 0.0 - 378.86 122.140015 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
in artificial italic text as:
. - shear by 5.10137 - 327.121 156.4123 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
and in artificial outline text as:
. - shear by 0.0 - 357.25 190.25 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
So there you are: all the information required for recognizing those artificial styles. Now you merely have to analyze the data.
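As a rough illustration of such an analysis, the sketch below classifies a pair of glyph records that were printed at the same position. It is a heuristic under stated assumptions, not a definitive test: ArtificialStyleCheck and classify are hypothetical names, the first record is expected to be the filled glyph (rendering mode 0) and the second the stroked one (rendering mode 1), and a fill color that differs from the stroke color is simply taken to mean outline.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class ArtificialStyleCheck {
    static final int FILL = 0;    // text rendering mode 0: fill the glyph
    static final int STROKE = 1;  // text rendering mode 1: stroke the glyph border

    /**
     * firstFill is the non-stroking color of the first printing,
     * secondStroke is the stroking color of the second printing.
     */
    static String classify(int firstMode, float[] firstFill,
                           int secondMode, float[] secondStroke) {
        if (firstMode != FILL || secondMode != STROKE) {
            return "unknown";
        }
        // Same color for fill and border: the border only thickens the glyph.
        if (Arrays.equals(firstFill, secondStroke)) {
            return "artificial bold";
        }
        // Different colors (e.g. white fill, black border): treated as outline.
        return "artificial outline";
    }
}
```

With the values from the output above, the bold pair (fill 0 0 0 1, stroke 0 0 0 1) is reported as artificial bold and the outline pair (fill 0 0 0 0, stroke 0 0 0 1) as artificial outline.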
BTW, have a look at the artificial bold case: the coordinates might not always be identical but merely very similar. Thus, some leniency is required when testing whether two text position objects describe the same position.
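A lenient comparison could look like the following minimal sketch. PositionCheck and samePosition are hypothetical names, and the tolerance is an assumption that would need tuning for real documents.

```java
// Hypothetical helper, not part of PDFBox.
final class PositionCheck {

    /** True if both coordinates differ by at most the given tolerance. */
    static boolean samePosition(float x1, float y1, float x2, float y2, float tolerance) {
        return Math.abs(x1 - x2) <= tolerance && Math.abs(y1 - y2) <= tolerance;
    }
}
```

The two bold records above (378.86 versus 378.86002 at the same y) would then be treated as the same position with a tolerance such as 0.01, whereas glyphs from different lines would not.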