Question
I'm trying to clean up the text inside a rectangle in a PDF document using iText.
The following is the piece of code I'm using:
    PdfReader pdfReader = null;
    PdfStamper stamper = null;
    try {
        int pageNo = 1;
        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 202.3);
        linkBounds.add(1, (float) 588.6);
        linkBounds.add(2, (float) 265.8);
        linkBounds.add(3, (float) 599.7);
        pdfReader = new PdfReader("Test1.pdf");
        stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));
        Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            stamper.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        pdfReader.close();
    }
After executing this piece of code, the entire line of text is cleared instead of only the text inside the given rectangle.
To explain things better, I have attached PDF documents.
In the input PDF, I have highlighted the text to show the rectangle I'm specifying for the clean-up.
In the output PDF, you can clearly see the grey rectangle, but you will also notice that the whole line of text has been cleaned up.
Any help will be appreciated.
Solution

The files input.pdf and output.pdf the OP originally presented did not allow reproducing the issue; instead they did not seem to match at all. Thus, the original answer essentially demonstrated that the issue could not be reproduced.

The second set of files, Test1.pdf and Test2.pdf, did allow reproducing the issue, though, giving rise to the updated answer.

Updated answer referring to the OP's second set of sample files
There indeed is an issue in the current (up to 5.5.8) iText clean-up code: In case of tagged files, some methods of PdfContentByte used here introduced extra instructions into the content stream, which actually damaged it and relocated some text in the eyes of PDF viewers that ignored the damage.

In more detail: PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately, in case of tagged files these methods check whether the canvas under construction currently is in a text object and (if not) start a text object. This check depends on a local flag set by beginText, but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence, damaging the stream and relocating the following text.

    private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                                 float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
        canvas.setCharacterSpacing(0);
        canvas.setWordSpacing(0);
        ...
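The mechanism described above can be illustrated with a small toy model. This is only a sketch: ToyCanvas and all of its methods are hypothetical stand-ins, not iText's actual PdfContentByte code; they merely mimic the flag check as this answer describes it.

```java
/** Toy model (hypothetical, NOT iText code) of a canvas tracking text objects with a local flag. */
class ToyCanvas {
    private final StringBuilder buffer = new StringBuilder();
    private final boolean tagged;
    private boolean inText = false; // only ever set by beginText()

    ToyCanvas(boolean tagged) {
        this.tagged = tagged;
    }

    void beginText() {
        inText = true;
        buffer.append("BT\n");
    }

    /** Appends operators verbatim without touching the flag, like a BT copied from the original stream. */
    void appendRaw(String operators) {
        buffer.append(operators).append('\n');
    }

    void setCharacterSpacing(float value) {
        if (tagged && !inText) {
            // The flag claims we are outside a text object, so a new one is started.
            beginText();
            buffer.append("1 0 0 1 0 0 Tm\n");
        }
        buffer.append(value).append(" Tc\n");
    }

    String content() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        ToyCanvas canvas = new ToyCanvas(true);
        canvas.appendRaw("BT");               // text object opened behind the flag's back
        canvas.appendRaw("45 0 0 45 0 0 Tm");
        canvas.setCharacterSpacing(0);        // emits a second BT and resets the text matrix
        System.out.print(canvas.content());
    }
}
```

In this model the second BT appears only when the canvas is marked as tagged and the text object was opened without beginText, which mirrors the combination of conditions named in the answer.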
PdfCleanUpContentOperator.writeTextChunks should instead use hand-crafted Tc and Tw instructions to not trigger this side effect.

    private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                                 float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
        if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
            new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append(Tc);
        }

        if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
            new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append(Tw);
        }

        canvas.getInternalBuffer().append((byte) '[');
        ...
With this change in place, the OP's new sample file "Test1.pdf" is properly redacted by the sample code:

    @Test
    public void testRedactJavishsTest1() throws IOException, DocumentException {
        try ( InputStream resource = getClass().getResourceAsStream("Test1.pdf");
              OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) ) {
            PdfReader reader = new PdfReader(resource);
            PdfStamper stamper = new PdfStamper(reader, result);

            List<Float> linkBounds = new ArrayList<Float>();
            linkBounds.add(0, (float) 202.3);
            linkBounds.add(1, (float) 588.6);
            linkBounds.add(2, (float) 265.8);
            linkBounds.add(3, (float) 599.7);
            Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));

            List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
            cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

            PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
            cleaner.cleanUp();

            stamper.close();
            reader.close();
        }
    }
Original answer referring to the OP's original sample files
I just tried to reproduce your issue using this test method:

    @Test
    public void testRedactJavishsText() throws IOException, DocumentException {
        try ( InputStream resource = getClass().getResourceAsStream("input.pdf");
              OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) ) {
            PdfReader reader = new PdfReader(resource);
            PdfStamper stamper = new PdfStamper(reader, result);

            List<Float> linkBounds = new ArrayList<Float>();
            linkBounds.add(0, (float) 200.7);
            linkBounds.add(1, (float) 547.3);
            linkBounds.add(2, (float) 263.3);
            linkBounds.add(3, (float) 558.4);
            Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));

            List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
            cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

            PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
            cleaner.cleanUp();

            stamper.close();
            reader.close();
        }
    }
For your source PDF, the result I got was correctly redacted and did not look like your output.pdf (the screenshots originally shown here are not preserved).
I even re-tested using iText version 5.5.5, which you mention in a comment, and also 5.5.4, but in all cases I got the correct result.
Thus, I cannot reproduce your issue.
I had a closer look at your output.pdf. It is a bit peculiar; in particular, it does not contain certain blocks typical for PDFs created or manipulated by current iText versions. Furthermore, the content streams look extremely different.
Thus, I assume that after iText redacted your file, some other tool post-processed it and in doing so damaged it.
In particular, the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:

    q
    0.24 0 0 0.24 113.7055 548.04 cm
    BT
    0.0116 Tc
    45 0 0 45 0 0 Tm
    /TT5 1 Tf
    [...] TJ

and like this in the version I received directly from iText:

    q
    0.24 0 0 0.24 113.7055 548.04 cm
    BT
    0.0116 Tc
    45 0 0 45 0 0 Tm
    /TT5 1 Tf
    0 Tc
    0 Tw
    [...] TJ

but the corresponding lines in your output.pdf look like this:

    BT
    1 0 0 1 113.3 548.5 Tm
    0 Tc
    BT
    1 0 0 1 0 0 Tm
    0 Tc
    [...] TJ
Here the instructions in your output.pdf are
- invalid: inside a text object BT ... ET there may be no other text object, but you have two BT operations following each other without an ET in between;
- effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.
And indeed, if you look at the bottom of your output.pdf page, you'll see the relocated text there.
So if my assumption is correct that some other program post-processes the iText result, you should repair that post-processor.
If there is no such post-processor, you seem to be using not the officially published iText version but something altogether different.
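To check a decoded content stream for the doubled BT problem described above, a rough scan along the following lines may help. This is a minimal sketch, not part of iText: TextObjectCheck and findNestedBT are hypothetical names, and the whitespace tokenization is a simplifying assumption that would misfire on string literals or inline image data containing "BT" or "ET". The trailing ET in the sample is added here only to close the text object.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (NOT part of iText): finds BT operators inside an already open text object. */
class TextObjectCheck {

    /** Returns the token index of every BT that appears before the previous BT was closed by ET. */
    static List<Integer> findNestedBT(String contentStream) {
        List<Integer> nested = new ArrayList<>();
        boolean inTextObject = false;
        // Naive assumption: operators and operands are separated by whitespace only.
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("BT")) {
                if (inTextObject) {
                    nested.add(i); // a text object is started inside a text object
                }
                inTextObject = true;
            } else if (tokens[i].equals("ET")) {
                inTextObject = false;
            }
        }
        return nested;
    }

    public static void main(String[] args) {
        String damaged = "BT 1 0 0 1 113.3 548.5 Tm 0 Tc BT 1 0 0 1 0 0 Tm 0 Tc [...] TJ ET";
        System.out.println("Nested BT at token indexes: " + findNestedBT(damaged));
    }
}
```

An empty result does not prove the stream is valid; it only means this particular defect was not found by the simplified scan.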