iText - Cleaning up text in a rectangle without clearing the whole line

Problem description

    I'm trying to clean up the text inside a rectangle in a PDF document using iText.

    Following is the piece of code I’m using:

    PdfReader pdfReader = null;
    PdfStamper stamper = null;
    try
    {
        int pageNo = 1;
    
        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 202.3);
        linkBounds.add(1, (float) 588.6);
        linkBounds.add(2, (float) 265.8);
        linkBounds.add(3, (float) 599.7);
    
        pdfReader = new PdfReader("Test1.pdf");
        stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));
    
        Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
    
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
    finally
    {
        try {
            stamper.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
        pdfReader.close();
    }
    

    After executing this piece of code, the entire line of text is cleared instead of only the text inside the given rectangle.

    To explain things better, I have attached PDF documents.

    In the input PDF, I have highlighted the text to show the rectangle I'm specifying for clean-up.

    In the output PDF, you can clearly see the grey rectangle, but you will also notice that the whole line of text was cleaned up.

    Any help will be appreciated.

    Solution

    The files input.pdf and output.pdf that the OP originally presented did not reproduce the issue and did not even seem to match each other. Thus, the original answer essentially demonstrated that the issue could not be reproduced.

    The second set of files, Test1.pdf and Test2.pdf, did reproduce the issue, giving rise to the updated answer.

    Updated answer referring to the OP's second set of sample files

    There is indeed an issue in the current (up to 5.5.8) iText clean-up code: for tagged files, some PdfContentByte methods used there introduce extra instructions into the content stream. These instructions damage the stream, and PDF viewers that ignore the damage display some text in a different location.

    In more detail:

    PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately these methods in case of tagged files check whether the canvas under construction currently is in a text object and (if not) start a text object. This check depends on a local flag set by beginText; but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence damaging the stream and relocating the following text.

    private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                                 float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
        canvas.setCharacterSpacing(0);
        canvas.setWordSpacing(0);
        ...
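
The side effect described above can be illustrated with a toy model. This is only a sketch and not iText code: the class `TaggedCanvasSketch` and its methods are hypothetical stand-ins for `PdfContentByte`, and the model assumes the flag is changed only by `beginText()` and `endText()`, as the answer describes.

```java
/** Toy stand-in for a tagged PdfContentByte; hypothetical, not iText code. */
final class TaggedCanvasSketch {
    private final StringBuilder content = new StringBuilder();
    private final boolean tagged;
    // Local flag: only beginText()/endText() ever change it.
    private boolean inText = false;

    TaggedCanvasSketch(boolean tagged) {
        this.tagged = tagged;
    }

    void beginText() {
        inText = true;
        content.append("BT\n");
    }

    void endText() {
        inText = false;
        content.append("ET\n");
    }

    /** Copies operators verbatim, as clean-up does; the flag is NOT updated. */
    void appendRaw(String operators) {
        content.append(operators);
    }

    void setCharacterSpacing(float value) {
        if (tagged && !inText) {
            // The canvas believes no text object is open and starts one,
            // emitting the extra sequence quoted in the answer.
            beginText();
            content.append("1 0 0 1 0 0 Tm\n");
        }
        content.append(format(value)).append(" Tc\n");
    }

    String content() {
        return content.toString();
    }

    private static String format(float value) {
        // Print whole numbers without a fractional part, e.g. "0" instead of "0.0".
        return value == (long) value ? Long.toString((long) value) : Float.toString(value);
    }

    public static void main(String[] args) {
        TaggedCanvasSketch canvas = new TaggedCanvasSketch(true);
        // The text object is opened by copied operators, not by beginText().
        canvas.appendRaw("BT\n45 0 0 45 0 0 Tm\n");
        canvas.setCharacterSpacing(0);
        // The printed stream now contains a second BT inside the open text object.
        System.out.print(canvas.content());
    }
}
```

In this model, an untagged canvas (`new TaggedCanvasSketch(false)`) writes only the `Tc` instruction, which matches the observation that only tagged files are affected.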
    

    PdfCleanUpContentOperator.writeTextChunks instead should use hand-crafted Tc and Tw instructions to not trigger this side effect.

    private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                                 float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
        if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
            new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append(Tc);
        }
        if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
            new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append(Tw);
        }
        canvas.getInternalBuffer().append((byte) '[');
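
The guard in the fixed method checks against both `0.0f` and `-0.0f` because `Float.compare` distinguishes positive and negative zero. The following sketch isolates that check; the class `SpacingGuardSketch` and its method names are hypothetical and only mirror the condition shown above, producing the raw instruction as a string instead of writing to an iText buffer.

```java
/** Isolates the zero check used before writing "0 Tc" / "0 Tw"; hypothetical helper. */
final class SpacingGuardSketch {

    /** True when the value is neither +0.0f nor -0.0f. */
    static boolean isNonZero(float value) {
        // Float.compare(0.0f, -0.0f) is not 0, so both zeros must be tested.
        return Float.compare(value, 0.0f) != 0 && Float.compare(value, -0.0f) != 0;
    }

    /**
     * Returns the raw reset instruction for the given operator ("Tc" or "Tw"),
     * or an empty string when the current spacing is already zero.
     */
    static String resetInstruction(float currentValue, String operator) {
        return isNonZero(currentValue) ? "0 " + operator + "\n" : "";
    }

    public static void main(String[] args) {
        // Character spacing 0.0116 comes from the sample stream in this thread.
        System.out.print(resetInstruction(0.0116f, "Tc"));
        // Word spacing is already zero here, so nothing is written.
        System.out.print(resetInstruction(-0.0f, "Tw"));
    }
}
```

Writing the operator bytes directly avoids any call that might decide to open a text object on its own, which is the point of the hand-crafted instructions.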
    

    With this change in place, the OP's new sample file "Test1.pdf" is properly redacted by the sample code:

    @Test
    public void testRedactJavishsTest1() throws IOException, DocumentException
    {
        try (   InputStream resource = getClass().getResourceAsStream("Test1.pdf");
                OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
        {
            PdfReader reader = new PdfReader(resource);
            PdfStamper stamper = new PdfStamper(reader, result);
    
            List<Float> linkBounds = new ArrayList<Float>();
            linkBounds.add(0, (float) 202.3);
            linkBounds.add(1, (float) 588.6);
            linkBounds.add(2, (float) 265.8);
            linkBounds.add(3, (float) 599.7);
    
            Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
            List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
            cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
    
            PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
            cleaner.cleanUp();
    
            stamper.close();
            reader.close();
        }
    }
    

    (RedactText.java)

    Original answer referring to the OP's original sample files

    I just tried to reproduce your issue using this test method:

    @Test
    public void testRedactJavishsText() throws IOException, DocumentException
    {
        try (   InputStream resource = getClass().getResourceAsStream("input.pdf");
                OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
        {
            PdfReader reader = new PdfReader(resource);
            PdfStamper stamper = new PdfStamper(reader, result);
    
            List<Float> linkBounds = new ArrayList<Float>();
            linkBounds.add(0, (float) 200.7);
            linkBounds.add(1, (float) 547.3);
            linkBounds.add(2, (float) 263.3);
            linkBounds.add(3, (float) 558.4);
    
            Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
            List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
            cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
    
            PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
            cleaner.cleanUp();
    
            stamper.close();
            reader.close();
        }
    }
    

    (RedactText.java)

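    For your source PDF, my result was correctly redacted and did not look like your output.pdf. (The screenshots of the source, my result, and your result are not preserved.)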

    I even re-tested using iText version 5.5.5, which you mention in a comment, and also 5.5.4, but in all cases I got the correct result.

    Thus, I cannot reproduce your issue.


    I had a closer look at your output.pdf. It is a bit peculiar; in particular, it does not contain certain blocks typical of PDFs created or manipulated by current iText versions. Furthermore, the content streams look extremely different.

    Thus, I assume that after iText redacted your file, some other tool post-processed it and, in doing so, damaged it.

    In particular the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:

    q
    0.24 0 0 0.24 113.7055 548.04 cm
    BT
    0.0116 Tc
    45 0 0 45 0 0 Tm
    /TT5 1 Tf
    [...] TJ
    

    and like this in the version I received directly from iText:

    q
    0.24 0 0 0.24 113.7055 548.04 cm
    BT
    0.0116 Tc
    45 0 0 45 0 0 Tm
    /TT5 1 Tf
    0 Tc
    0 Tw
    [...] TJ
    

    but the corresponding lines in your output.pdf look like this

    BT
    1 0 0 1 113.3 548.5 Tm
    0 Tc
    BT
    1 0 0 1 0 0 Tm
    0 Tc
    [...] TJ
    

    Here the instructions in your output.pdf are

    • invalid: inside a text object BT ... ET there may be no other text object, but you have two BT operations following each other without an ET in between;
    • effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.
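
A check for this kind of damage can be sketched in a few lines of Java. This is a simplified illustration, not a PDF parser: the class `TextObjectNestingCheck` is hypothetical, it splits the stream on whitespace only, and it would misfire on a literal string that happens to contain `BT` or `ET`.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive scan for nested or unbalanced BT/ET operators; hypothetical helper. */
final class TextObjectNestingCheck {

    static List<String> findProblems(String contentStream) {
        List<String> problems = new ArrayList<>();
        boolean inText = false;
        int lineNumber = 0;
        for (String line : contentStream.split("\n")) {
            lineNumber++;
            for (String token : line.trim().split("\\s+")) {
                if (token.equals("BT")) {
                    if (inText) {
                        problems.add("line " + lineNumber + ": BT inside an open text object");
                    }
                    inText = true;
                } else if (token.equals("ET")) {
                    if (!inText) {
                        problems.add("line " + lineNumber + ": ET without matching BT");
                    }
                    inText = false;
                }
            }
        }
        if (inText) {
            problems.add("end of stream: text object not closed");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Excerpt from output.pdf as quoted above; the closing ET is an
        // assumption added so that the fragment is otherwise balanced.
        String damaged = "BT\n1 0 0 1 113.3 548.5 Tm\n0 Tc\nBT\n1 0 0 1 0 0 Tm\n0 Tc\n[...] TJ\nET\n";
        for (String problem : findProblems(damaged)) {
            System.out.println(problem);
        }
    }
}
```

Run against the excerpt, the scan reports the second `BT` on line 4, which is the instruction that resets the text position to 0, 0.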

    And indeed, if you look at the bottom of your output.pdf page, you'll see the relocated text there. (Screenshot not preserved.)

    So if my assumption is correct that some other program post-processes the iText result, you should repair that post-processor.

    If there is no such post-processor, you seem to be using not the officially published iText version but something altogether different.

