java - How to get raw text from a PDF file using Java

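I have some PDF files. Using PDFBox, I converted them to text and saved the result as text files. Now I want to remove the following from those text files:

Hyperlinks

All special characters

Blank lines

Page headers and footers of the PDF files

List markers such as "1)", "2)", "a)", bullets, etc.

I want to get the valid text line by line, like this:

How can I achieve this?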

Best answer

Using PDFBox, this can be done as in the following example:

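import java.io.File;
import java.io.FileInputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Written against the PDFBox 1.x API, as implied by the calls used here.
// Later major versions changed the PDFParser constructor and moved
// PDFTextStripper to a different package.
public class PdfToText {

    public static void main(String[] args) {
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;

        String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
        File file = new File(fileName);
        try {
            // Parse the PDF into its low-level COS object model.
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            cosDoc = parser.getDocument();

            // Wrap the parsed document and extract its text.
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            String parsedText = pdfStripper.getText(pdDoc);

            // Keep only letters, digits, periods and spaces.
            // Note: this also removes line breaks, so the output is one long line.
            System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the document whether or not an exception occurred.
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    }
}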
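
The regular expression above also strips line breaks, so it cannot give the line-by-line output the question asks for. A minimal sketch of a line-wise cleanup using only the JDK might look like the following. The class name `TextCleaner`, the method `cleanLines`, and the exact patterns for URLs and list markers are assumptions chosen for illustration, not part of PDFBox or of the original answer. Header and footer removal is not covered, since it depends on the layout of the particular PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: cleans already-extracted text one line at a time.
public class TextCleaner {

    // Assumed URL shape: starts with http://, https:// or www.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    // Leading list markers such as "1)", "2.", "a)" or a bullet character.
    private static final Pattern LIST_MARKER =
            Pattern.compile("^\\s*(?:\\d+[.)]|[A-Za-z]\\)|[\\u2022\\u25CF\\u25AA*-])\\s+");

    // Same character whitelist as the answer: letters, digits, '.' and space.
    private static final Pattern SPECIAL = Pattern.compile("[^A-Za-z0-9. ]+");

    private static final Pattern SPACES = Pattern.compile(" {2,}");

    public static List<String> cleanLines(String text) {
        List<String> result = new ArrayList<>();
        // "\\R" matches any line break, so the line structure is preserved.
        for (String line : text.split("\\R")) {
            String s = URL.matcher(line).replaceAll(" ");
            s = LIST_MARKER.matcher(s).replaceFirst("");
            s = SPECIAL.matcher(s).replaceAll(" ");
            s = SPACES.matcher(s).replaceAll(" ").trim();
            if (!s.isEmpty()) {   // drop blank lines
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "1) Introduction to JDBC\n\n"
                + "\u2022 See http://example.com/jdbc for details!\n"
                + "   \n"
                + "a) Drivers & connections\n";
        cleanLines(sample).forEach(System.out::println);
    }
}
```

For the sample input in `main`, this prints three lines: `Introduction to JDBC`, `See for details` and `Drivers connections`. To use it with the answer's code, pass `parsedText` to `cleanLines` instead of calling `replaceAll` on the whole string.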