This article looks at how to extract text from a PDF file.
Problem description
I need to extract text (word by word) from a PDF file.
import java.io.*;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.parser.*;

public class pdf {
    private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf";
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        // Copy every page of the input PDF into a new document
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();
        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;
        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            System.out.println(i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }
        document.close();

        // Try to print the text of each page
        PdfReader readerN = new PdfReader(OUTPUTFILE);
        PdfTextExtractor parser = new PdfTextExtractor();
        for (int i = 1; i <= n; i++)
            System.out.println(parser.getTextFromPage(reader, i));
    }
}
When I compile the code, I get this error:
How can I solve this problem?
Recommended answer
PdfTextExtractor contains only static methods, and its constructor is private, so it cannot be instantiated with new.
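To illustrate why instantiating such a class fails, here is a minimal sketch of the same pattern in plain Java. `PageLabeler` and `labelFor` are hypothetical names invented for this example and are not part of iText; they only stand in for a helper class that exposes static methods and hides its constructor.

```java
// Hypothetical stand-in for a static-only helper class (not an iText class).
final class PageLabeler {
    // A private constructor blocks "new PageLabeler()" from any other class.
    private PageLabeler() { }

    // Static method: called on the class itself, no instance required.
    static String labelFor(int pageNumber) {
        return "page " + pageNumber;
    }
}

public class StaticHelperDemo {
    public static void main(String[] args) {
        // PageLabeler labeler = new PageLabeler(); // does not compile: constructor is private
        System.out.println(PageLabeler.labelFor(1));
    }
}
```

The commented-out line mirrors the kind of statement that breaks in the question's code, and the call on the class name mirrors the fix.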
You can call it like this:
String myLine = PdfTextExtractor.getTextFromPage(reader, pageNumber);
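Since the question asks for the text word by word, one possible follow-up step is to split the extracted page string on whitespace. The sketch below assumes the page text is already available as a String (for example, the value returned by the call above). `WordSplitter` and `splitIntoWords` are hypothetical names, and the sample text is made up. Splitting on whitespace is a simplification: it keeps punctuation attached to words and depends on how the extractor orders the text.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    // Splits page text into whitespace-separated tokens, skipping empty ones.
    static List<String> splitIntoWords(String pageText) {
        List<String> words = new ArrayList<>();
        if (pageText == null) {
            return words;
        }
        for (String token : pageText.split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from one PDF page.
        String sample = "Ontology is the study\nof what exists.";
        for (String word : splitIntoWords(sample)) {
            System.out.println(word);
        }
    }
}
```

In the question's loop, the string passed to `splitIntoWords` would be the text of page `i` rather than the sample used here.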