I am trying to read a content stream and want to get the exact position (coordinates) of each string:
int size = reader.getXrefSize();
for (int i = 0; i < size; ++i)
{
    PdfObject pdfObject = reader.getPdfObject(i);
    if ((pdfObject == null) || !pdfObject.isStream())
        continue;
    PdfStream stream = (PdfStream) pdfObject;
    PdfObject obj = stream.get(PdfName.FILTER);
    if ((obj != null) && obj.toString().equals(PdfName.FLATEDECODE.toString()))
    {
        byte[] codedText = PdfReader.getStreamBytesRaw((PRStream) stream);
        byte[] text = PdfReader.FlateDecode(codedText);
        FileOutputStream o = new FileOutputStream(new File("/home..../Text" + i + ".txt"));
        o.write(text);
        o.flush();
        o.close();
    }
}
What I actually get is positions like this:
......
BT
70.9 800.9 Td /F1 14 Tf <01> Tj
10.1 0 Td <02> Tj
9.3 0 Td <03> Tj
3.9 0 Td <01> Tj
10.1 0 Td <0405> Tj
18.7 0 Td <060607> Tj
21 0 Td <08090A07> Tj
24.9 0 Td <05> Tj
10.1 0 Td <0B0C0D> Tj
28.8 0 Td <0E> Tj
3.8 0 Td <0F> Tj
8.6 0 Td <090B1007> Tj
29.5 0 Td <0B11> Tj
16.4 0 Td <12> Tj
7.8 0 Td <1307> Tj
12.4 0 Td <14> Tj
7.8 0 Td <07> Tj
3.9 0 Td <15> Tj
7.8 0 Td <16> Tj
7.8 0 Td <07> Tj
3.9 0 Td <17> Tj
10.8 0 Td <0D> Tj
7.8 0 Td <18> Tj
10.9 0 Td <19> Tj
ET
.....
But I do not know which string belongs to which position.
On the other hand, in iText I can get the plain text using
PdfReader reader = new PdfReader(new FileInputStream("/home/....xxx.pdf"));
PdfTextExtractor extract = new PdfTextExtractor(reader);
but of course without any positions at all.
So how can I get the exact position of each piece of text (string, character, ...)?
Best answer
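As plinth and David van Driessche have already pointed out in their answers, extracting text from a PDF file is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but that class is essentially a convenience utility for using iText's parser functionality when you are only interested in the plain text of a page. In your case you have to look more closely at the classes in that package.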
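To illustrate why the raw operators are only part of the picture: in the content stream from the question, the operands of each Td are an offset from the start of the previous line, not an absolute coordinate, so the offsets have to be accumulated. The following is a minimal sketch of that bookkeeping using only the Java standard library. The class name TdPositions and its whitespace tokenizer are hypothetical and written for this illustration; the sketch assumes the stream uses only BT, ET, Td, Tf and Tj with hex strings, and it ignores Tm, TD, T*, the current transformation matrix and the font encoding, all of which a real parser must handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of iText.
public class TdPositions {

    /** A shown string together with the text-space position at which it starts. */
    public static final class Positioned {
        public final double x;
        public final double y;
        public final String hex;

        Positioned(double x, double y, String hex) {
            this.x = x;
            this.y = y;
            this.hex = hex;
        }

        @Override
        public String toString() {
            return String.format(Locale.ROOT, "%.1f %.1f <%s>", x, y, hex);
        }
    }

    /** Accumulates Td offsets and records the start position of every Tj string. */
    public static List<Positioned> parse(String content) {
        List<Positioned> result = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double x = 0;
        double y = 0;
        // Simplified tokenizer: assumes every token is separated by whitespace.
        for (String token : content.trim().split("\\s+")) {
            switch (token) {
                case "BT": // a new text object starts at the text-space origin
                    x = 0;
                    y = 0;
                    operands.clear();
                    break;
                case "Td": // operands "tx ty": move relative to the start of the current line
                    x += Double.parseDouble(operands.get(operands.size() - 2));
                    y += Double.parseDouble(operands.get(operands.size() - 1));
                    operands.clear();
                    break;
                case "Tj": { // operand "<hex>": show the string at the current position
                    String s = operands.get(operands.size() - 1);
                    result.add(new Positioned(x, y, s.substring(1, s.length() - 1)));
                    operands.clear();
                    break;
                }
                case "Tf": // font selection does not move the position
                case "ET":
                    operands.clear();
                    break;
                default: // anything else is an operand of an upcoming operator
                    operands.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "BT 70.9 800.9 Td /F1 14 Tf <01> Tj 10.1 0 Td <02> Tj 9.3 0 Td <03> Tj ET";
        for (Positioned p : parse(content)) {
            System.out.println(p);
        }
    }
}
```

For the first three strings of the question's stream this prints 70.9 800.9 <01>, 81.0 800.9 <02> and 90.3 800.9 <03>. These are start positions in text space only, and the hex codes are glyph codes that still have to be mapped to characters through the font, which is why letting iText's parser classes do this work is preferable.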
A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition, especially the method extractText of the sample ParsingHelloWorld.java:
public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}
It makes use of the RenderListener implementation MyTextRenderListener.java:
public class MyTextRenderListener implements RenderListener
{
    [...]
    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}
While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers more information:
public LineSegment getBaseline(); // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine(); // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise() ; // the rise which represents how far above the nominal baseline the text should be rendered
public String getText(); // the text to render
public int getTextRenderMode(); // the text render mode
public DocumentFont getFont(); // the font
public float getSingleSpaceWidth(); // the width, in user space units, of a single space character in the current font
public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation
Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you are likely to need.
PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which lets you simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);
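To make the suggestion about getBaseline() and getAscentLine() concrete, here is a minimal sketch of a listener that records each text chunk together with its coordinates. It assumes iText 5.x (package com.itextpdf.text.pdf.parser) on the classpath; the class name PositionRenderListener, the extract helper and the output format are hypothetical and chosen only for this illustration.

```java
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical listener for illustration; assumes the iText 5.x parser API.
public class PositionRenderListener implements RenderListener {

    private final List<String> lines = new ArrayList<>();

    public void beginTextBlock() {
        // nothing to do at BT
    }

    public void endTextBlock() {
        // nothing to do at ET
    }

    public void renderImage(ImageRenderInfo renderInfo) {
        // images are ignored in this sketch
    }

    /** Records the text of each chunk with the ends of its baseline and the top of its ascent line. */
    public void renderText(TextRenderInfo renderInfo) {
        Vector start = renderInfo.getBaseline().getStartPoint();
        Vector end = renderInfo.getBaseline().getEndPoint();
        Vector top = renderInfo.getAscentLine().getStartPoint();
        lines.add(String.format(Locale.ROOT,
                "\"%s\" baseline (%.1f, %.1f) to (%.1f, %.1f), ascent y %.1f",
                renderInfo.getText(),
                start.get(Vector.I1), start.get(Vector.I2),
                end.get(Vector.I1), end.get(Vector.I2),
                top.get(Vector.I2)));
    }

    public List<String> getLines() {
        return lines;
    }

    /** Runs the listener over one page (1-based) of an already opened reader. */
    public static List<String> extract(PdfReader reader, int page) throws IOException {
        PositionRenderListener listener = new PositionRenderListener();
        new PdfReaderContentParser(reader).processContent(page, listener);
        return listener.getLines();
    }
}
```

Calling PositionRenderListener.extract(reader, 1) should then return one entry per text chunk on the first page. If per-glyph positions are needed, the same calls can be applied to each element of renderInfo.getCharacterRenderInfos() in iText versions that provide that method.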