问题描述
有人可以给我一个例子,说明如何使用PDFBox提取单词的坐标
Could someone give me an example of how to extract coordinates for a 'word' with PDFBox
我正在使用此链接提取单个字符的位置:
I am using this link to extract positions of individual characters:https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
我是使用此链接提取单词:
I am using this link to extract words:https://www.tutorialkart.com/pdfbox/extract-words-from-pdf-document/
我被困在获取整个单词的坐标。
I am stuck getting coordinates for whole words.
推荐答案
您可以通过收集所有构建单词的 TextPosition
对象来提取单词的坐标并结合它们的边界框。
You can extract the coordinates of words by collecting all the TextPosition
objects building a word and combining their bounding boxes.
按照您所参考的两个教程的方法来实现ed,您可以像这样扩展 PDFTextStripper
:
Implementing this along the lines of the two tutorials you referenced, you can extend PDFTextStripper
like this:
public class GetWordLocationAndSize extends PDFTextStripper {
public GetWordLocationAndSize() throws IOException {
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
String wordSeparator = getWordSeparator();
List<TextPosition> word = new ArrayList<>();
for (TextPosition text : textPositions) {
String thisChar = text.getUnicode();
if (thisChar != null) {
if (thisChar.length() >= 1) {
if (!thisChar.equals(wordSeparator)) {
word.add(text);
} else if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
}
}
if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
void printWord(List<TextPosition> word) {
Rectangle2D boundingBox = null;
StringBuilder builder = new StringBuilder();
for (TextPosition text : word) {
Rectangle2D box = new Rectangle2D.Float(text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir());
if (boundingBox == null)
boundingBox = box;
else
boundingBox.add(box);
builder.append(text.getUnicode());
}
System.out.println(builder.toString() + " [(X=" + boundingBox.getX() + ",Y=" + boundingBox.getY()
+ ") height=" + boundingBox.getHeight() + " width=" + boundingBox.getWidth() + "]");
}
}
(内部类)
(ExtractWordCoordinates inner class)
并像这样运行它:
PDDocument document = PDDocument.load(resource);
PDFTextStripper stripper = new GetWordLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 0 );
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
(测试 testExtractWordsForGoodJuJu
)
适用于apache.pdf示例,本教程使用的示例为:
Applied to the apache.pdf example the tutorials use you get:
2017-8-6 [(X=26.004425048828125,Y=22.00372314453125) height=5.833024024963379 width=36.31868362426758]
Welcome [(X=226.44479370117188,Y=22.00372314453125) height=5.833024024963379 width=36.5999755859375]
to [(X=265.5881652832031,Y=22.00372314453125) height=5.833024024963379 width=8.032623291015625]
The [(X=276.1641845703125,Y=22.00372314453125) height=5.833024024963379 width=14.881439208984375]
Apache [(X=293.5890197753906,Y=22.00372314453125) height=5.833024024963379 width=29.848846435546875]
Software [(X=325.98126220703125,Y=22.00372314453125) height=5.833024024963379 width=35.271636962890625]
Foundation! [(X=363.7962951660156,Y=22.00372314453125) height=5.833024024963379 width=47.871429443359375]
Custom [(X=334.0334777832031,Y=157.6195068359375) height=4.546705722808838 width=25.03936767578125]
Search [(X=360.8929138183594,Y=157.6195068359375) height=4.546705722808838 width=22.702728271484375]
这篇关于有人可以给我一个例子,说明如何使用PDFBox提取“单词”的坐标的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!