How to find the (x, y) position of images and text in a PDF file?

Problem description

Hello,
I have a project where I need to extract text and images from PDF pages and build a documentation database.
I can do this using VB.NET, IKVM and PDFBox. However, I still cannot get the x, y position of the text and images I am extracting.

Are there any solutions out there (other than going full Java - I am not a Java developer :-)?

Here is the code I am using to extract images (adapted from some examples in the PDFBox documentation). The problem is that ImageX and ImageY always return 0. The image's other properties (height and width) are set correctly.

Private PDF As PDDocument = Nothing
Private PDFPage As PDPage = Nothing
Private PDFPageResources As PDResources = Nothing
Private PDFPageStream As COSStream = Nothing

Private PDFDocumentPages As java.util.ArrayList = Nothing
Private ImageItem As PDXObjectImage = Nothing
Private ImageMap As java.util.Map = Nothing
Private ImageMapIterator As java.util.Iterator = Nothing

 Dim PDFEngine = New PDFStreamEngine

 PDFDocumentPages = PDF.getDocumentCatalog.getAllPages()
 PDFPage = PDFDocumentPages.get(0)
 PDFEngine.processStream(PDFPage, PDFPage.findResources, PDFPage.getContents.getStream)

 '
 ImageMap = PDFPage.getResources.getImages()
 If ImageMap IsNot Nothing Then
     Dim ImageNumber As Integer = 1
     ImageMapIterator = ImageMap.keySet.iterator
     While ImageMapIterator.hasNext()

         Dim key As String
         key = CType(ImageMapIterator.next(), String)
         ImageItem = ImageMap.get(key)

         Dim CTM As org.apache.pdfbox.util.Matrix
         CTM = PDFEngine.getGraphicsState.getCurrentTransformationMatrix()

         Dim rotationInRadians As Double = (PDFPage.findRotation * Math.PI) / 180
         Dim rotation As New java.awt.geom.AffineTransform
         rotation.setToRotation(rotationInRadians)

         Dim rotationInverse As java.awt.geom.AffineTransform = rotation.createInverse
         Dim rotationInverseMatrix As New org.apache.pdfbox.util.Matrix
         rotationInverseMatrix.setFromAffineTransform(rotationInverse)

         Dim rotationMatrix As New org.apache.pdfbox.util.Matrix
         rotationMatrix.setFromAffineTransform(rotation)

         Dim unrotatedCTM As org.apache.pdfbox.util.Matrix = CTM.multiply(rotationInverseMatrix)
         Dim xScale As Single = unrotatedCTM.getXScale()
         Dim yScale As Single = unrotatedCTM.getYScale()

         Dim ImageX As Single = unrotatedCTM.getXPosition()
         Dim imageY As Single = unrotatedCTM.getYPosition()
         Dim ImageH As Single = yScale / 100.0F * ImageItem.getHeight()
         Dim ImageW As Single = xScale / 100.0F * ImageItem.getWidth()

         ' ... code to save the image, etc.

         ImageNumber += 1
     End While
 End If

Recommended answer
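
A likely reason for the zeros: the current transformation matrix (CTM) only describes an image's placement at the moment its `Do` operator is executed. Content streams typically wrap each image in `q ... cm ... Do ... Q`, so once the whole stream has been processed the graphics state has been restored and the CTM is back to the identity, whose translation part is (0, 0). It is also possible that a `PDFStreamEngine` created without an operator configuration never applies `cm` at all; that part is an assumption about PDFBox 1.x, not something verified here. The PDFBox examples include a `PrintImageLocations` class that subclasses `PDFStreamEngine` and reads the CTM while handling `Do`, and a `PrintTextLocations` class that reads coordinates from `TextPosition` objects for text. Doing the same subclassing from VB.NET through IKVM should be possible, but it is untested here.

The sketch below does not call PDFBox. It is a self-contained VB.NET illustration of why the matrix has to be sampled at `Do` time. The `Ctm` structure and the hard-coded operator sequence are hypothetical stand-ins for PDFBox's matrix type and a page content stream.

```vbnet
Imports System
Imports System.Collections.Generic

Module CtmAtDoSketch

    ' Hypothetical stand-in for a PDF matrix [a b c d e f]; not a PDFBox type.
    ' (E, F) is the translation part, i.e. where the origin ends up.
    Structure Ctm
        Public A As Double
        Public B As Double
        Public C As Double
        Public D As Double
        Public E As Double
        Public F As Double

        Public Sub New(a0 As Double, b0 As Double, c0 As Double,
                       d0 As Double, e0 As Double, f0 As Double)
            A = a0 : B = b0 : C = c0 : D = d0 : E = e0 : F = f0
        End Sub

        ' Returns m x Me, the rule the PDF "cm" operator uses to update the CTM.
        Public Function Prepend(m As Ctm) As Ctm
            Return New Ctm(m.A * A + m.B * C,
                           m.A * B + m.B * D,
                           m.C * A + m.D * C,
                           m.C * B + m.D * D,
                           m.E * A + m.F * C + E,
                           m.E * B + m.F * D + F)
        End Function
    End Structure

    Sub Main()
        Dim current As New Ctm(1, 0, 0, 1, 0, 0)   ' identity at page start
        Dim saved As New Stack(Of Ctm)()

        ' Simulates the content stream: q  200 0 0 100 72 500 cm  /Im1 Do  Q
        saved.Push(current)                                          ' q
        current = current.Prepend(New Ctm(200, 0, 0, 100, 72, 500))  ' cm

        ' Do: the image's unit square is mapped by the CTM as it is right now.
        Dim w As Double = Math.Sqrt(current.A * current.A + current.B * current.B)
        Dim h As Double = Math.Sqrt(current.C * current.C + current.D * current.D)
        Console.WriteLine("at Do:   x={0} y={1} w={2} h={3}", current.E, current.F, w, h)

        current = saved.Pop()                                        ' Q
        Console.WriteLine("after Q: x={0} y={1}", current.E, current.F)
    End Sub

End Module
```

At `Do` the sketch reports position (72, 500) and size 200 x 100 in PDF user space, where the origin is the bottom-left corner of the page. After `Q` it reports (0, 0), which matches the symptom in the question. For a rotated or skewed image, (E, F) is still where the image's lower-left corner lands, but the bounding box has to be computed from all four transformed corners.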



07-23 11:46