Extracting text from PDF files using iText

Problem description

I need to extract text from PDF files using iText.

The problem is: some PDF files contain two columns, and when I extract the text I get a text file in which the columns are merged (i.e. text from both columns ends up on the same line).

This is the code:

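// Imports assume iText 5.x (com.itextpdf packages).
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

public class pdf
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf";
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();

        PdfImportedPage page;

        // Go through all pages and copy each one into the new document
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();
        reader.close();

        // Re-open the copy and extract its text page by page
        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN, i);
            System.out.println(myLine);

            // Append the page text to the output file
            try (FileWriter fw = new FileWriter("c:/yo.txt", true)) {
                fw.write(myLine);
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        readerN.close();
    }
}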

Could you please help me with the task?

Solution

I am the author of the iText text extraction sub-system. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can provide a pluggable strategy).

How you are going to determine where columns start and stop is entirely up to you - this is a difficult problem - PDF doesn't have any concept of columns (heck, it doesn't even have a concept of words - just putting together the text extraction that the default strategy provides is quite tricky). If you know in advance where the columns are, then you can use a region filter on the text render listener callback (there is code in the iText library for doing this, and the latest version of the iText in Action book gives a detailed example).
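To make the region-filter idea concrete, here is a minimal sketch in plain Java. It is not the iText API: `Chunk`, `Region` and `RegionFilterSketch` are hypothetical names made up for illustration, and the sketch assumes each piece of text has already been reduced to its start point. In iText 5 the corresponding pieces are, to my knowledge, a `RegionTextRenderFilter` inside a `FilteredTextRenderListener` that wraps a `LocationTextExtractionStrategy`, passed to `PdfTextExtractor.getTextFromPage(reader, pageNumber, strategy)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one piece of rendered text and its start point. */
final class Chunk {
    final String text;
    final float x;
    final float y;

    Chunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

/** Axis-aligned rectangle in PDF user space (origin at the bottom-left corner). */
final class Region {
    final float llx, lly, urx, ury;

    Region(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    boolean contains(Chunk c) {
        return c.x >= llx && c.x < urx && c.y >= lly && c.y < ury;
    }
}

final class RegionFilterSketch {
    /** Keeps only the chunks that start inside the region, in reading order. */
    static String extract(List<Chunk> chunks, Region region) {
        List<Chunk> kept = new ArrayList<Chunk>();
        for (Chunk c : chunks) {
            if (region.contains(c)) {
                kept.add(c);
            }
        }
        // Top of the page first (larger y), then left to right.
        kept.sort((a, b) -> a.y != b.y ? Float.compare(b.y, a.y) : Float.compare(a.x, b.x));

        StringBuilder sb = new StringBuilder();
        boolean first = true;
        float lastY = 0f;
        for (Chunk c : kept) {
            if (!first) {
                sb.append(c.y == lastY ? ' ' : '\n'); // same baseline: same line
            }
            sb.append(c.text);
            lastY = c.y;
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> page = new ArrayList<Chunk>();
        page.add(new Chunk("Left", 50, 700));
        page.add(new Chunk("Right", 320, 700));
        page.add(new Chunk("one", 80, 700));
        page.add(new Chunk("one", 355, 700));
        // Assumed layout: US Letter page (612 x 792 points) split down the middle.
        System.out.println(extract(page, new Region(0, 0, 306, 792)));
        System.out.println(extract(page, new Region(306, 0, 612, 792)));
    }
}
```

Running the extraction once per column rectangle and concatenating the results gives the left column followed by the right column, instead of lines that mix both.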

If you need to obtain columns from arbitrary data, you've got some algorithm work ahead of you (if you get something working, I'd love to take a look). Some ideas on how to approach this:

  1. Use an algorithm similar to that used in the default text extraction strategy (LocationAware...) to obtain a list of words and their X/Y locations (be sure to account for rotation angle as well).
  2. For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X position.
  3. While scanning, also look for words that intersect the X position (but do not start on the X position). This will give you potential locations for column start/stop Y positions on the page.
  4. Once you have column X and Y, you can resort to a region filtered approach.
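Steps 2 and 3 above could be sketched roughly as follows. This is plain Java under simplifying assumptions, not iText code: `WordBox`, `ColumnEdge` and `ColumnEdgeSketch` are hypothetical names, the word boxes are assumed to come from step 1 already, rotation is ignored, and the tolerance and minimum-support parameters are arbitrary knobs. A candidate X is accepted as a column start when enough words begin there and no word crosses it within the vertical span of those words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical word box: horizontal extent plus baseline Y. */
final class WordBox {
    final float xStart, xEnd, y;

    WordBox(float xStart, float xEnd, float y) {
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
    }
}

/** A detected column start and the vertical span of the words that support it. */
final class ColumnEdge {
    final float x, yBottom, yTop;

    ColumnEdge(float x, float yBottom, float yTop) {
        this.x = x;
        this.yBottom = yBottom;
        this.yTop = yTop;
    }
}

final class ColumnEdgeSketch {
    static List<ColumnEdge> findColumnStarts(List<WordBox> words, float tol, int minSupport) {
        // Every word start is a candidate position for the imaginary vertical line.
        TreeSet<Float> candidates = new TreeSet<Float>();
        for (WordBox w : words) {
            candidates.add(w.xStart);
        }

        List<ColumnEdge> edges = new ArrayList<ColumnEdge>();
        for (float x : candidates) {
            // Skip candidates that coincide with an edge we already accepted.
            if (!edges.isEmpty() && x - edges.get(edges.size() - 1).x <= tol) {
                continue;
            }
            // Step 2: collect the words that start on the line.
            int starts = 0;
            float yMin = Float.MAX_VALUE;
            float yMax = -Float.MAX_VALUE;
            for (WordBox w : words) {
                if (Math.abs(w.xStart - x) <= tol) {
                    starts++;
                    yMin = Math.min(yMin, w.y);
                    yMax = Math.max(yMax, w.y);
                }
            }
            if (starts < minSupport) {
                continue;
            }
            // Step 3: a word crossing the line inside that span rules the candidate out.
            boolean crossed = false;
            for (WordBox w : words) {
                boolean crosses = w.xStart < x - tol && w.xEnd > x + tol;
                if (crosses && w.y >= yMin && w.y <= yMax) {
                    crossed = true;
                    break;
                }
            }
            if (!crossed) {
                edges.add(new ColumnEdge(x, yMin, yMax));
            }
        }
        return edges;
    }
}
```

The returned X positions and Y spans can then be turned into rectangles for the region filtered approach in step 4. A real implementation would likely need something smarter than a fixed tolerance, for example to cope with indented paragraphs or centered headings.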

Another approach that may be equally feasible would be to analyze draw operations and look for long horizontal and vertical lines (assuming the columns are demarcated in a table-like format). Right now, the iText content parser doesn't have callbacks for these operations, but it would be possible to add them without major difficulty.
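As a rough illustration of the line-based idea, the sketch below picks out long vertical segments as candidate column separators. It is plain Java with hypothetical names (`Segment`, `RulingLineSketch`), and it assumes the drawn line segments have already been collected somehow; since the content parser has no callbacks for draw operations, that collection step is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical straight line segment in PDF user space. */
final class Segment {
    final float x1, y1, x2, y2;

    Segment(float x1, float y1, float x2, float y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }
}

final class RulingLineSketch {
    /** Returns the sorted X positions of (near-)vertical segments of at least minLength. */
    static List<Float> verticalSeparators(List<Segment> segments, float minLength, float tol) {
        List<Float> xs = new ArrayList<Float>();
        for (Segment s : segments) {
            boolean vertical = Math.abs(s.x1 - s.x2) <= tol;
            float length = Math.abs(s.y2 - s.y1);
            if (vertical && length >= minLength) {
                xs.add((s.x1 + s.x2) / 2f);
            }
        }
        Collections.sort(xs);
        return xs;
    }
}
```

Each separator X, together with the page margins, would give the left and right bounds of a column rectangle; long horizontal segments could be handled the same way to find the top and bottom bounds.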

