I need to extract text from PDF files using iText.
The problem is that some PDF files contain two columns, and when I extract the text, the columns are merged in the output (i.e. text from both columns ends up on the same line).
This is the code:
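// Imports assume iText 5.x (com.itextpdf packages).
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class pdf
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf";
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();
        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;
        // Go through all pages and copy each one into the new document
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }
        document.close();
        // Extract the text of each page of the copy and append it to a text file
        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN, i);
            System.out.println(myLine);
            try {
                FileWriter fw = new FileWriter("c:/yo.txt", true);
                fw.write(myLine);
                fw.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
}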
Could you please help me with the task?
I am the author of the iText text extraction sub-system. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can provide a pluggable strategy).
How you determine where columns start and stop is entirely up to you. This is a difficult problem: PDF doesn't have any concept of columns (heck, it doesn't even have a concept of words; just putting together the text extraction that the default strategy provides is quite tricky). If you know in advance where the columns are, then you can use a region filter on the text render listener callback (there is code in the iText library for doing this, and the latest version of the iText in Action book gives a detailed example).
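To make the region-filter idea concrete, here is a minimal, library-independent sketch in plain Java. It does not call iText: the Chunk and Region classes and the extract/extractColumns methods are hypothetical names invented for illustration, and the sketch assumes that chunks on the same text line share exactly the same y value. In iText 5 the corresponding building blocks appear to be RegionTextRenderFilter wrapped in a FilteredTextRenderListener, passed as the strategy argument to PdfTextExtractor.getTextFromPage.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Library-independent model of region-filtered extraction; every name here is hypothetical. */
class RegionFilterSketch {

    /** A piece of text and the start point of its baseline, in PDF user space (y grows upward). */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Axis-aligned rectangle; a chunk is accepted when its start point lies inside it. */
    static final class Region {
        final float left, bottom, right, top;

        Region(float left, float bottom, float right, float top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }

        boolean contains(Chunk c) {
            return c.x >= left && c.x < right && c.y >= bottom && c.y < top;
        }
    }

    /** Keeps only the chunks inside the region and reads them top to bottom, left to right. */
    static String extract(List<Chunk> chunks, Region region) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (region.contains(c)) {
                kept.add(c);
            }
        }
        kept.sort(Comparator.comparingDouble((Chunk c) -> -c.y).thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        float lastY = Float.NaN;
        for (Chunk c : kept) {
            if (sb.length() > 0) {
                // Assumption: chunks on one line have identical y; real data needs a tolerance.
                sb.append(c.y == lastY ? ' ' : '\n');
            }
            sb.append(c.text);
            lastY = c.y;
        }
        return sb.toString();
    }

    /** Extracts each column region in turn, so text from different columns is never interleaved. */
    static String extractColumns(List<Chunk> chunks, List<Region> columns) {
        StringBuilder sb = new StringBuilder();
        for (Region column : columns) {
            String text = extract(chunks, column);
            if (text.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(text);
        }
        return sb.toString();
    }
}
```

Calling extract with a single region that covers the whole page reproduces the merged output described in the question, while extractColumns with one region per column returns the left column in full before the right one.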
If you need to obtain columns from arbitrary data, you've got some algorithm work ahead of you (if you get something working, I'd love to take a look). Some ideas on how to approach this:
- Use an algorithm similar to that used in the default text extraction strategy (LocationAware...) to obtain a list of words and X/Y locations (be sure to account for rotation angle as well)
- For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X position.
- While scanning, also look for words that intersect the X position (but do not start on the X position). This will give you potential locations for column start/stop Y positions on the page.
- Once you have column X and Y, you can resort to a region-filtered approach.
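The scan described in the list above can be sketched roughly as follows. This is a simplified illustration in plain Java, not code from iText: the Word class, the columnStarts method, and the tolerance and minStarts parameters are all hypothetical. It assumes unrotated text and columns that run the full height of the page, so it only finds column X positions and skips the Y start/stop step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Heuristic search for column left edges; every name here is hypothetical. */
class ColumnEdgeSketch {

    /** Horizontal extent of one word and the y position of its baseline. */
    static final class Word {
        final float xStart;
        final float xEnd;
        final float y;

        Word(float xStart, float xEnd, float y) {
            this.xStart = xStart;
            this.xEnd = xEnd;
            this.y = y;
        }
    }

    /**
     * Returns X positions where at least minStarts words begin (within tolerance)
     * and which no other word crosses, in ascending order.
     */
    static List<Float> columnStarts(List<Word> words, float tolerance, int minStarts) {
        // Every word start is a candidate for the imaginary vertical line.
        TreeSet<Float> candidates = new TreeSet<>();
        for (Word w : words) {
            candidates.add(w.xStart);
        }
        List<Float> result = new ArrayList<>();
        for (float x : candidates) {
            int starts = 0;
            int crossings = 0;
            for (Word w : words) {
                if (Math.abs(w.xStart - x) <= tolerance) {
                    starts++;          // word begins on the line
                } else if (w.xStart < x && w.xEnd > x) {
                    crossings++;       // word intersects the line without starting on it
                }
            }
            // Skip candidates that merely repeat the previously accepted edge.
            boolean nearPrevious = !result.isEmpty()
                    && x - result.get(result.size() - 1) <= tolerance;
            if (starts >= minStarts && crossings == 0 && !nearPrevious) {
                result.add(x);
            }
        }
        return result;
    }
}
```

A heading that spans both columns would cross the right-hand edge and cause this version to reject it. Handling that case means using the Y positions of the crossing words to split the page into bands, as the third step in the list suggests.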
Another approach that may be equally feasible would be to analyze draw operations and look for long horizontal and vertical lines (assuming the columns are demarcated in a table-like format). Right now, the iText content parser doesn't have callbacks for these operations, but it would be possible to add them without major difficulty.
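As a rough illustration of the line-based idea, the sketch below picks out long vertical rules from a list of line segments. Since the content parser has no callbacks for draw operations, the segments are assumed to come from some other source. The Segment class, the verticalSeparators method, and the minFraction and slack parameters are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Finds long vertical rules that may separate columns; every name here is hypothetical. */
class RuleLineSketch {

    /** A straight line segment in page coordinates. */
    static final class Segment {
        final float x1, y1, x2, y2;

        Segment(float x1, float y1, float x2, float y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Returns the X positions, in ascending order, of near-vertical segments
     * whose length is at least minFraction of the page height.
     */
    static List<Float> verticalSeparators(List<Segment> segments, float pageHeight,
                                          float minFraction, float slack) {
        List<Float> xs = new ArrayList<>();
        for (Segment s : segments) {
            boolean vertical = Math.abs(s.x1 - s.x2) <= slack;
            float length = Math.abs(s.y1 - s.y2);
            if (vertical && length >= minFraction * pageHeight) {
                xs.add((s.x1 + s.x2) / 2f);
            }
        }
        Collections.sort(xs);
        return xs;
    }
}
```

Each returned X position could then serve as the boundary between two regions in a region-filtered extraction. Long horizontal lines could be found the same way by swapping the roles of x and y.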