Problem description
I am trying to convert a .doc file to PDF using Apache POI, but the resulting PDF contains only plain text; none of the formatting (images, tables, alignment, etc.) is preserved.
How can I convert a .doc file to PDF and keep all the formatting, such as tables, images, and alignment?
Here is my code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class demo {
    public static void main(String[] args) {
        POIFSFileSystem fs = null;
        Document document = new Document();
        try {
            System.out.println("Starting the test");
            fs = new POIFSFileSystem(new FileInputStream("Resume.doc"));
            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);
            OutputStream file = new FileOutputStream(new File("test.pdf"));
            PdfWriter writer = PdfWriter.getInstance(document, file);
            Range range = doc.getRange();
            document.open();
            writer.setPageEmpty(true);
            document.newPage();
            writer.setPageEmpty(true);
            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
                paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                System.out.println("Length:" + paragraphs[i].length());
                System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
                // add the paragraph to the document
                document.add(new Paragraph(paragraphs[i]));
            }
            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        } finally {
            // close the document
            document.close();
        }
    }
}
The task at hand is converting doc to PDF while keeping all formatting such as tables, images, and alignment.
Creating your own converter class
There already are WordToXxxConverter classes in Apache POI, namely WordToFoConverter, WordToHtmlConverter, and WordToTextConverter. The last one is most likely too lossy to serve as an example for your requirements, but the first two are adequate.
All these converter classes are derived from the common base class AbstractWordConverter, which provides a basic framework for Word conversion classes. Furthermore, each of them uses a matching *DocumentFacade class that wraps the creation of the concrete target (or some intermediate) format: FoDocumentFacade, HtmlDocumentFacade, or TextDocumentFacade.
To implement your task, therefore, you should also derive a converter class from AbstractWordConverter and, when implementing the abstract methods, take inspiration from the three concrete implementation classes. Just like in the other converter classes, concentrating the PDF-library-specific code in a PdfDocumentFacade class seems like a good idea.
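To illustrate the facade idea, here is a minimal sketch in plain Java. Everything in it is hypothetical: the PdfDocumentFacade interface, its method names, and the RecordingPdfFacade stand-in are invented for illustration and are not part of Apache POI or iText. A real implementation would replace the recording class with one that calls your PDF library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical facade: the only place that would touch a PDF library. */
interface PdfDocumentFacade {
    void addParagraph(String text, String alignment);
    void beginTable(int columns);
    void addCell(String text);
    void endTable();
}

/** Stand-in implementation that records calls instead of writing a PDF. */
class RecordingPdfFacade implements PdfDocumentFacade {
    final List<String> calls = new ArrayList<>();

    @Override public void addParagraph(String text, String alignment) {
        calls.add("paragraph[" + alignment + "]: " + text);
    }
    @Override public void beginTable(int columns) { calls.add("beginTable(" + columns + ")"); }
    @Override public void addCell(String text) { calls.add("cell: " + text); }
    @Override public void endTable() { calls.add("endTable"); }
}

/** Converter-side code: knows about document structure, not about PDF. */
class FacadeDemo {
    static List<String> convert(String[][] tableRows, String heading) {
        RecordingPdfFacade facade = new RecordingPdfFacade();
        facade.addParagraph(heading, "center");
        facade.beginTable(tableRows[0].length);
        for (String[] row : tableRows) {
            for (String cell : row) {
                facade.addCell(cell);
            }
        }
        facade.endTable();
        return facade.calls;
    }

    public static void main(String[] args) {
        // Print the sequence of facade calls for a small 2x2 table.
        for (String call : convert(new String[][] {{"a", "b"}, {"c", "d"}}, "Title")) {
            System.out.println(call);
        }
    }
}
```

The point of the split is that swapping iText for another PDF library would then only require a new facade implementation, while the converter logic stays untouched.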
If you want to start simple and add the more complex details later, you might start by reusing much of the WordToTextConverter implementation code and, as soon as that works at least at a proof-of-concept level, extend the functionality to cover more and more of the formatting information.
Unfortunately this converter framework is somewhat DOM-element-centric: the AbstractWordConverter callbacks expect and forward DOM elements as indicators of the current target document context. At first glance, though, the base class does not seem to rely on that context actually being a DOM element, so you might get away with copying the base class and replacing those DOM element parameters with a more appropriate type or, even better, a generic class parameter.
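A rough sketch of what the generic-parameter variant could look like, in plain Java. The class names GenericWordConverter, PageContext, and PdfLikeConverter are made up for this example, and the single callback is a drastic simplification; the real AbstractWordConverter has many more callbacks with different signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical variant of a converter base class: the "current context"
 * is a type parameter C instead of an org.w3c.dom.Element.
 */
abstract class GenericWordConverter<C> {
    /** Walks the paragraphs and forwards each one with the current context. */
    final void processParagraphs(List<String> paragraphs, C context) {
        for (String text : paragraphs) {
            processParagraph(context, text);
        }
    }

    /** Callback implemented once per target format. */
    protected abstract void processParagraph(C context, String text);
}

/** Hypothetical PDF-side context: here just the lines of one page. */
class PageContext {
    final List<String> lines = new ArrayList<>();
}

/** Concrete converter whose context is a PageContext, not a DOM element. */
class PdfLikeConverter extends GenericWordConverter<PageContext> {
    @Override
    protected void processParagraph(PageContext page, String text) {
        // Strip surrounding whitespace, including trailing paragraph marks.
        page.lines.add(text.trim());
    }

    public static void main(String[] args) {
        PageContext page = new PageContext();
        new PdfLikeConverter().processParagraphs(Arrays.asList(" first ", "second\r"), page);
        System.out.println(page.lines);
    }
}
```

With this shape, the base class never needs to know what the context is; only the subclass for a given target format does.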
Using existing Word-to-XXX converters in combination with existing XXX-to-PDF converters
If this seems too complex or time-consuming for your resources, you might try a different approach: use the output of one of the existing converters mentioned above as the input for another conversion to PDF.
Using existing conversion classes will produce results sooner, but multi-step conversions tend to be more lossy than single-step ones. The decision is up to you.
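The glue between the two steps is mostly a matter of serializing the intermediate result. The POI converters build an org.w3c.dom.Document, which you would obtain from getDocument() after calling processDocument(...). The sketch below uses only JDK classes and a hand-built stand-in document so that it runs without POI; the class name DomToString and its helper methods are made up for this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToString {
    /** Serializes a DOM document; method is "xml" or "html". */
    static String serialize(Document doc, String method) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, method);
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    /** Builds a tiny stand-in for the document a POI converter would produce. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("Hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        // In a real pipeline the Document would come from the POI converter.
        System.out.println(serialize(sampleDocument(), "xml"));
    }
}
```

The "xml" output method keeps every tag closed, which is probably the safer choice if the second step is an XML-based parser; that is an assumption about the downstream tool, so check what your XXX-to-PDF converter expects.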
In the code you posted in your question you used iText classes. iText does support conversion from HTML to PDF, with certain limitations, using the XMLWorker provided in the iText XML Worker sub-project. Ancient iText versions also had the now-deprecated HTMLWorker class. Thus, using the WordToHtmlConverter in combination with the iText XMLWorker may be an option for you.
Alternatively, Apache also provides XSL-FO processing to PDF (Apache FOP). Applying this to the output of WordToFoConverter may also be an option.