Problem description
I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
  <title>Colossal (movie)</title>
  <style>
    .poster { width: 120px; float: right; }
    .director { font-style: italic; }
    .description { font-family: serif; }
    .imdb { font-size: 0.8em; }
    a { color: red; }
  </style>
</head>
<body>
  <img src="img/colossal.jpg" class="poster" />
  <h1>Colossal (2016)</h1>
  <div class="director">Directed by Nacho Vigalondo</div>
  <div class="description">Gloria is an out-of-work party girl forced to leave her life in New York City, and move back home.
    When reports surface that a giant creature is destroying Seoul, she gradually comes to the realization that she is somehow
    connected to this phenomenon.
  </div>
  <div class="imdb">Read more about this movie on
    <a href="www.imdb.com/title/tt4680182">IMDB</a>
  </div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I use HTMLWorker, I need to create an ImageProvider to avoid an error telling me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
    public Image getImage(String src, Map<String, String> h,
            ChainedProperties cprops, DocListener doc) {
        try {
            return Image.getInstance(
                String.format("resources/html/img/%s",
                    src.substring(src.lastIndexOf("/") + 1)));
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

public static void main(String[] args) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
    document.open();
    StyleSheet styles = new StyleSheet();
    styles.loadStyle("imdb", "size", "-3");
    HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
    HashMap<String, Object> providers = new HashMap<String, Object>();
    providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
    htmlWorker.setProviders(providers);
    htmlWorker.parse(new FileReader("resources/html/sample.html"));
    document.close();
}
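The path handling inside MyImageFactory can be checked on its own, without iText. The sketch below copies only the string expression from getImage: it keeps the file name from the src attribute and prepends a fixed directory. The class name ImagePathSketch and the method name resolve are made up for this illustration; they are not part of any iText API.

```java
public class ImagePathSketch {

    // Same expression as in MyImageFactory.getImage: drop everything up to the
    // last '/', then prefix the fixed image directory. If src contains no '/',
    // lastIndexOf returns -1 and the whole string is kept.
    public static String resolve(String src) {
        String fileName = src.substring(src.lastIndexOf("/") + 1);
        return String.format("resources/html/img/%s", fileName);
    }

    public static void main(String[] args) {
        // The src value used in the sample HTML.
        System.out.println(resolve("img/colossal.jpg"));
        // A bare file name resolves to the same location.
        System.out.println(resolve("colossal.jpg"));
    }
}
```

Note that any directory part of src is discarded, so two images with the same file name in different folders would map to the same file.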
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all; I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        new FileInputStream(HTML));
    document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    CSSResolver cssResolver =
        XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return IMG_PATH;
        }
    });
    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));
    document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker, which tells me that the CSS width attribute of the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker was deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content, which is why the text of the <title> tag shows up in the PDF. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
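To make the difference between a complete page and a snippet concrete, here is a minimal sketch in plain Java (no iText involved). It pulls the markup between <body> and </body> out of a page with a naive regular expression, so the content of <head>, including <title>, is left out. The class name BodySnippetSketch and the method extractBody are hypothetical names used only for this illustration, and a regular expression is not a real HTML parser; the sketch only shows what "a small HTML snippet" means here and is not a reason to keep using HTMLWorker.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodySnippetSketch {

    // Case-insensitive, dot matches newlines; captures everything between the
    // opening <body ...> tag and the last </body>.
    private static final Pattern BODY =
        Pattern.compile("(?is)<body\\b[^>]*>(.*)</body>");

    // Returns the trimmed body content, or the input unchanged when no
    // body element is found (the input is then assumed to be a snippet already).
    public static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Colossal (movie)</title></head>"
            + "<body><h1>Colossal (2016)</h1></body></html>";
        // Prints only the heading markup; the title is not part of the result.
        System.out.println(extractBody(page));
    }
}
```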
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers are often reluctant to upgrade to a newer iText version when I advise them to move to iText 7 / pdfHTML 2. Allow me to answer the three arguments I hear most often:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released under the AGPL, just like iText 5 and XML Worker. The AGPL allows free use, in the sense of free of charge, in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; it is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all.
Regarding pdfHTML: the first versions were indeed only available as closed source software. We had heated discussions within iText Group. On the one hand, there were the people who wanted to avoid massive abuse by companies that don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is, the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.