Problem description
I'm using JTidy v. r938. I'm using this code to attempt to clean up a page …
final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);
But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
remain as
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
instead of having a "</META>" tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirmed this by outputting the resulting JTidy org.w3c.dom.Document as a String.
What can I do to make JTidy truly clean up the page -- i.e. make it well-formed? I realize there are other tools out there, but this question specifically relates to using JTidy.
You need to set several flags on Tidy if you want XML output:
private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    // Effectively disable line wrapping
    tidy.setWraplen(Integer.MAX_VALUE);
    // Emit only the contents of <body>
    tidy.setPrintBodyOnly(true);
    // Write the result as well-formed XML
    tidy.setXmlOut(true);
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM also writes the cleaned markup to outputStream
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}
Or, if you simply want XHTML output:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
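One more thing that may be worth checking, since the question verifies the result by printing the org.w3c.dom.Document as a String: a DOM tree does not store closing tags, so whether the META element shows up as "<META ...>" or "<META .../>" is decided by whatever serializes the tree. The sketch below uses only JDK classes and a small hand-built document as a stand-in for the one returned by tidy.parseDOM(...). The class and method names (DomToXml, toXmlString, sampleDocument) are made up for illustration, and I have not verified that JTidy's own Document implementation works with a JDK Transformer; tidy.pprint(document, out) is probably the safer route for a JTidy document.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of JTidy: serializes a DOM Document as XML.
class DomToXml {

    static String toXmlString(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Ask for the XML output method explicitly. With an <html> root element the
            // default may be "html", which writes <meta ...> without a closing slash.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Builds a tiny stand-in document (assumption: replaces the JTidy result).
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element head = doc.createElement("head");
            Element meta = doc.createElement("meta");
            meta.setAttribute("http-equiv", "Content-Type");
            meta.setAttribute("content", "text/html; charset=UTF-8");
            head.appendChild(meta);
            html.appendChild(head);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXmlString(sampleDocument()));
    }
}
```

If the string you print still shows an unclosed META after switching to XML or XHTML output in Tidy, the serializer is the next place to look.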