Problem description
I was trying to parse an HTML document when I ran into the following case. I have put the content in a string in the code below. It has a P tag inside an anchor tag. When parsed with Jsoup, an extra </a> tag and an extra <a> tag are inserted near #Item1, which changes the HTML structure.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) {
        // A <p> nested inside <a>; the <p> is never closed.
        String html = "<A HREF=\"#Item1\">\n"
                + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
                + "<FONT SIZE=2>Item 1.</FONT>\n"
                + "</A>";
        Document doc = Jsoup.parse(html);
        System.out.println("UNPARSED = \n" + html);
        System.out.println("JSOUP PARSED = \n" + doc.toString());
    }
}
Output
UNPARSED =
<A HREF="#Item1">
<p style="font-family:times;margin-top:12pt;margin-left:0pt;">
<FONT SIZE=2>Item 1.</FONT>
</A>
JSOUP PARSED =
<html>
<head></head>
<body>
<a href="#Item1"> </a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a> <font size="2">Item 1.</font> </a></p>
</body>
</html>
Is there any way to avoid the automatic tag completion in Jsoup? Thank you.
Recommended answer
- Update!!
There is a great solution to this problem.
Parsing this way:
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
gives:
<a href="#Item1"> <p style="font-family:times;margin-top:12pt;margin-left:0pt;"> <font size="2">Item 1.</font> </p></a>
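As a minimal sketch of the same idea as a complete program (assuming jsoup is on the classpath; the class name `XmlParserDemo` and the helper `parseAsXml` are made up for illustration), the following checks that the P element stays inside the anchor when the XML parser is used:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XmlParserDemo {
    // Same markup as in the question.
    static final String HTML = "<A HREF=\"#Item1\">\n"
            + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
            + "<FONT SIZE=2>Item 1.</FONT>\n"
            + "</A>";

    // Hypothetical helper: parse with jsoup's XML parser, which does not
    // apply the HTML nesting rules that split the anchor.
    static Document parseAsXml(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document doc = parseAsXml(HTML);
        Element p = doc.select("p").first();
        // Tag-name case in the result may differ between jsoup versions,
        // so compare names case-insensitively.
        System.out.println("parent of <p>: " + p.parent().tagName());
        System.out.println("number of anchors: " + doc.select("a").size());
    }
}
```

Note that the XML parser also skips other HTML conveniences, such as the implied html, head and body elements, so this is a trade-off rather than a switch that only disables tag completion.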
Thanks @user2784201!
- Old response:
I'm not sure whether what you are asking for is possible, but I think it goes against the JSoup philosophy of parsing HTML as closely as possible to the way a browser does.
Note that browsers will also close that A tag. I think this is because in HTML4 putting a P inside an A was forbidden. See https://stackoverflow.com/a/1828032/3324704.
By the way, I think you are using an old version of JSoup. If you use 1.8.1, you will see that the inner A tag (a spurious tag put there by JSoup, and also by browsers) keeps the href. This may help you in your parsing. See the output of JSoup 1.8.1 (note the inner <a href="#Item1">):
JSOUP PARSED =
<!DOCTYPE html>
<html>
<head></head>
<body>
<a href="#Item1"> </a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1"> <font size="2">Item 1.</font> </a></p>
</body>
</html>
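One way to make use of the retained href, sketched here under the assumption that jsoup 1.8.1 or later is on the classpath and that the parsed tree looks like the output above (the class name `InnerAnchorDemo` and the helper `innerAnchors` are hypothetical), is to select the inner anchors directly:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class InnerAnchorDemo {
    // Same markup as in the question.
    static final String HTML = "<A HREF=\"#Item1\">\n"
            + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
            + "<FONT SIZE=2>Item 1.</FONT>\n"
            + "</A>";

    // Hypothetical helper: anchors that the HTML parser re-created inside a <p>.
    static Elements innerAnchors(String markup) {
        Document doc = Jsoup.parse(markup);
        return doc.select("p > a[href]");
    }

    public static void main(String[] args) {
        for (Element a : innerAnchors(HTML)) {
            // Each inner anchor carries the href copied from the original one.
            System.out.println(a.attr("href") + " -> " + a.text());
        }
    }
}
```

This keeps the browser-like tree and simply reads the link target and text from the anchor that ended up inside the paragraph.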
Furthermore, I've tried other libraries. HtmlCleaner reports an error (a - UnpermittedChild) and gives very similar output:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body><a href="#Item1">
</a><p style="font-family:times;margin-top:12pt;margin-left:0pt;"><a href="#Item1">
<font size="2">Item 1.</font>
</a></p></body></html>
And JTidy says:
Warning: missing </a> before <p>
and gives:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<title></title>
</head>
<body>
<a href="#Item1"></a>
<p style="font-family:times;margin-top:12pt;margin-left:0pt;"><font
size="2">Item 1.</font> </p>
</body>
</html>
Maybe you could use a regular XML parser...
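A caveat on that idea, as a rough sketch using only the JDK's built-in DOM parser (the class name `StrictXmlCheck` and the helper `isWellFormedXml` are made up for illustration): a strict XML parser rejects the markup from the question as it stands, because the attribute value in SIZE=2 is unquoted and the P element is never closed. The input would need to be cleaned up first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlCheck {
    // Same markup as in the question.
    static final String HTML = "<A HREF=\"#Item1\">\n"
            + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
            + "<FONT SIZE=2>Item 1.</FONT>\n"
            + "</A>";

    // Hypothetical helper: true if the JDK's XML parser accepts the markup.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("question markup well-formed: " + isWellFormedXml(HTML));
        String cleaned = "<a href=\"#Item1\"><p><font size=\"2\">Item 1.</font></p></a>";
        System.out.println("cleaned markup well-formed: " + isWellFormedXml(cleaned));
    }
}
```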
Sorry for the verbosity and the unsatisfactory response :(