Problem description
When I parse local HTML files, jsoup changes the quotes inside an anchor element to &quot;, obscuring my HTML.
Let's assume I want to change the value "one" to "two" in the following HTML part:
<div class="pg2-txt1">
<a class="foo" appareantly_a_javascript_statement='{"targetId":"pg1-magn1", "ordinal":1}'>one</a>
</div>
What I get is:
<div class="pg2-txt1">
<a class="foo" appareantly_a_javascript_statement="{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}">two</a>
</div>
The quotes inside the anchor element are needed. My code currently looks like this:
File input = new File("D:/javatest/page02.html");
Document doc = Jsoup.parse(input, "UTF-8");
// The anchor element is only identifiable by the class of its parent <div>.
Element div = doc.select("div.pg2-txt1").first();
div.child(0).text("two"); // child(0) is the actual anchor element
I tried
doc.outputSettings().prettyPrint(false);
without success.
Can I achieve this with jsoup? Or do I have to use a different parser, and what would that look like?
Thank you very much.
Recommended answer
According to the HTML spec, jsoup behaves totally fine:
Note the last sentence!
Basically, that means that your other software, the one that needs the double quotes in the appareantly_a_javascript_statement attribute, is doing some incomplete parsing of its value.
I see two solutions:
1) Modify the function that interprets the value of appareantly_a_javascript_statement
I can't help you there, since I have no knowledge of where it is done.
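As a rough illustration of what option 1 could involve, here is a minimal sketch that assumes the consuming code sees the raw attribute text without decoding character references. The class and method names (`AttributeValueDecoder`, `decodeAttributeValue`) are hypothetical and not part of jsoup or of the asker's software; the sketch only covers the four references shown.

```java
class AttributeValueDecoder {

    // Hypothetical helper: decodes a few character references that may
    // appear inside a double-quoted attribute value.
    static String decodeAttributeValue(String raw) {
        return raw.replace("&quot;", "\"")
                  .replace("&lt;", "<")
                  .replace("&gt;", ">")
                  .replace("&amp;", "&"); // last, so "&amp;quot;" decodes to "&quot;"
    }

    public static void main(String[] args) {
        String raw = "{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}";
        // prints {"targetId":"pg1-magn1", "ordinal":1}
        System.out.println(decodeAttributeValue(raw));
    }
}
```

If the consumer runs in a browser and reads the attribute through the DOM, this decoding should already have happened, so a step like this would only matter for code that works on the raw markup.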
2) Change the jsoup output via regular expressions.
This is pretty hacky...
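String html = doc.outerHtml();
// Re-delimit attribute values that start with "{ using single quotes.
html = html.replaceAll("(=\"\\{)([^\"]+)(\")", "='{$2'");
boolean changed;
do {
    int oldLength = html.length();
    // Each pass turns one &quot; per single-quoted attribute back into a literal ".
    html = html.replaceAll("(=')([^']+)(&quot;)([^']+)(')", "$1$2\"$4$5");
    changed = html.length() != oldLength;
} while (changed);
System.out.print(html);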
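To try the regex approach without jsoup on the classpath, the same two steps can be wrapped in a helper and applied to a hard-coded string. This is only a sketch: the class name `JsoupQuoteWorkaround` and the method `restoreQuotes` are made up for illustration, and the sample input is assumed to match what jsoup serializes for the anchor in the question.

```java
class JsoupQuoteWorkaround {

    // Hypothetical helper wrapping the two regex steps described above.
    static String restoreQuotes(String html) {
        // Step 1: switch attribute values starting with "{ to single-quote delimiters.
        html = html.replaceAll("=\"\\{([^\"]+)\"", "='{$1'");
        // Step 2: each pass restores one &quot; per attribute; loop until nothing changes.
        String previous;
        do {
            previous = html;
            html = html.replaceAll("(=')([^']+)&quot;([^']+')", "$1$2\"$3");
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        String serialized = "<a class=\"foo\" appareantly_a_javascript_statement="
                + "\"{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}\">two</a>";
        System.out.println(restoreQuotes(serialized));
    }
}
```

For the sample input this should print the anchor with the attribute in single quotes and the inner double quotes restored. An unescaped apostrophe inside the attribute value would end the single-quoted attribute early, so the workaround is only suitable when the values are known not to contain one.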