I'm stripping unwanted HTML tags (such as <script>) out of some text by using
String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());
The problem is that it replaces, for instance, å with &aring; (which causes trouble for me, since the result is not "pure XML").
For example
Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())
yields
"hello &aring; world"
but I would like
"hello å world"
Is there a simple way to achieve this? (I.e., simpler than converting &aring; back to å in the result.)
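You can configure Jsoup's escaping mode: using EscapeMode.xhtml will give you output without named entities such as &aring; (only the characters XML itself requires to be escaped, such as < and &, are still escaped).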
Here's a complete snippet that accepts str as input and cleans it using Whitelist.simpleText():
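import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Parse str into a Document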
Document doc = Jsoup.parse(str);
// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);
// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);
// Get back the string of the body.
str = doc.body().html();
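
For comparison, the fallback mentioned in the question (mapping entities back to characters after cleaning) could look roughly like the sketch below. It uses only the Java standard library; the class name EntityDecoder, the decode method, and the three-entry entity table are hypothetical and exist only for illustration. HTML defines far more named entities than this, which is one reason changing the escape mode is the simpler route.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Hypothetical, deliberately tiny table; a real one would need every named entity.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("aring", "\u00e5"); // å
        NAMED.put("auml", "\u00e4");  // ä
        NAMED.put("ouml", "\u00f6");  // ö
    }

    // Matches named entities such as &aring; (entity names are case-sensitive).
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replaces the entities listed in NAMED and leaves everything else untouched,
    // so XML-required escapes such as &amp; and &lt; survive.
    public static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = NAMED.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("hello &aring; world"));
    }
}
```

Entities that are not in the table are copied through unchanged, so the result stays well-formed even when the table is incomplete.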