Problem description
I'm using Jsoup to remove all the images from an HTML page. I receive the page through an HTTP response, which also specifies the content charset.
The problem is that Jsoup unescapes some special characters.
For example, for the input:
<html><head></head><body><p>isn&rsquo;t</p></body></html>
After running:
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());
I get:
<html><head></head><body><p>isn’t</p></body></html>
I want to avoid changing the HTML in any way other than removing the images.
Using the command
doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);
I do get the correct output, but I'm sure there are cases where that charset won't be suitable. I just want to use the charset specified in the HTTP header, and I'm afraid this will change my document in ways I can't predict.
Is there any other, cleaner method for removing the images without inadvertently changing anything else?
Thanks!
Recommended answer
Here is a workaround that does not involve any charset except the one specified in the HTTP header.
// Hide every entity from the parser by rewriting "&name;" as "**name;".
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
// Turn "**name;" back into "&name;" after serialization.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));
OUTPUT
<html><head></head><body><p>isn&rsquo;t</p></body></html>
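The regex in the workaround is fairly loose: it rewrites anything between an "&" and the next ";", and "**" could already occur in the page text. A slightly stricter variant of the same idea might look like the sketch below. EntityGuard and its "@@ent@@" marker are hypothetical names, not part of Jsoup, and the sketch assumes the marker never appears in the real document. The Jsoup call itself is left as a comment so the snippet runs with the JDK alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): hides entity references from the
// parser before parsing and restores them after serialization.
final class EntityGuard {
    // Matches only well-formed references such as &rsquo; &#151; &#xAB;
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: this marker never occurs in the real document.
    private static final String MARK = "@@ent@@";

    private static final Pattern GUARDED =
            Pattern.compile(Pattern.quote(MARK) + "([#A-Za-z0-9]+);");

    // Replaces "&name;" with "@@ent@@name;" so the parser sees plain text.
    static String protect(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1;");
    }

    // Turns "@@ent@@name;" back into "&name;".
    static String restore(String html) {
        return GUARDED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String in = "<p>isn&rsquo;t &#151; &#xAB;</p>";
        String guarded = protect(in);
        // Jsoup.parse(guarded), image removal and doc.outerHtml() would go here.
        System.out.println(restore(guarded).equals(in)); // prints true
    }
}
```

A bare ampersand such as the one in "a & b; c" is left untouched by protect, which the original pattern would have rewritten.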
Discussion
I was hoping for a solution in Jsoup's API - @dlv
Using Jsoup's API would require you to write a custom NodeVisitor. That would mean (re)inventing some code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape code instead of the Unicode character.
Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode ’ (the character that &rsquo; stands for) directly. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
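To illustrate the point about the encoder, here is a minimal sketch using only java.nio. It is not Jsoup's code, and EncodableCheck is a hypothetical name. It escapes a character only when the target charset cannot represent it, which is why the ASCII setting from the question brings an escape back while UTF-8 does not. Markup characters such as & and < are deliberately ignored here.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Jsoup's implementation.
final class EncodableCheck {
    // Copies characters the charset can represent; writes the rest as
    // hexadecimal numeric character references.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "isn\u2019t"; // U+2019 is the right single quotation mark
        System.out.println(escapeUnencodable(text, StandardCharsets.US_ASCII)); // isn&#x2019;t
        System.out.println(escapeUnencodable(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```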
Either of the two options above represents a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
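As a rough picture of what such a switch could offer, the sketch below renders one code point in each of the four styles. Everything in it is hypothetical: Jsoup has no such EscapeStyles class or Style enum, and the name table is a two-entry stand-in for the full HTML entity list.

```java
import java.util.Map;

// Hypothetical sketch of the proposed enhancement; not an existing Jsoup API.
final class EscapeStyles {
    enum Style { HEX, DECIMAL, NAMED, RAW }

    // Tiny illustrative table; a real one would cover all HTML named entities.
    private static final Map<Integer, String> NAMES =
            Map.of(0x2019, "rsquo", 0x2014, "mdash");

    static String render(int codePoint, Style style) {
        switch (style) {
            case HEX:
                return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
            case DECIMAL:
                return "&#" + codePoint + ";";
            case NAMED: {
                // Fall back to a decimal escape when no name is known.
                String name = NAMES.get(codePoint);
                return name != null ? "&" + name + ";" : "&#" + codePoint + ";";
            }
            default:
                // RAW: write the character itself.
                return new String(Character.toChars(codePoint));
        }
    }

    public static void main(String[] args) {
        for (Style style : Style.values()) {
            System.out.println(style + " -> " + render(0x2019, style));
        }
    }
}
```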