Handling special entities such as &nbsp; and &pound; in HtmlCleaner

Question

I am using the HtmlCleaner library for HTML content extraction. It works fairly well, but with a few limitations.

It is not able to handle special characters such as &pound; or quotes. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, when I give an XPath to the price, it returns "&pound;" in place of £.

Is there any property we can set in HtmlCleaner to handle this, or is there another solution?

Thanks,

Jitendra

Recommended answer

No, I don't believe HtmlCleaner can do this. However, you can use Apache Commons StringEscapeUtils to "unescape" the HTML, like this:

StringEscapeUtils.unescapeHtml("&pound;679.00");

This will produce £679.00.
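To make the idea of "unescaping" concrete, here is a minimal, dependency-free sketch of what such a call does to the extracted text. It is not the Apache Commons implementation: the class name `EntityDecoder`, the method `unescapeBasic`, and the tiny entity table are hypothetical and exist only for illustration. A real library covers the full set of HTML entities.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of HtmlCleaner or Apache Commons.
public class EntityDecoder {
    // Deliberately small table of named entities (assumption: enough for this example).
    private static final Map<String, String> NAMED = Map.of(
        "pound", "\u00A3",
        "nbsp", "\u00A0",
        "quot", "\"",
        "amp", "&",
        "lt", "<",
        "gt", ">"
    );

    // Matches &name;, decimal &#163; and hexadecimal &#xA3; references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    public static String unescapeBasic(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = decodeNumeric(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = decodeNumeric(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left untouched rather than guessed.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to its character; falls back to the raw text if invalid.
    private static String decodeNumeric(String digits, int radix, String fallback) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) { // also covers NumberFormatException
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The price string as it comes back from the XPath query in the question.
        System.out.println(unescapeBasic("&pound;679.00"));
    }
}
```

In practice you would run the string returned by the XPath query through a library call such as the one shown above rather than maintain your own table. The sketch only shows why the entity text has to be decoded after extraction.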

Instead of HtmlCleaner, I would recommend you try JSoup.

