Problem description
I am using the HtmlCleaner library for HTML content extraction. It works fairly well, but with a few limitations.
It is not able to handle special characters such as £ or quotes. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, when I give it the XPath to the price, it returns "&pound;" in place of £.
Is there any property we can set in HtmlCleaner to handle this, or is there another solution?
Thanks,
Jitendra
Recommended answer
No, I don't believe HtmlCleaner can do this. However, you can use Apache Commons StringEscapeUtils to "unescape" the HTML, like this:
StringEscapeUtils.unescapeHtml("&pound;679.00");
will produce £679.00.
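Note that in Commons Lang 3 and in Commons Text the equivalent method is named unescapeHtml4. If adding a dependency is not an option, the same idea can be sketched with only the JDK. The class name EntityUnescaper and its small entity table below are hypothetical and cover just a handful of named entities plus numeric references, so treat this as an illustration of what unescaping does rather than a replacement for a full library:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescaper {
    // Hypothetical, deliberately tiny table; real HTML defines many more named entities.
    private static final Map<String, String> NAMED = Map.of(
        "pound", "\u00A3",
        "nbsp", "\u00A0",
        "quot", "\"",
        "amp", "&",
        "lt", "<",
        "gt", ">");

    // Matches &name; as well as decimal (&#163;) and hex (&#xA3;) references.
    private static final Pattern ENTITY = Pattern.compile(
        "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    public static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#")) {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
                // Leave invalid code points untouched.
                replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            } else {
                // Unknown names are left exactly as they appeared.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The price string from the question, as extracted via XPath.
        System.out.println(unescape("&pound;679.00"));
    }
}
```

The input is decoded in a single pass, so a doubly escaped value such as "&amp;pound;" comes back as "&pound;" rather than "£".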
Instead of HtmlCleaner, I would recommend you try JSoup.