Problem description
I have been using JSoup to parse lyrics, and it has been great until now, but I have run into a problem.
I can use Node.html() to return the full HTML of the desired node, which retains line breaks, like so:
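Gl&oacute;andi augu, silfurn&aacute;tt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;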
But, as you can see, this has the unfortunate side effect of retaining HTML entities and tags.
However, if I use Node.text(), I get a better-looking result, free of tags and entities:
Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,
This has another unfortunate side effect: the line breaks are removed and the text is compressed into a single line.
Simply replacing <br /> in the node with a newline before calling Node.text() yields the same result; it seems that the method itself compresses the text onto a single line, ignoring newlines.
Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks? Or is there another way of decoding entities and removing tags without having to replace them manually?
(Disclaimer: I haven't used this API.) A quick look at the docs suggests that you could visit each descendant node and dump out its text contents, inserting a line break whenever a special tag like <br> is encountered.
The TextNode.getWholeText() call also looks useful.
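The node-visiting idea could look roughly like the sketch below. It is untested against the real lyrics page and assumes jsoup is on the classpath; the class name LyricsText and the method textWithLineBreaks are made up for illustration. It uses TextNode.text(), which collapses whitespace (including newlines in the HTML source) to single spaces, and emits a newline only for each <br>. Swapping in TextNode.getWholeText() would keep the source whitespace as-is instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class LyricsText {

    // Hypothetical helper: walks every descendant of root, appending decoded
    // text for text nodes and a '\n' for each <br> element.
    public static String textWithLineBreaks(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Entities are already decoded by the parser; text()
                    // also collapses runs of whitespace to a single space.
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element
                        && ((Element) node).tagName().equals("br")) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened sample input, written on one line for the example.
        String html = "Gl&oacute;andi augu, silfurn&aacute;tt"
                + "<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;"
                + "<br /><br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r";
        Element body = Jsoup.parseBodyFragment(html).body();
        System.out.println(textWithLineBreaks(body));
    }
}
```

If the source HTML has a newline before each <br />, as in the question, text() turns it into a trailing space on the line, so you may want to trim each line afterwards.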