Question
I want to compare some webpages based on their overall DOM structure rather than their particular contents. To this end I need a representation that reflects the tag hierarchy but does not include attributes or textual tag contents.
Basically, I want to turn a representation like this:
<!DOCTYPE html>
<html>
<body>
<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>
<img src="peter.jpg" />
</body>
</html>
into a canonical bare-bones representation like this:
<html><body><h1></h1><p><span></span></p><img/></body></html>
i.e. all attributes are removed, as well as any tag contents that are not other tags.
I found a way to remove attributes from tags, but I'm having problems differentiating between text child nodes and tag child nodes.
Recommended answer
Going by the documentation, I would go for something like this (assume soup holds exactly the document you posted):
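for e in soup.find_all(True):      # True matches every tag
    e.attrs = {}                   # drop all attributes
    for i in e.contents:
        # .string is the node itself for text, or a tag's only text child
        if i.string:
            i.string.replace_with('')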
I think that without looping over each tag's contents you would end up with leftover text whenever a tag has more than one child, one of which is text and another of which is a tag containing text (as in your example, <p><span style="color:red">My</span> first paragraph.</p>).
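To make that point concrete, here is a minimal sketch comparing the two strategies on the <p> example. The function names strip_shallow and strip_with_child_loop are made up for illustration, and the html.parser backend is an assumption (other parsers may wrap the fragment in <html><body>).

```python
from bs4 import BeautifulSoup

SNIPPET = '<p><span style="color:red">My</span> first paragraph.</p>'


def strip_shallow(html):
    """Hypothetical variant: only checks each tag's own .string."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    for e in soup.find_all(True):
        e.attrs = {}
        # .string is None when a tag has more than one child,
        # so the text sitting next to <span> inside <p> is never touched.
        if e.string:
            e.string.replace_with("")
    return str(soup)


def strip_with_child_loop(html):
    """Same loop as in the answer: visit every child of every tag."""
    soup = BeautifulSoup(html, "html.parser")
    for e in soup.find_all(True):
        e.attrs = {}
        for i in e.contents:
            if i.string:
                i.string.replace_with("")
    return str(soup)


print(strip_shallow(SNIPPET))          # <p><span></span> first paragraph.</p>
print(strip_with_child_loop(SNIPPET))  # <p><span></span></p>
```

The shallow variant leaves " first paragraph." behind because <p> has two children, while the child loop reaches that text node directly.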
When run against your example:
(env) $ python strip.py
<!DOCTYPE html>
<html><body><h1></h1><p><span></span></p><img/></body></html>
(It can be changed slightly so that it does not return the newline or the doctype.)
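One possible way to make that change, offered as a sketch rather than the answer's own code: instead of replacing strings with empty ones, detach every NavigableString from the tree. In bs4, Doctype and Comment are subclasses of NavigableString, so this also drops the doctype and the newline after it. The name skeleton and the html.parser backend are assumptions.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """<!DOCTYPE html>
<html>
<body>
<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>
<img src="peter.jpg" />
</body>
</html>"""


def skeleton(html):
    """Hypothetical helper: return only the tag hierarchy of `html`."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    # Snapshot the descendants first, because the tree is modified below.
    for node in list(soup.descendants):
        if isinstance(node, NavigableString):
            node.extract()   # removes text, whitespace, comments, doctype
        else:
            node.attrs = {}  # remaining nodes are tags: drop attributes
    return str(soup)


print(skeleton(HTML))
# <html><body><h1></h1><p><span></span></p><img/></body></html>
```

Two pages can then be compared structurally by checking whether skeleton(page_a) == skeleton(page_b).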