python - Parsing HTML with lxml in Python: getting a tag's text when specific symbols cause problems

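I am parsing real-world HTML files with lxml. That means I want to extract information from tags whose markup I have no control over.
The problem I am running into lies within the data.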

<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>

The problem is caused by symbols in the data. Is there a solution I can apply to get the text out of this tag?

Best answer

The HTML is actually broken: the "<" before "IE" is literal text but is not escaped as "&lt;".

You can parse it with BeautifulSoup and the lenient html5lib parser:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


data = u"""
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

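# html5lib follows the HTML5 parsing rules, so the stray "<" is kept as text
soup = BeautifulSoup(data, "html5lib")
# The wanted text is the text node that immediately follows </legend>
print(soup.fieldset.legend.next_sibling.strip())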

Prints:

Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
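
If staying with lxml is preferred, one possible workaround is to escape the stray "<" before parsing. The sketch below is not from the original answer. The helper name escape_stray_lt and its regular expression are hypothetical. They rest on the heuristic that a "<" not followed by a letter, "/", "!" or "?" cannot open a tag, so text such as "a<b" would not be caught. The lxml step is skipped if lxml is not installed.

```python
import re

# Heuristic (an assumption, not part of lxml): a "<" that is not followed by a
# letter, "/", "!" or "?" cannot start a tag, end tag, comment or doctype, so
# it is treated as literal text and escaped.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Return markup with stray "<" characters replaced by "&lt;"."""
    return STRAY_LT.sub("&lt;", markup)


data = """
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

fixed = escape_stray_lt(data)

try:
    import lxml.html
except ImportError:
    lxml = None  # lxml is not installed; only the escaping step runs

if lxml is not None:
    root = lxml.html.fromstring(fixed)
    legend = next(root.iter("legend"))
    # The wanted text is the tail of <legend>: the text node after </legend>
    print(legend.tail.strip())
```

Only the "<" before "IE" is rewritten. The real tags are left untouched, so the markup that reaches lxml is well-formed at that point.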

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/33786869/