Input:
<p>
<milestone n="14" unit="verse" />
The name of the third river is
<placeName key="tgn,1130850" authname="tgn,1130850">Hiddekel</placeName>: this is the one which flows in front of Assyria. The fourth
river is the <placeName key="tgn,1123842" authname="tgn,1123842">Euphrates</placeName>.
</p>
Desired output:
<p>
<milestone n="14" unit="verse" />
The name of the third river is Hiddekel: this is the one which flows in front of Assyria. The fourth river is the Euphrates.
</p>
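Hi, I'm looking for a way to extract the text from a child element (placeName) and put it back into the surrounding body of text. I run into the same problem elsewhere in the XML file, for example with personal names. I'd like to be able to pull out the names and places without losing the milestone. Thanks for your help!
Current code: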
for p in chapter.findall('p'):
    i = 1
    for text in p.itertext():
        file.write(body.attrib["n"] + " " + chapter.attrib["n"] + ":" + str(i) + text)
        i = i + 1
Best answer
This can be done with BeautifulSoup and its unwrap() method:
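from bs4 import BeautifulSoup as bs

snippet = """your html above"""
# The 'lxml' HTML parser lowercases tag names, hence 'placename' below.
soup = bs(snippet, 'lxml')
# unwrap() removes each tag but keeps its contents in place.
for p in soup.find_all('placename'):
    p.unwrap()
print(soup)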
Output:
<html><body><p>
<milestone n="14" unit="verse"></milestone>
The name of the third river is
Hiddekel: this is the one which flows in front of Assyria. The fourth
river is the Euphrates.
</p>
</body></html>
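If you would rather stay in the standard library (the question's code looks like it uses ElementTree-style `findall` and `itertext`), roughly the same unwrapping can be sketched with `xml.etree.ElementTree`. This avoids the `<html><body>` wrapper and the lowercased tag names that the HTML parser introduces. The helper name `unwrap` is hypothetical, and the sketch assumes the unwrapped elements contain only text with no nested children; any children they had would be dropped.

```python
import xml.etree.ElementTree as ET

def unwrap(root, tag):
    """Replace every <tag> element under root with its own text, in place.

    Assumption: <tag> elements hold only text (no nested child elements).
    """
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept, if any
        for child in list(parent):
            if child.tag != tag:
                prev = child
                continue
            # Keep the element's content plus the text that followed it.
            merged = (child.text or "") + (child.tail or "")
            if prev is None:
                # No kept sibling before it: the text belongs to the parent.
                parent.text = (parent.text or "") + merged
            else:
                # Otherwise it continues the tail of the previous kept sibling.
                prev.tail = (prev.tail or "") + merged
            parent.remove(child)

snippet = """<p>
<milestone n="14" unit="verse" />
The name of the third river is
<placeName key="tgn,1130850" authname="tgn,1130850">Hiddekel</placeName>: this is the one which flows in front of Assyria. The fourth
river is the <placeName key="tgn,1123842" authname="tgn,1123842">Euphrates</placeName>.
</p>"""

p = ET.fromstring(snippet)
unwrap(p, "placeName")

# The milestone element is still there; only the placeName tags are gone.
print(ET.tostring(p, encoding="unicode"))

# Collapse line breaks to get the verse as a single line of text.
verse = " ".join("".join(p.itertext()).split())
print(verse)
```

ElementTree stores the text after an element in that element's `tail`, so removing an element without first moving its `text` and `tail` elsewhere would silently lose both. Calling `unwrap(p, "persName")` (or whatever tag your personal names use) should handle those the same way.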
Regarding "python - Parsing XML with Python: keeping the text inside attributes while removing the tags around them", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59865426/