Problem description
I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.
The offending XML file:
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!@#$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>
And here's the Python script.
from bs4 import BeautifulSoup

# Parse cdata.xml with BeautifulSoup's XML parser (backed by lxml)
with open("cdata.xml", "r") as xmlfile:
    soup = BeautifulSoup(xmlfile, "xml")
print(soup)
Here's the output. Note the CDATA section tags are missing.
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
!@#$%^&*()_+{}|:"<>?,./;'[]\-=
</bar>
</foo>
I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?
Is there a way to tell BeautifulSoup to preserve CDATA sections?
Update: Yes, it's an lxml thing. http://lxml.de/api.html#cdata So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
Recommended answer
In my case, if I use
soup = BeautifulSoup( xmlfile, "lxml-xml" )
then the CDATA is preserved and accessible.
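If the CDATA markers themselves have to survive a read, modify, write cycle, one option outside BeautifulSoup is the standard library's xml.dom.minidom, which keeps CDATA sections as their own node type. This is a different technique from the answer above, not a BeautifulSoup feature, and only a minimal sketch: the inline XML string is an assumed stand-in for the question's cdata.xml, and the strip() call is just an example modification.

```python
from xml.dom import minidom

# Inline stand-in for the question's cdata.xml (an assumption for this sketch).
XML = r"""<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!@#$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>"""

doc = minidom.parseString(XML)

# The CDATA section is the first child node of <bar>.
bar = doc.getElementsByTagName("bar")[0]
cdata = bar.firstChild
is_cdata = cdata.nodeType == cdata.CDATA_SECTION_NODE

# Example modification: trim the surrounding newlines inside the section.
cdata.data = cdata.data.strip()

# Serializing writes the node back out wrapped in <![CDATA[ ... ]]>.
output = doc.toxml()
print(output)
```

Because minidom builds a full DOM in memory, this suits small files like the one in the question; for anything larger, a parser configured to keep CDATA would be worth testing instead.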