python - Parsing XML that contains a default namespace to get an element value using lxml

I have an XML string like this:

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex> """

I want to extract all the URLs present inside the <loc> nodes,
i.e. http://www.example.org/sitemap_1.xml.gz
I tried this code, but had no luck; it returns an empty list:

from lxml import etree
root = etree.fromstring(str1)
urls = root.xpath("//loc/text()")
print(urls)
[]

I tried to check whether my root node is well-formed. I ran the following and got back the same string as str1:

etree.tostring(root)

'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n<sitemap>\n<loc>http://www.example.org/sitemap_1.xml.gz</loc>\n<lastmod>2015-07-01</lastmod>\n</sitemap>\n</sitemapindex>'

Best answer

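This is a common mistake when dealing with XML that has a default namespace. Your XML has a default namespace, that is, a namespace declared without a prefix, here: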

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

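Note that it is not only the element on which the default namespace is declared that belongs to that namespace. All descendant elements implicitly inherit the ancestor's default namespace as well, unless declared otherwise (by using an explicit namespace prefix, or a local default namespace that points to a different namespace URI). This means that, in this case, all elements, including loc, are in the default namespace.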
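One way to see this inheritance is to inspect the tags lxml reports after parsing. This is a minimal sketch, assuming lxml is installed and reusing the sitemap snippet from the question; the variable names (xml, tags, loc_qname) are illustrative only.

```python
from lxml import etree

xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>"""
root = etree.fromstring(xml)

# lxml stores the namespace inside each tag in Clark notation: {uri}localname
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)

# QName splits an element's tag into namespace URI and local name
loc = root[0][0]  # first child of <sitemap>, i.e. <loc>
loc_qname = etree.QName(loc)
print(loc_qname.namespace, loc_qname.localname)
```

Every printed tag carries the sitemap namespace URI, even though only the root element declares it. That is why an XPath step such as //loc, which looks for loc in no namespace, matches nothing.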

To select elements in a namespace, you need to define a prefix-to-namespace mapping and use that prefix properly in the XPath:

from lxml import etree
str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(str1)

ns = {"d" : "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)[0]
print(etree.tostring(url, encoding="unicode"))

Output:

<loc xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        http://www.example.org/sitemap_1.xml.gz
    </loc>
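Since the question asks for the URL strings rather than the elements themselves, the same prefix mapping can be combined with text(). The following is a sketch under the assumption that the whitespace surrounding each URL should be stripped; the prefix d and the variable names are arbitrary choices, not something the XML requires.

```python
from lxml import etree

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>"""
root = etree.fromstring(str1)

# Map an arbitrary prefix to the document's default namespace URI
ns = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Select the text node of every <loc> and strip the indentation around it
urls = [text.strip() for text in root.xpath("//d:loc/text()", namespaces=ns)]
print(urls)

# findall accepts the same mapping and yields elements instead of text nodes
urls_findall = [el.text.strip() for el in root.findall(".//d:loc", namespaces=ns)]
print(urls_findall)
```

Both lists contain the single URL from the sample document. The XPath form is handy when further predicates are needed, while findall keeps to the simpler ElementPath syntax.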

Regarding "python - Parsing XML that contains a default namespace to get an element value using lxml", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31177707/