Problem description

I have an XML file which contains the following strings:

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them such that when > and < are in:

 ' "> '
 ' " > ' and
 ' </ '

respectively, they should NOT be replaced; all other occurrences of > and < should be replaced by the strings "greater than" and "less than". The result should look like this:

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do that with Python?

Recommended answer

You could use lxml.etree.XMLParser with recover=True option:

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
# Python 3: lxml writes bytes, so use the binary stream (plain sys.stdout on Python 2)
root.getroottree().write(sys.stdout.buffer)

Output

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

Note: > is left in the document as &gt; and < is completely removed (as invalid character in xml text).
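
For context on why a recovering parser is needed here, a strict parser can be used to check which of the two characters actually breaks well-formedness. The following is a minimal sketch using the standard library's xml.etree.ElementTree instead of lxml; parses_strictly is a hypothetical helper name, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def parses_strictly(fragment):
    """Return True if the fragment is well-formed once wrapped in a root element.

    Hypothetical helper for illustration; it uses a strict (non-recovering) parser.
    """
    try:
        ET.fromstring("<root>%s</root>" % fragment)
        return True
    except ET.ParseError:
        return False


# A bare '>' in text is accepted by a strict parser.
print(parses_strictly('<field name="desc">5>2</field>'))
# A bare '<' in text is rejected, because it looks like the start of a tag.
print(parses_strictly('<field name="desc">3<5</field>'))
# The escaped form is accepted.
print(parses_strictly('<field name="desc">3&lt;5</field>'))
```

In other words, it is the unescaped < that makes the strict parse fail, which is consistent with recover=True keeping > (as &gt;) and dropping <.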

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

import re
from itertools import zip_longest  # Python 3 name; it was izip_longest on Python 2
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

# the single capturing group makes split() return the tags as well as the text
tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text (invalid_xml is the string from the first example)
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of the '{ws}' above.

Output

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

Note: both < and > are escaped in the text.
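
The split-and-pair idiom used in the code above can be easier to follow in isolation. The sketch below uses a deliberately simplified tag pattern (simple_tag is an assumption made for illustration, not the regex from the answer) to show that re.split() with a capturing group returns text and tags alternately, and that repeating one iterator twice lets zip_longest() walk the list two items at a time.

```python
import re
from itertools import zip_longest

# Simplified tag pattern, for illustration only (assumption):
# '<', an optional '/', a name, then anything up to the next '>'.
simple_tag = re.compile(r'(</?[a-zA-Z]+[^<>]*>)')

# Because the pattern is wrapped in a capturing group, split() keeps the tags.
parts = simple_tag.split('<b>1<2</b>')
print(parts)  # ['', '<b>', '1<2', '</b>', '']

# The same iterator object appears twice, so each pair consumes two items:
# first a text piece, then the tag that follows it.
iters = [iter(parts)] * 2
pairs = list(zip_longest(*iters, fillvalue=''))
print(pairs)  # [('', '<b>'), ('1<2', '</b>'), ('', '')]
```

The '<' in '1<2' is not followed by a letter, so it is not treated as a tag and stays in the text part, where it can be escaped.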

You could call any function instead of escape(text) above e.g.,

def escape4human(text):
    # pad with spaces so that "5>2" becomes "5 greater than 2"
    return text.replace('<', ' less than ').replace('>', ' greater than ')
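
Putting the pieces together, the following is a self-contained Python 3 sketch of the word-replacement variant the question asks for. It is not the answer's exact code: TAG is a simplified pattern that assumes double-quoted attributes and no whitespace directly after '<', humanize is a hypothetical helper name, and escape4human is padded with spaces so the words are separated from the surrounding digits.

```python
import re

# Simplified tag pattern (assumption): start or end tags, double-quoted attributes only.
TAG = re.compile(r'''(
    <[a-zA-Z]+ (?:\s+[a-zA-Z]+\s*=\s*"[^"]*")* \s* /?>   # start tag
  | </[a-zA-Z]+\s*>                                      # end tag
)''', re.VERBOSE)


def escape4human(text):
    # Spell the characters out, padded with spaces.
    return text.replace('<', ' less than ').replace('>', ' greater than ')


def humanize(doc):
    """Replace '<' and '>' in text regions only, leaving recognised tags untouched."""
    parts = TAG.split(doc)
    # With one capturing group, even indices hold text and odd indices hold tags.
    parts[::2] = [escape4human(text) for text in parts[::2]]
    return ''.join(parts)


print(humanize('<field name="desc">5>2 and 3<5</field>'))
# <field name="desc">5 greater than 2 and 3 less than 5</field>
print(humanize('<field name="intro" > pqrst</field>'))
# <field name="intro" > pqrst</field>
```

Anything the pattern does not recognise as a tag is treated as text, so a stricter pattern means fewer false positives but more markup that gets spelled out by mistake.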
