Problem description
I have an XML file which contains the following strings:
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>
In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them, except where > and < appear in:
' "> '
' " > ' and
' </ '
respectively. In those positions they should NOT be replaced; all other occurrences of > and < should be replaced by the strings "greater than" and "less than". So the result should look like:
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>
How can I do that with Python?
Recommended answer
You could use lxml.etree.XMLParser with the recover=True option:
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
# wrap the fragments in a single root element and let the parser recover
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
# serialize the recovered tree to a string and print it
print(etree.tostring(root, encoding="unicode"))
Output
<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>
Note: > is left in the document as &gt; and < is completely removed (as an invalid character in XML text).
For simple XML-like content you could use re.split() to separate tags from the text and make the substitutions in the non-tag text regions:
import re
from itertools import zip_longest  # Python 3; this was izip_longest on Python 2
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# invalid_xml is the string defined in the previous snippet
# assumptions:
# doc = *( start_tag / end_tag / text )
# start_tag = '<' name *attr [ '/' ] '>'
# end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag:  # unwrap definitions
    tag = tag.format(**vars())

# one capturing group, so split() keeps the tags in its result
tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))
To avoid false positives for tags you could remove some of the '{ws}' above.
Output
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>
Note: both < and > are escaped in the text.
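The split-and-pair step may be easier to follow in isolation. The following is a minimal sketch that uses a deliberately simplified tag pattern (simple_tag is an assumption made for illustration, not the stricter pattern built above) to show how re.split() with a capturing group alternates text and tags, and how repeating one iterator twice makes zip_longest consume two items per pair:

```python
import re
from itertools import zip_longest

# Simplified, hypothetical tag pattern for illustration only:
# it matches bare tags such as <b> or </b>, without attributes.
simple_tag = re.compile(r'(</?[a-zA-Z]+>)')

# Because the pattern has a capturing group, split() keeps the tags,
# so the list alternates: text, tag, text, tag, ..., text
parts = simple_tag.split('a<b>1<2</b>c')
print(parts)

# The same iterator object appears twice, so each pair is built from
# two consecutive items: (text, tag). The last text gets '' as its tag.
iters = [iter(parts)] * 2
pairs = list(zip_longest(*iters, fillvalue=''))
print(pairs)
```

The '<' in '1<2' is not followed by a letter, so it stays in a text item rather than being treated as a tag.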
You could call any function instead of escape(text) above, e.g.:
def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')
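Putting the pieces together, here is a self-contained sketch of the same idea applied to one line from the question. It is not the exact code from the answer: TAG is a simplified single-line pattern, and spell_out and fix_text are hypothetical helper names introduced here. spell_out pads the replacement words with spaces, which is an assumption made so that the result matches the spacing the question asks for.

```python
import re
from itertools import zip_longest

# Simplified tag pattern for this sketch only (an assumption, looser than
# the answer's): <name attr="value" ...>, <name ... />, or </name>
TAG = re.compile(r'(</?[a-zA-Z]+(?:\s+[a-zA-Z]+\s*=\s*"[^"]*")*\s*/?>)')


def spell_out(text):
    # hypothetical variant of escape4human that pads the words with spaces
    return text.replace('<', ' less than ').replace('>', ' greater than ')


def fix_text(doc, convert=spell_out):
    # split into alternating text/tag items, convert only the text items
    parts = TAG.split(doc)
    pairs = zip_longest(*[iter(parts)] * 2, fillvalue='')
    return ''.join(convert(text) + tag for text, tag in pairs)


sample = '<field name="desc"> We will show 5>2 and 3<5.</field>'
result = fix_text(sample)
print(result)
```

Passing a different function as convert (for example xml.sax.saxutils.escape) would keep the document as valid XML instead of spelling the operators out.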