Problem description
In Python, I want to remove every "<html>" from a string except the first occurrence.
I also want to remove every "</html>" from the string except the last occurrence.
"<html>" can be uppercase, so the match needs to be case-insensitive.
What is the best way to do this?
Recommended answer
This solution uses two regexes. The first regex splits the entire file/string into three chunks:
- The first chunk (captured into group $1) is everything from the start of the string up to and including the first HTML start tag.
- The second chunk (captured into group $2) is everything after the first HTML start tag up to the start of the last HTML end tag.
- The third chunk (captured into group $3) is the last HTML end tag and everything that follows it, to the end of the file/string.
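To make the three chunks concrete, here is a minimal sketch that applies a one-line form of the first regex to a made-up sample string and prints each captured group. The sample text and the variable names are assumptions chosen for illustration only.

```python
import re

# One-line form of the first regex: three capture groups, with "." matching
# newlines (DOTALL) and the tag names matched case-insensitively (IGNORECASE).
p_outer = re.compile(r"^(.*?<html[^>]*>)(.*)(</html\s*>.*)$",
                     re.DOTALL | re.IGNORECASE)

# Hypothetical sample input with a nested, differently cased tag pair.
sample = "x<HTML lang='en'>a<html>b</html>c</HTML>y"

m = p_outer.match(sample)
chunk1, chunk2, chunk3 = m.groups()

print(repr(chunk1))  # "x<HTML lang='en'>"
print(repr(chunk2))  # 'a<html>b</html>c'
print(repr(chunk3))  # '</HTML>y'
```

The lazy `.*?` in the first group stops at the first start tag, while the greedy `.*` in the second group backtracks only as far as the last end tag, which is what places the inner tags in the middle chunk.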
The function first tries to match the first regex against the input text. If it matches, the contents of the outermost HTML element (captured in group 2) are stripped of all HTML start and end tags using the second regex. The string is then reassembled from the three chunks, with the middle chunk now free of HTML tags. If the regex does not match, the text is returned unchanged.
import re

def stripInnermostHTMLtags(text):
    '''Strip all but outermost HTML start and end tags.
    '''
    # Regex to match outermost HTML element and its contents.
    p_outer = re.compile(r"""
        ^                   # Anchor to start of string.
        (.*?<html[^>]*>)    # $1: Outer HTML start tag.
        (.*)                # $2: Outer HTML element contents.
        (</html\s*>.*)      # $3: Outer HTML end tag.
        $                   # Anchor to end of string.
        """, re.DOTALL | re.VERBOSE | re.IGNORECASE)
    # Split text into outermost HTML tags and its contents.
    m = p_outer.match(text)
    if m:
        # Regex to match HTML element start or end tag.
        p_inner = re.compile("</?html[^>]*>", re.IGNORECASE)
        # Strip contents of any/all HTML start and end tags.
        contents = p_inner.sub("", m.group(2))
        # Put string back together stripped of inner HTML tags.
        text = m.group(1) + contents + m.group(3)
    return text
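A short usage sketch may help here. So that the snippet runs on its own, it restates the function in condensed form under a hypothetical name, strip_inner_html_tags; the input strings are invented for illustration.

```python
import re

# Condensed restatement of the function above so this snippet is standalone.
# The name strip_inner_html_tags and the module-level patterns are hypothetical.
P_OUTER = re.compile(r"^(.*?<html[^>]*>)(.*)(</html\s*>.*)$",
                     re.DOTALL | re.IGNORECASE)
P_INNER = re.compile(r"</?html[^>]*>", re.IGNORECASE)

def strip_inner_html_tags(text):
    m = P_OUTER.match(text)
    if not m:
        # No outer <html>...</html> pair found: return the input unchanged.
        return text
    return m.group(1) + P_INNER.sub("", m.group(2)) + m.group(3)

# Mixed-case, nested tags: only the first start tag and last end tag survive.
nested = "<HTML><head></head><html><body>hi</body></HTML></html>"
print(strip_inner_html_tags(nested))
# <HTML><head></head><body>hi</body></html>

# Input without any html tags passes through untouched.
print(strip_inner_html_tags("no tags here"))
# no tags here
```

One consequence of this design is that the outer regex needs both a start tag and an end tag to match; a string with several start tags but no end tag is returned as-is, with nothing stripped.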
Note that this solution correctly handles any attributes in the HTML start tags. It does NOT handle HTML tags whose attribute values contain the > character, but that should be very rare.
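Both points can be checked with the second regex alone. The sketch below uses invented input strings: one tag with ordinary attributes, and one whose attribute value contains a > character.

```python
import re

# The second regex from the answer: any html start or end tag.
p_inner = re.compile(r"</?html[^>]*>", re.IGNORECASE)

# Ordinary attributes are consumed by [^>]* and the whole tag is removed.
with_attrs = 'a<html lang="en" class="x">b'
print(repr(p_inner.sub("", with_attrs)))  # 'ab'

# A ">" inside an attribute value ends the match early, so the tail of the
# tag is left behind in the output.
with_gt = 'a<html data-note="1 > 0">b'
print(repr(p_inner.sub("", with_gt)))  # 'a 0">b'
```

If such attribute values are a realistic possibility in the input, an HTML parser is likely a safer tool than a regex for this task.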