Problem description
How can I detect whether a string contains HTML (it may be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.
Example inputs:
html:
<head><title>I'm title</title></head>
Hello, <b>world</b>
Not html:
<ht fldf d><
<html><head> head <body></body> html
Recommended answer
You can use an HTML parser such as BeautifulSoup. Note that it tries its best to parse HTML, even broken HTML; how lenient it is depends on the underlying parser:
>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False
This basically tries to find any HTML element inside the string. If one is found, the result is True.
Another example with an HTML fragment:
>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
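The same "is there any element?" check can also be sketched with only the standard library, using html.parser (the module behind BeautifulSoup's "html.parser" option). This is a swapped-in technique, not the approach shown above, and the names _TagCounter and contains_html are hypothetical helpers introduced for illustration:

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Hypothetical helper: counts the start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser recognizes.
        self.start_tags += 1


def contains_html(text):
    """Return True if at least one start tag is found in text.

    Assumption: "contains HTML" means "the parser saw any opening tag",
    mirroring the bool(soup.find()) check.
    """
    parser = _TagCounter()
    parser.feed(text)
    parser.close()
    return parser.start_tags > 0


print(contains_html("Hello, <b>world</b>"))
print(contains_html("This is not an html"))
```

Like the BeautifulSoup version, this treats any start tag the parser reports as HTML, so a lenient parser may still accept markup you would consider malformed.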
Alternatively, you can use lxml.html:
>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False