Problem description
I need to get the text inside the two elements into a string:
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
>>> text
'Martin Elias'
How could I achieve this?
I searched for "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html
This code is taken from the Python 2 docs:
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Here is the result:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Building on this, and after looking at the code in HTMLParser, I came up with this:
class myhtmlparser(HTMLParser):
    def __init__(self):
        self.reset()
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []
You can use it like this:
from HTMLParser import HTMLParser

pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

class myhtmlparser(HTMLParser):
    def __init__(self):
        self.reset()
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(pstring)

# Extract data from parser
tags = parser.NEWTAGS
attrs = parser.NEWATTRS
data = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
print tags
print attrs
print data
Now you should be able to extract your data from those lists easily. For the example above, data is ['Martin Elias'], so ''.join(data) gives the string you asked for. I hope this helps!
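The code above targets Python 2. In case it is useful, here is a minimal sketch of the same idea for Python 3, where the module is html.parser. The names TextExtractor and extract_text are made up for this illustration and are not part of the standard library. It only collects text nodes and ignores tags and attributes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text node seen while parsing (hypothetical helper)."""

    def __init__(self):
        # Call the base initializer so the parser's internal state is set up.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def extract_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return "".join(parser.chunks)


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
text = extract_text(source_code)
print(repr(text))  # 'Martin Elias'
```

Joining the chunks instead of taking the first one keeps this working if the text is split across several nested elements.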