Problem description
I am trying to scrape a webpage in Python. I can easily get results for tags that sit on a single line, but for tags spread over multiple lines my code retrieves nothing.
In the HTML source, single-line tags appear as:
<td><span class="facultyName">John Matthew Falletta, MD</span>
and multi-line tags appear as:
<td><span class="label">Division:</span>
</td><td>Hematology/Oncology</td>
This is what I wrote:
patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')
fullname = re.findall(patFinderFullname,webpage) #works fine
patFinderDivision = re.compile('<span class="label">Division:</span> </td><td>(.*)</td>')
division = re.findall(patFinderDivision,webpage) #doesn't work
Here my webpage variable contains the URL that has to be scraped. Can someone point out what I am missing or where I am going wrong?
Recommended answer
I highly recommend you use BeautifulSoup. It is a Python library for parsing HTML documents.
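As a rough illustration of that suggestion, here is a minimal sketch that pulls both values from the snippets in the question. It assumes the third-party `beautifulsoup4` package is installed, and the surrounding `<table>`/`<tr>` wrapper and the `html` variable are made up for the example; the real page will likely need different navigation.

```python
from bs4 import BeautifulSoup

# Hypothetical sample built from the two snippets in the question.
html = """
<table><tr>
<td><span class="facultyName">John Matthew Falletta, MD</span></td>
</tr><tr>
<td><span class="label">Division:</span>
</td><td>Hematology/Oncology</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Single-line case: read the text of the span directly.
fullname = soup.find("span", class_="facultyName").get_text(strip=True)

# Multi-line case: find the label span, then step to the next <td> cell.
label = soup.find("span", class_="label", string="Division:")
division = label.find_parent("td").find_next_sibling("td").get_text(strip=True)

print(fullname)
print(division)
```

Because the parser works on the tag tree rather than on raw text, the line break between `</span>` and `</td>` no longer matters.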
P.S.: If you want to stick with your own code, use \s* to skip whitespace (including line breaks) in the regex.
# \s* matches any run of whitespace, including the newline between </span> and </td>
patFinderDivision = re.compile(r'<span class="label">Division:</span>\s*</td>\s*<td>(.*)</td>')
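To see why whitespace handling matters, the following sketch runs the question's original pattern and a `\s*` variant against the sample markup. It assumes `webpage` holds the HTML text itself (not the URL), and the non-greedy `(.*?)` group is my own choice rather than part of the original answer.

```python
import re

# Assumed sample: the HTML text from the question, with a real newline
# between </span> and </td>.
webpage = (
    '<td><span class="facultyName">John Matthew Falletta, MD</span>\n'
    '<td><span class="label">Division:</span>\n'
    '</td><td>Hematology/Oncology</td>\n'
)

# Original pattern: expects exactly one space where the page has a newline.
original = re.compile(r'<span class="label">Division:</span> </td><td>(.*)</td>')

# Variant: \s* absorbs any whitespace, newlines included, between the tags.
tolerant = re.compile(
    r'<span class="label">Division:</span>\s*</td>\s*<td>(.*?)</td>'
)

no_match = original.findall(webpage)
division = tolerant.findall(webpage)

print(no_match)  # []
print(division)  # ['Hematology/Oncology']
```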