Problem description
I am trying to write a scraper that extracts a table from this Wikipedia page. The problem is that I can extract every table on the page except the one I actually need (the table containing the stats for every presidential election ever held in the United States). I do not think the problem is with my tag.
Here is my code:
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
from urllib.request import urlopen
#getting the wiki page
page_info=urlopen('https://en.wikipedia.org/wiki/United_States_presidential_election')
soup=BeautifulSoup(page_info, 'html.parser')
headline=soup.find('table', "wikitable sortable jquery-tablesorter")
print(headline)
I think there is something crucial I am missing, but I cannot wrap my head around it. Can someone help me, please?
One way of doing this would be:
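from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/United_States_presidential_election').text
soup = BeautifulSoup(page, 'html.parser')
# Match the class as it appears in the raw HTML (no "jquery-tablesorter")
table = soup.find('table', class_="wikitable sortable")
# read_html returns a list of DataFrames; recent pandas versions expect a file-like object
tables = pd.read_html(StringIO(str(table)))
df = pd.concat(tables)
print(df)
df.to_csv("elections.csv", index=False)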
Which outputs:
Year Party ... Electoral votes Notes
0 1788 Independent ... 69 / 138 NaN
1 1788 Federalist ... 34 / 138 NaN
2 1788 Federalist ... 9 / 138 NaN
3 1788 Federalist ... 6 / 138 NaN
4 1788 Federalist ... 6 / 138 NaN
.. ... ... ... ... ...
[219 rows x 8 columns]
The same table is also written to a .csv file (elections.csv).
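If pandas is not available, the rows can also be pulled out with BeautifulSoup and the standard csv module. The following is only a minimal sketch: the HTML string is a hypothetical miniature of the election table rather than the live page, and the approach ignores rowspan and colspan cells, which the real table uses and which pandas handles for you.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the election table, used instead of a live request.
html = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Party</th><th>Electoral votes</th></tr>
  <tr><td>1788</td><td>Independent</td><td>69 / 138</td></tr>
  <tr><td>1788</td><td>Federalist</td><td>34 / 138</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.wikitable.sortable")

# Collect the text of every header or data cell, one list per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Write to an in-memory buffer here; a real script would instead use
# open("elections.csv", "w", newline="") as the target.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

For the real page, the pandas route above is probably the safer choice, because merged cells would otherwise shift columns.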
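Note: Whenever you're scraping, turn JS (JavaScript) off in your browser before inspecting the page. BeautifulSoup doesn't see dynamically rendered content. That's why you're not getting anything back: the jquery-tablesorter class is added by JavaScript after the page loads, so without JS the class of the tag you're after is different (just wikitable sortable).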
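The class mismatch can be reproduced offline. The sketch below uses a hypothetical, heavily simplified stand-in for the HTML the server sends (the real page is much larger) and compares the lookup from the question with two lookups that match the raw markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the server-sent HTML.
# Only the class attribute matters for this comparison.
static_html = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Party</th></tr>
  <tr><td>1788</td><td>Independent</td></tr>
</table>
"""

soup = BeautifulSoup(static_html, "html.parser")

# The lookup from the question: it includes the class that JavaScript
# adds in the browser, so nothing in the raw HTML matches.
with_js_class = soup.find("table", "wikitable sortable jquery-tablesorter")

# The class string exactly as it appears in the raw HTML does match.
exact_match = soup.find("table", class_="wikitable sortable")

# A CSS selector checks each class separately, so it does not depend on
# class order or on extra classes being present.
css_match = soup.select_one("table.wikitable.sortable")

print(with_js_class is None)
print(exact_match is not None, css_match is not None)
```

The CSS selector is arguably the more robust of the two working lookups, since a multi-class string passed to find has to match the attribute value exactly.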