How to extract a table from Wikipedia using Beautiful Soup

Problem description

I am trying to write a scraper that extracts a table from this Wikipedia page. The problem is that I can extract every table on the page except the one I actually need (the table containing the stats of every election ever held in the United States). I do not think the problem is with my tag.
Here is my code:

from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
from urllib.request import urlopen

#getting the wiki page
page_info=urlopen('https://en.wikipedia.org/wiki/United_States_presidential_election')

soup=BeautifulSoup(page_info, 'html.parser')

headline=soup.find('table', "wikitable sortable jquery-tablesorter")
print(headline)

I think I am missing something crucial, but I cannot wrap my head around it. Can someone help me, please?

Solution

One way of doing this would be:

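from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/United_States_presidential_election'
# Fetch the static HTML, i.e. what the server sends before any JavaScript runs.
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
# In the static HTML the table's class is just "wikitable sortable".
table = soup.find('table', class_="wikitable sortable")

# read_html returns a list of DataFrames. The markup is wrapped in StringIO
# because newer pandas versions deprecate passing a literal HTML string.
tables = pd.read_html(StringIO(str(table)))
df = pd.concat(tables)
print(df)
df.to_csv("elections.csv", index=False)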

Which outputs:

     Year                                    Party  ... Electoral votes      Notes
0    1788                              Independent  ...        69 / 138        NaN
1    1788                               Federalist  ...        34 / 138        NaN
2    1788                               Federalist  ...         9 / 138        NaN
3    1788                               Federalist  ...         6 / 138        NaN
4    1788                               Federalist  ...         6 / 138        NaN
..    ...                                      ...  ...             ...        ...
[219 rows x 8 columns]

The script also writes the same table to elections.csv.

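Note: Whenever you inspect a page you plan to scrape, turn JavaScript (JS) off in the browser first. BeautifulSoup doesn't see dynamically rendered content. That's why you're not getting anything back: without JS, the class of the tag you're after is different. The jquery-tablesorter class is added by a script after the page loads, so it never appears in the HTML that BeautifulSoup parses.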
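
To see the class mismatch without hitting the network, here is a minimal sketch. The STATIC_HTML snippet is a made-up stand-in for what the server sends, not the real Wikipedia markup, and the variable names are only for illustration. It compares the lookup from the question with the one from the answer, plus a CSS selector as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML (before any JavaScript runs).
STATIC_HTML = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Party</th></tr>
  <tr><td>1788</td><td>Independent</td></tr>
</table>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# Lookup from the question: the class string copied from the browser's
# inspector includes jquery-tablesorter, which is absent here, so no match.
browser_class = soup.find("table", "wikitable sortable jquery-tablesorter")

# Lookup from the answer: the class string as it appears in the static HTML.
static_class = soup.find("table", class_="wikitable sortable")

# CSS selector: matches a table that carries both classes, whatever
# other classes it has and in whatever order they are listed.
by_selector = soup.select_one("table.wikitable.sortable")

print(browser_class is None)
print(static_class is not None)
print(by_selector is not None)
```

The selector form may be a bit more forgiving if the page later gains extra classes on that table, since a multi-word class string passed to find has to match the attribute value as written.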
