This article describes how to convert an HTML table to CSV in Python; it may serve as a useful reference for anyone facing the same problem.
Problem description
I'm trying to scrape a table from a dynamic page. After the following code (requires selenium), I manage to get the contents of the <table> element.
I'd like to convert this table into a csv, and I have tried two things, but both fail:

- pandas.read_html returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
- soup.find_all('tr') returns the error 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml).
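For reference, a minimal sketch of how the first attempt with pandas.read_html would typically be wired up (this is an assumption, not the asker's exact call: get_attribute('innerHTML') returns only the element's inner markup, so the string may need to be re-wrapped in <table> tags before parsing):

import pandas as pd

# tablehtml is the innerHTML string captured by the selenium code below
dfs = pd.read_html("<table>" + tablehtml + "</table>")  # returns a list of DataFrames
dfs[0].to_csv("result.csv", index=False)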
Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pandas as pd

main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)

# Navigate to the indicator table
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)

# Pick the last 72 periods in the date selector
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)

# Grab the rendered table's inner HTML
table = driver.find_element_by_id("table_main")
tablehtml = table.get_attribute('innerHTML')
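Side note: WebDriverWait is imported above but never used. The fixed time.sleep pauses could be replaced with explicit waits; this is only a sketch, reusing the same locators and the standard expected_conditions helpers:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, 15)  # give each element up to 15 seconds
# Click the link only once it is actually clickable, instead of sleeping
wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "Industry"))).click()
# Make sure the results table is present before reading its HTML
wait.until(EC.presence_of_element_located((By.ID, "table_main")))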
Recommended answer
Without access to the table you're actually trying to scrape, I used this example:
<table>
<thead>
<tr>
<td>Header1</td>
<td>Header2</td>
<td>Header3</td>
</tr>
</thead>
<tr>
<td>Row 11</td>
<td>Row 12</td>
<td>Row 13</td>
</tr>
<tr>
<td>Row 21</td>
<td>Row 22</td>
<td>Row 23</td>
</tr>
<tr>
<td>Row 31</td>
<td>Row 32</td>
<td>Row 33</td>
</tr>
</table>
and scraped it using:
from bs4 import BeautifulSoup as BS

content = ...  # contents of that table
soup = BS(content, 'html5lib')
# Collect the <td> cells of every <tr>, including the header row inside <thead>
rows = [tr.findAll('td') for tr in soup.findAll('tr')]
This rows object is a list of lists:
[
[<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
[<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
[<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
[<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]
...and you can write it to a file:
with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>', '').replace('</td>', '') for e in it) + '\n')
which looks like this:
Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
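As a variant sketch (reusing the rows list built above, not part of the original answer): the cell text can also be extracted with get_text() and written via Python's csv module, which handles quoting and avoids the manual tag stripping:

import csv

with open('result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # get_text() returns just the text inside each <td>, whatever its markup
        writer.writerow(cell.get_text(strip=True) for cell in row)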
That concludes this article on converting an HTML table to CSV in Python; hopefully the recommended answer above is helpful.