Question
(Python 2.7, BeautifulSoup4)
I am trying to extract the table contents from SEC N-Q documents. Sample HTML here: https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm
The file has no tag at all. I want to search for each 'C. Futures Contract' section, find the next <table> after it, and extract the contents of its <tr> rows. 'C. Futures Contract' also occurs multiple times in one document.
I've tried the following code but got nothing.
import requests, re
from bs4 import BeautifulSoup
r = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
soup = BeautifulSoup(r.content)
futures = soup.find_all(re.compile('C. Futures Contract'))
print futures
[]
Recommended answer
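First of all, if you are searching by text, use the text argument (starting from BeautifulSoup 4.4.0, the argument is named string). A regular expression passed as the first positional argument to find_all() is matched against tag names, not against text, which is why the original call returns an empty list.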
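The difference between the two calls can be seen on a small inline document. This is only a sketch: the HTML string below is made-up markup standing in for a filing, the variable names are illustrative, and it uses the string argument, so it assumes BeautifulSoup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the SEC document.
html = "<div><p>C. Futures Contracts</p><table><tr><td>x</td></tr></table></div>"
soup = BeautifulSoup(html, "html.parser")

# The dot is escaped so it matches a literal "." rather than any character.
pattern = re.compile(r"C\. Futures Contract")

# A regex passed positionally is tested against tag names (div, p, table, ...).
by_name = soup.find_all(pattern)

# With string=, the regex is tested against the text nodes instead.
by_text = soup.find_all(string=pattern)

print(by_name)
print(by_text)
```

On older BeautifulSoup 4 releases the same search would be written with text= in place of string=.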
Aside from that, for every futures section, use find_next() to find the next table element.
Working code:
import re
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
soup = BeautifulSoup(response.content)

# Find every text node that mentions the futures section
futures = soup.find_all(text=re.compile('C. Futures Contract'))
for future in futures:
    # The first <table> that follows the matched text in document order
    for row in future.find_next("table").find_all("tr"):
        print [cell.get_text(strip=True) for cell in row.find_all("td")]
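The same approach can be wrapped in a function and tried offline. The following is a minimal sketch, not part of the original answer: the function name tables_after_heading, the sample markup, and the cell values are all made up for illustration. It assumes BeautifulSoup 4.4.0 or later (for the string argument) and uses the print() function so that it also runs on Python 3. It additionally skips a heading when no table follows it, since find_next() returns None in that case.

```python
import re
from bs4 import BeautifulSoup


def tables_after_heading(html, heading_pattern):
    """Return one list of rows for each text node matching heading_pattern."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for heading in soup.find_all(string=heading_pattern):
        # First <table> after the matched text, in document order.
        table = heading.find_next("table")
        if table is None:
            continue  # no table follows this heading
        rows = []
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
            if cells:  # ignore rows without any cells
                rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical sample with two futures sections; the values are invented.
sample = """
<p>C. Futures Contracts: first fund</p>
<table><tr><td>Index A</td><td>June 2015</td></tr></table>
<p>Unrelated paragraph</p>
<p>C. Futures Contracts: second fund</p>
<table><tr><td>Index B</td><td> 12 </td></tr></table>
"""

result = tables_after_heading(sample, re.compile(r"C\. Futures Contract"))
print(result)
```

Returning nested lists instead of printing inside the loop keeps the rows of each section together, which makes it easier to tell the multiple 'C. Futures Contract' occurrences apart.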