How to extract tables from SEC N-Q documents using BeautifulSoup

Problem description

(Python 2.7, BeautifulSoup4)

I am trying to extract the table contents from SEC N-Q documents. Sample html here: https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm

The file has no tag at all. I want to search for each 'C. Futures Contract' section, find the next <table> after it, and extract the contents of its <tr> rows. 'C. Futures Contract' also occurs multiple times in one document.

I've tried the following code but got nothing.

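import requests, re
from bs4 import BeautifulSoup
r = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
soup = BeautifulSoup(r.content)
# A compiled regex passed as the first argument is matched against tag names, not text
futures = soup.find_all(re.compile('C. Futures Contract'))
print futures

[]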

Recommended answer

First of all, if you are searching by text, use the text argument (starting from BeautifulSoup 4.4.0 the argument is named string).
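The difference between the two kinds of filter can be seen on a small sample. This is only a sketch: the HTML below is a hypothetical stand-in for the filing (not copied from the SEC page), it is written for Python 3, and it assumes a BeautifulSoup version of 4.4.0 or later so that the string argument is available.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing's markup, not copied from the SEC page.
html = """
<p>C. Futures Contracts: open positions</p>
<table><tr><td>S&amp;P 500 Index</td><td>10</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"C\. Futures Contract")

# A regex as the first positional argument filters on tag names (p, table, ...).
by_name = soup.find_all(pattern)

# string= (text= before 4.4.0) filters on the text nodes themselves.
by_text = soup.find_all(string=pattern)

print(by_name)
print(by_text)
```

No tag is named anything like 'C. Futures Contract', so the first call finds nothing, while the second returns the matching text node.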

Aside from that, for every futures section, use find_next() to find the next table element.
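As a rough illustration of how find_next() walks forward from a matched text node, here is a sketch on made-up markup (the section names and cell values are assumptions, and the code targets Python 3). The search starts at the heading text, so the earlier table is ignored and the intervening div is skipped.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: one unrelated table before the heading, one wanted table after it.
html = """
<p>B. Some other section</p>
<table><tr><td>skip me</td></tr></table>
<p>C. Futures Contracts</p>
<div>explanatory note</div>
<table><tr><td>E-mini S&amp;P 500</td><td>12</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the heading text, then move forward in document order to the next <table>.
heading = soup.find(string=re.compile(r"C\. Futures Contract"))
table = heading.find_next("table")

cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```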

Working code:

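import re

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
# No parser is named here, so BeautifulSoup picks the best one installed
soup = BeautifulSoup(response.content)

# Match text nodes (not tag names); the dot is escaped so it only matches a literal "."
futures = soup.find_all(text=re.compile(r'C\. Futures Contract'))
for future in futures:
    # For each heading, take the first <table> that follows it and print its rows
    for row in future.find_next("table").find_all("tr"):
        print [cell.get_text(strip=True) for cell in row.find_all("td")]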
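The same approach can be wrapped in a function that returns the rows instead of printing them. This is a sketch under a few assumptions: it is written for Python 3, it names html.parser explicitly, the helper name extract_tables_after is hypothetical, and the sample HTML is invented for illustration. It also skips a heading with no following table, where find_next() returns None.

```python
import re
from bs4 import BeautifulSoup


def extract_tables_after(html, heading_regex):
    """Return one list of rows for each heading that is followed by a <table>."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for heading in soup.find_all(string=re.compile(heading_regex)):
        table = heading.find_next("table")
        if table is None:
            # find_next() returns None when nothing follows the heading.
            continue
        rows = []
        for tr in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables


# Invented sample with two sections that have tables and one that does not.
sample = """
<p>C. Futures Contracts</p>
<table>
<tr><th>Contract</th><th>Number</th></tr>
<tr><td>E-mini S&amp;P 500</td><td>12</td></tr>
</table>
<p>C. Futures Contracts</p>
<table><tr><td>E-mini Russell 2000</td><td>3</td></tr></table>
<p>C. Futures Contracts (no table follows)</p>
"""
result = extract_tables_after(sample, r"C\. Futures Contract")
for rows in result:
    print(rows)
```

For the real filing, the bytes in response.content from the requests.get() call could be passed as the html argument.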

