from bs4 import BeautifulSoup
import requests

s = requests.Session()
r = s.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8')
soup = BeautifulSoup(r.text, 'html5lib')

DataGrid = soup.find('tbody')
test = []
for tr in DataGrid.find_all('tr')[:3]:
    for td in tr.find_all('td'):
        print(td.string)


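Hi, I'm trying to parse the HTML of this site (http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8) and get the table data. I want to exclude the first three table rows from the results, but for some reason I can't get the parser to do that. This is my first serious attempt at scraping and I'm at a loss as to how to make it work. I suspect it may have something to do with the html5lib parser I'm using, but honestly I don't know. Can someone show me how to make this work?

As a sanity check, it would be very useful to pull the data from just the first three rows. That way I can be confident that the finished query pulls from everything except those rows.

For example, the first row of the data I want would be "Equestrian Website".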

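Best answer

You are selecting only the first three rows instead of skipping them: [:3] slices out the first three elements of the list:

 DataGrid.find_all('tr')[:3]  # first three elements


It should be:

 DataGrid.find_all('tr')[3:]  # every element except the first three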
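
To make the difference between the two slices concrete, here is a minimal sketch on a plain Python list. The list `rows` and its contents are made up for illustration; they simply stand in for whatever `find_all('tr')` returns.

```python
# Stand-in for the list returned by find_all('tr'); the row names are made up.
rows = ["header 1", "header 2", "header 3", "data 1", "data 2"]

first_three = rows[:3]  # keeps only the first three elements
the_rest = rows[3:]     # skips the first three and keeps everything after them

print(first_three)
print(the_rest)
```

The two slices never overlap, and together they cover the whole list, which is why the row count drops by exactly three after `[3:]`.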

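from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8')
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")

tbl = soup.find("table")
# Skip the first three rows and print the cell text of every remaining row
for tag in tbl.find_all("tr")[3:]:
    for td in tag.find_all('td'):
        print(td.text)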


Slicing tbl.find_all("tr") as above gives the same row counts with two different parsers:

In [20]: soup=BeautifulSoup(r.content,"html.parser")

In [21]: tbl = soup.find("table")

In [22]: len(tbl.find_all("tr"))
Out[22]: 364

In [23]: len(tbl.find_all("tr")[3:])
Out[23]: 361

In [24]: soup=BeautifulSoup(r.content,"lxml")

In [25]: tbl = soup.find("table")

In [26]: len(tbl.find_all("tr")[3:])
Out[26]: 361

In [27]: len(tbl.find_all("tr"))
Out[27]: 364


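If what you actually need are the "more" hrefs, then do exactly that and grab the a tag from each tr. There are six tr elements before the rows you actually need, so you have to skip 6: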

tbl = soup.find("table")
out = (tag.find('a') for tag in tbl.find_all("tr")[6:])

for a in out:
    print(a["href"])


Output:

main.cfm?action=greenpages&sub=view&ID=9068
main.cfm?action=greenpages&sub=view&ID=9504
main.cfm?action=greenpages&sub=view&ID=10868
main.cfm?action=greenpages&sub=view&ID=10261
main.cfm?action=greenpages&sub=view&ID=10477
main.cfm?action=greenpages&sub=view&ID=10708
main.cfm?action=greenpages&sub=view&ID=11712
main.cfm?action=greenpages&sub=view&ID=12402
main.cfm?action=greenpages&sub=view&ID=12496
..................
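
One thing to watch for: `tag.find('a')` returns `None` for a row that has no link, and `a["href"]` would then raise a `TypeError`. The following is a rough sketch of a guard against that, run on made-up inline markup rather than the real page. The table contents, the single header row, and the `hrefs` list are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page: one header row without a link,
# two data rows with a "more" link, and one row with no link at all.
html = """
<table>
  <tr><td>Header</td></tr>
  <tr><td>First</td><td><a href="main.cfm?ID=1">more</a></td></tr>
  <tr><td>No link here</td></tr>
  <tr><td>Second</td><td><a href="main.cfm?ID=2">more</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbl = soup.find("table")

hrefs = []
for tr in tbl.find_all("tr")[1:]:  # skip the single header row
    a = tr.find("a", href=True)    # None when the row has no link
    if a is not None:
        hrefs.append(a["href"])

print(hrefs)
```

Collecting the hrefs into a list, rather than a generator, also means they can be looped over more than once.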


To use the links, just prepend the main URL:

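# A generator is exhausted after one pass, so rebuild `out` before looping again
out = (tag.find('a') for tag in tbl.find_all("tr")[6:])
for a in out:
    print("http://www.virginiaequestrian.com/{}".format(a["href"]))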


Output:

http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=9068
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=9504
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10868
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10261
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10477
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10708
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11712
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12402
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12496
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12633
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=13528


If you open the first link, it takes you to the Equestrian Website entry, which is the first row you wanted.
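
As an alternative to formatting the prefix by hand, the standard library's `urljoin` can resolve each relative href against the page URL. This is only a sketch, assuming Python 3 (`urllib.parse`); the `base` and `hrefs` names are introduced here for illustration, with the two hrefs copied from the output above.

```python
from urllib.parse import urljoin

# The page the relative links were scraped from
base = "http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8"

# Two of the relative hrefs from the output above
hrefs = [
    "main.cfm?action=greenpages&sub=view&ID=9068",
    "main.cfm?action=greenpages&sub=view&ID=9504",
]

# urljoin resolves each relative href against the base URL
full_urls = [urljoin(base, href) for href in hrefs]

for url in full_urls:
    print(url)
```

The advantage over string formatting is that `urljoin` also copes with hrefs that are already absolute or that start with a slash.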
