I wrote some code to scrape only the hyperlinks ending with .ece; here is my code:

import os
import urllib2

import requests
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
r = requests.get(_URL)
soup = BeautifulSoup(r.text)
urls = []
names = []
newpath=r'D:\fyp\data set'
os.chdir(newpath)
name='testecmlinks'
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])

names_urls = zip(names, urls)

for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open(name+'.txt', 'wb')
    pdf.write(res.read())
    pdf.close()


But I get the following error:

Traceback (most recent call last):
  File "D:/fyp/scripts/test.py", line 18, in <module>
    _FULLURL = _URL + link.get('href')
TypeError: cannot concatenate 'str' and 'NoneType' objects


Could you help me get the hyperlinks ending with .ece?

Best answer

Give this a try. Hopefully it will get you all the hyperlinks on that page that end with .ece.

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.thehindu.com/archive/web/2017/08/08/").text
soup = BeautifulSoup(response,"lxml")
# CSS selector: match only <a> tags whose href attribute ends with '.ece'
for link in soup.select("a[href$='.ece']"):
    print(link.get('href'))
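For context, the TypeError in the original script comes from <a> tags that have no href attribute: link.get('href') returns None for those, and concatenating None onto a string raises exactly that error in Python 2. The a[href$='.ece'] selector only matches anchors that do have an href ending in .ece, so it sidesteps the problem. Below is a rough, untested sketch of how the download loop from the question could be combined with this selector; it assumes Python 3 with requests and the lxml parser (to match the answer's code rather than the question's urllib2), reuses the D:\fyp\data set folder from the question, and derives the file name from the last path segment of each link, which is my own choice rather than anything from the original post.

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = "http://www.thehindu.com/archive/web/2017/08/08/"
soup = BeautifulSoup(requests.get(base_url).text, "lxml")

# Folder taken from the question; create it if it does not exist yet.
target_dir = r"D:\fyp\data set"
os.makedirs(target_dir, exist_ok=True)

for link in soup.select("a[href$='.ece']"):
    # urljoin handles both relative and absolute hrefs, so there is no need
    # to blindly prepend the archive URL as the original loop did.
    full_url = urljoin(base_url, link["href"])
    print(full_url)
    # Use the last path segment as the file name; the raw href may contain
    # slashes, which are not valid in a Windows file name.
    file_name = full_url.rstrip("/").split("/")[-1] + ".txt"
    with open(os.path.join(target_dir, file_name), "wb") as out_file:
        out_file.write(requests.get(full_url).content)

Like the original code, this saves the raw HTML of each linked page into a .txt file; extracting the article text itself would be a separate step.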

Regarding "python - Scraping only hyperlinks ending with .ece from an HTML page using BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48145958/
