I wrote some code to scrape only the hyperlinks ending in .ece. Here is my code:
import os
import requests
import urllib2
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
r = requests.get(_URL)
soup = BeautifulSoup(r.text)
urls = []
names = []
newpath = r'D:\fyp\data set'
os.chdir(newpath)
name = 'testecmlinks'
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])

names_urls = zip(names, urls)
for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open(name + '.txt', 'wb')
    pdf.write(res.read())
    pdf.close()
But I get the following error:
Traceback (most recent call last):
File "D:/fyp/scripts/test.py", line 18, in <module>
_FULLURL = _URL + link.get('href')
TypeError: cannot concatenate 'str' and 'NoneType' objects
Could you help me get the hyperlinks ending in .ece?

Best answer
Try this. It should get you all the hyperlinks ending in .ece from that page.
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.thehindu.com/archive/web/2017/08/08/").text
soup = BeautifulSoup(response, "lxml")
for link in soup.select("a[href$='.ece']"):
    print(link.get('href'))
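For context on the original traceback: the TypeError happens because `link.get('href')` returns None for any `<a>` tag that has no href attribute, and concatenating a string with None fails. The CSS selector `a[href$='.ece']` sidesteps this, since it only matches tags whose href ends in .ece. As a minimal sketch (using a small hypothetical HTML snippet in place of the live archive page), you can also fix the original find-all approach by skipping tags without an href:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the archive page; the first <a>
# has no href, which is what triggered the original TypeError.
html = """
<a>no href here</a>
<a href="/news/article1.ece">one</a>
<a href="/news/image.jpg">two</a>
<a href="/news/article2.ece">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# a.get('href') is None when the attribute is missing, so check it
# before calling endswith() — this avoids the TypeError entirely.
links = [a.get('href') for a in soup.find_all('a')
         if a.get('href') and a.get('href').endswith('.ece')]
print(links)  # ['/news/article1.ece', '/news/article2.ece']
```

The same guard works in the question's loop: replace `link.get('href')` with a check that the value is not None before building `_FULLURL`.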
On the topic of python - scraping only the hyperlinks ending in .ece from an HTML page using BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48145958/