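I want to check a list of URLs (stored in one column of a DataFrame df) for their status codes (404, 403 and 200 seem to be the interesting ones). I defined a function that does the job. However, it uses a for loop, which is inefficient (my list of URLs is long!).
Does anyone have a hint on how to do this more efficiently? Ideally, the returned status code would also be stored in a new column of the DataFrame, e.g. df['status_code_url'].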
import requests

def url_access(df, column):
    # Counters for the three status codes of interest
    e_404 = 0
    e_403 = 0
    e_200 = 0
    for i in range(0, len(df)):
        if requests.head(df[column][i]).status_code == 404:
            e_404 = e_404 + 1
        elif requests.head(df[column][i]).status_code == 403:
            e_403 = e_403 + 1
        elif requests.head(df[column][i]).status_code == 200:
            e_200 = e_200 + 1
        else:
            # Print any other status code
            print(requests.head(df[column][i]).status_code)
    return ("Statistics about " + column, '{:.1%}'.format(e_404/len(df))
            + " of links to Instagram post return 404", '{:.1%}'.format(e_403/len(df))
            + " of links to Instagram post return 403", '{:.1%}'.format(e_200/len(df))
            + " of links to Instagram post return 200")
Many thanks!

Best answer
pandas.DataFrame.apply (or rather the plain requests library) can only make one request at a time. To perform multiple requests in parallel, you can use requests_futures (install it with pip install requests-futures):
import pandas as pd
from requests_futures.sessions import FuturesSession

def get_request(url):
    # Returns a future immediately; the request runs in the background
    session = FuturesSession()
    return session.head(url)

def get_status_code(r):
    # Blocks until the response for this future has arrived
    return r.result().status_code

if __name__ == "__main__":
    urls = ['http://python-requests.org',
            'http://httpbin.org',
            'http://python-guide.org',
            'http://kennethreitz.com']
    df = pd.DataFrame({"url": urls})
    df["status_code"] = df["url"].apply(get_request).apply(get_status_code)
Afterwards, you can use groupby, as @Aritesh suggested in their answer:

stats = df.groupby('status_code')['url'].count().reset_index()
print(stats)
#    status_code  url
# 0          200    1
# 1          301    3
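The question asked for percentages rather than raw counts. A small sketch of that step, written with the standard library so it works on any list of status codes; status_shares is a hypothetical helper name, not something from the answer.

```python
from collections import Counter


def status_shares(status_codes):
    """Map each status code to its share of all checked URLs (0.0 to 1.0)."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    # most_common() orders by frequency, so the dominant codes come first
    return {code: count / total for code, count in counts.most_common()}


shares = status_shares([200, 301, 301, 301])
for code, share in shares.items():
    print(code, '{:.1%}'.format(share))
```

With pandas, df["status_code"].value_counts(normalize=True) should give the same shares directly as a Series.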
With this approach, you may also want to add some safeguards against connection errors and timeouts:
import numpy as np
import pandas as pd
import requests
from requests_futures.sessions import FuturesSession

def get_request(url):
    session = FuturesSession()
    return session.head(url, timeout=1)

def get_status_code(r):
    try:
        return r.result().status_code
    except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
        # No response received; reported here as 408 (Request Timeout)
        return 408

# Build 1000 random IPv4 addresses to use as test URLs
ips = np.random.randint(0, 256, (1000, 4))
df = pd.DataFrame({"url": ["http://" + ".".join(map(str, ip)) for ip in ips]})
df["status_code"] = df["url"].apply(get_request).apply(get_status_code)
df.groupby('status_code')['url'].count().reset_index()
#    status_code  url
# 0          200    3
# 1          302    2
# 2          400    2
# 3          401    1
# 4          403    1
# 5          404    1
# 6          408  990
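The try/except sits in get_status_code rather than in get_request because a future does not raise when the work is submitted; the exception is stored and re-raised when result() is called. The sketch below illustrates this with the standard library only. The functions flaky and status_or_default are made up for the demonstration, and the built-in ConnectionError stands in for the requests exceptions.

```python
from concurrent.futures import ThreadPoolExecutor


def flaky(n):
    """Stand-in for a request: odd inputs fail, even inputs return 200."""
    if n % 2:
        raise ConnectionError(f"cannot reach host {n}")
    return 200


def status_or_default(future, default=408):
    """Return the future's result, or a default if the call failed."""
    try:
        return future.result()  # a stored exception is re-raised here
    except (ConnectionError, TimeoutError):
        return default


with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting never raises, even for the calls that will fail
    futures = [pool.submit(flaky, n) for n in range(4)]

codes = [status_or_default(f) for f in futures]
print(codes)  # [200, 408, 200, 408]
```

Returning 408 for every failure follows the answer above; a separate sentinel such as None would keep "no response" apart from a real 408 sent by a server.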