Problem description
For a given predicted commute duration in minutes, I want to know the range of actual commute times I should expect. For example, if Google Maps predicts my commute to be 20 minutes, what are the minimum and maximum commute times I should expect (perhaps a 95% range)?
Let's import my data into pandas:
%matplotlib inline
import pandas as pd
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
commutes.tail()
This displays the last five rows of the data.
We can easily create a plot that shows the scatter of the raw data, a regression line, and the 95% confidence interval on that line:
import seaborn as sns
# Create a linear model plot (recent seaborn versions require keyword arguments here)
sns.lmplot(x='prediction', y='duration', data=commutes);
How do I now calculate and plot the 95% range of actual commute times versus predicted times?
Put another way, if Google Maps predicts my commute to take 20 minutes, it looks like it could actually take anywhere between roughly 14 and 28 minutes. It would be great to calculate or plot this range.
Thanks in advance for your help.
Recommended answer
The relationship between actual duration of a commute and the prediction should be linear, so I can use quantile regression:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Import the data
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
# Create the quantile regression model
model = smf.quantreg('duration ~ prediction', commutes)
# Create a list of quantiles to calculate (0.05 and 0.95 bracket the central 90% of the data)
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
# Create a list of fits
fits = [model.fit(q=q) for q in quantiles]
# Create a new figure and axes
figure, axes = plt.subplots()
# Plot the scatter of data points
x = commutes['prediction']
axes.scatter(x, commutes['duration'], alpha=0.4)
# Create an array of predictions from the minimum to maximum to create the regression line
_x = np.linspace(x.min(), x.max())
for quantile, fit in zip(quantiles, fits):
    # Plot the fitted line for this quantile
    _y = fit.params['prediction'] * _x + fit.params['Intercept']
    axes.plot(_x, _y, label=quantile)
# Plot the line of perfect prediction
axes.plot(_x, _x, 'g--', label='Perfect prediction')
axes.legend()
axes.set_xlabel('Predicted duration (minutes)')
axes.set_ylabel('Actual duration (minutes)');
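This gives a scatter plot of the data overlaid with the five quantile regression lines and the dashed line of perfect prediction.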
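To answer the original "20 minutes" question numerically, the fitted quantile models can be evaluated at a single predicted duration. The sketch below is only an illustration under stated assumptions: it uses synthetic data in place of commutes.csv so that it runs offline, and the helper name `fit_bounds` is hypothetical. Note that the 0.05 and 0.95 quantiles plotted above bracket the central 90% of the data; a central 95% range needs the 0.025 and 0.975 quantiles, which is what the sketch fits.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for commutes.csv (an assumption, so the sketch runs offline)
rng = np.random.default_rng(0)
prediction = rng.uniform(15, 35, size=500)
duration = 1.1 * prediction + rng.normal(0, 3, size=500)
commutes = pd.DataFrame({'prediction': prediction, 'duration': duration})

model = smf.quantreg('duration ~ prediction', commutes)


def fit_bounds(coverage=0.95):
    """Fit the lower and upper quantile lines that bracket the central `coverage`."""
    tail = (1 - coverage) / 2
    return model.fit(q=tail), model.fit(q=1 - tail)


low_fit, high_fit = fit_bounds(0.95)

# Evaluate both quantile lines at a predicted duration of 20 minutes
new = pd.DataFrame({'prediction': [20]})
lower = low_fit.predict(new).iloc[0]
upper = high_fit.predict(new).iloc[0]
print(f'Predicted 20 min: expect roughly {lower:.1f} to {upper:.1f} min')

# Sanity check: fraction of observations lying between the two fitted lines
inside = ((commutes['duration'] >= low_fit.predict(commutes))
          & (commutes['duration'] <= high_fit.predict(commutes))).mean()
print(f'Share of points inside the band: {inside:.3f}')
```

With the real data, replacing the synthetic frame with the `commutes` DataFrame loaded earlier should be enough; the share of points inside the band is a quick check that the two lines bracket roughly the intended fraction of commutes.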
Many thanks to my colleague Philip for the quantile regression tip.