I've been computing weighted averages with a pandas groupby and numpy's np.average. The problem seems to be missing values (that is, NaN; missing in the data, not in the weights). I've put together a conceptual example below.
The behaviour I want is that when a data value is missing, that record's weight is ignored as well. I can't simply drop the row, because the other data columns are all populated. I thought np.ma.average was exactly what I needed, but it gives me NaN too.
Any suggestions?
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})
def wavg(subdf):
    series = pd.Series()
    for column in df.columns:
        series['np.mean'] = np.mean(subdf['data'])
        series['np.average (no weights)'] = np.average(subdf['data'])
        series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
        series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights'])
    return series

df.groupby('groups').apply(wavg)
This gives me:
         np.mean    np.average    np.average  np.ma.average
                  (no weights)    (weighted)     (weighted)
groups
a       2.666667      2.666667           2.5            2.5
b       3.250000           NaN           NaN            NaN
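For what it's worth, np.ma.average only skips values that are actually masked; a plain ndarray that merely contains NaN carries no mask, which is why the last column comes out NaN as well. A minimal sketch of masking first with np.ma.masked_invalid, using group 'b' from the example:

import numpy as np

data = np.array([4, 2.5, np.nan])   # group 'b' values from the example
weights = np.array([1, 1, 3])       # their Weights

# masked_invalid masks NaN/inf entries; np.ma.average then excludes the
# weights of masked entries from the denominator as well.
masked = np.ma.masked_invalid(data)
print(np.ma.average(masked, weights=weights))   # expected: 3.25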
===================================
For the curious, this is what I ended up using:
def wavg(subdf):
    series = pd.Series()
    # 'columns' is the list of data columns to average (defined elsewhere).
    for column in columns:
        df = subdf.dropna(subset=[column])
        if len(df) == 0:
            series[str(column)] = np.nan
        else:
            series[str(column)] = np.average(df[column], weights=df['Weights'])
    return series
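A loop-free variant of the same idea (just a sketch, assuming the example df above and a single 'data' column) is to zero out the weights wherever the value is missing and then take a ratio of group sums:

w = df['Weights'].where(df['data'].notna(), 0)      # weight becomes 0 where data is NaN
num = (df['data'] * w).groupby(df['groups']).sum()  # NaN * 0 is still NaN, but sum() skips it
den = w.groupby(df['groups']).sum()
weighted_avg = num / den                            # a: 2.5, b: 3.25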
Best answer
Since np.average doesn't handle NaN itself, you have to handle it yourself. The simplest way is to subset subdf before doing anything else with it. Add subdf = subdf.dropna(subset=['data']) at the start of wavg to throw out the rows that have NaN in the 'data' column:
def wavg(subdf):
    series = pd.Series()
    subdf = subdf.dropna(subset=['data'])
    series['np.mean'] = np.mean(subdf['data'])
    series['np.average (no weights)'] = np.average(subdf['data'])
    series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
    series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights'])
    return series
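Running this on the example frame should give roughly the following (group b's two surviving rows both have weight 1, so its weighted and unweighted averages coincide):

         np.mean  np.average (no weights)  np.average (weighted)  np.ma.average (weighted)
groups
a       2.666667                 2.666667                   2.50                      2.50
b       3.250000                 3.250000                   3.25                      3.25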
As I suggested in the comments, I also removed the loop from wavg. You only want one set of averages returned (i.e. one mean, one unweighted average, one weighted average, one masked weighted average), but with the loop you would recompute the same thing four times for each group (since the DataFrame has four columns).

Related question on Stack Overflow (python - np.average not working when missing data using pandas groupby): https://stackoverflow.com/questions/24469565/