Question
I have a dataframe that looks like this:
Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476
I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudo code looks like this:
# Sort by `impwealth` in ascending order
df = df.sort_values('impwealth')
# Find half of the total weight, P
P = df['indweight'].sum() * 0.5
# Find the first row whose cumulative weight reaches P
cum_weight = df['indweight'].cumsum()
i = cum_weight[cum_weight >= P].index[0]
# The value of `impwealth` at this index is the weighted median
w_median = df.loc[i, 'impwealth']
This method seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the pandas reference. What is the best way to find a weighted median?
Recommended answer
Have you tried the wquantiles package? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double-check that it uses the approach you expect).
In [12]: import weighted
In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772
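If you would rather not add a dependency, here is a minimal pandas-only sketch of the cumulative-weight approach described in the question. The helper name weighted_median is made up for illustration, and it assumes the "lower" definition of the weighted median: the smallest value whose cumulative weight reaches half of the total weight, with no interpolation. That is presumably why it returns an actual data point instead of the interpolated-looking figure above, so check which definition you need.

```python
import pandas as pd


def weighted_median(values, weights):
    """Hypothetical helper: smallest value whose cumulative weight
    reaches half of the total weight (no interpolation)."""
    data = pd.DataFrame({"value": list(values), "weight": list(weights)})
    # Sort by value so the cumulative weight runs from smallest to largest value
    data = data.sort_values("value", kind="stable").reset_index(drop=True)
    cum_weight = data["weight"].cumsum()
    half = data["weight"].sum() / 2.0
    # Position of the first row where the cumulative weight is at least half the total
    pos = int((cum_weight >= half).to_numpy().argmax())
    return data.loc[pos, "value"]


# The dataframe from the question
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

print(weighted_median(df["impwealth"], df["indweight"]))  # prints 925000
```

Half of the total weight is 218.9185. After sorting by impwealth, the cumulative weight is 202.743 at 876000 and 256.219 at 925000, so 925000 is the first value to cross the halfway point.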