I have a DataFrame like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

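I'd like to calculate the weighted median of the column `impwealth`, using the frequency weights in `indweight`. My pseudocode looks like this: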
# Sort `impwealth` in ascending order
df.sort('impwealth', inplace=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `indweight` that is greater than P
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']

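This approach seems clunky, and I'm not sure it's correct. I couldn't find a built-in way to do this in the pandas reference. What is the best way to find the weighted median?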

Best answer

If you want to do this in pure pandas, here's one way. Note that it does not interpolate. (@svenkatesh, your pseudocode was missing the cumulative sum.)

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.
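For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same cumulative-sum approach, using the data from the question. The helper name `weighted_median` and its parameters are my own, not part of pandas, and it sorts a copy rather than sorting in place. Like the snippet above, it does not interpolate: it returns the first value whose cumulative weight reaches half of the total weight.

```python
import pandas as pd


def weighted_median(df, value_col, weight_col):
    """Hypothetical helper: non-interpolating weighted median of value_col."""
    # Sort a copy by value so the caller's DataFrame is left untouched.
    ordered = df.sort_values(value_col)
    # Running total of the weights, in order of increasing value.
    cumsum = ordered[weight_col].cumsum()
    # Half of the total weight marks the 50th percentile.
    cutoff = ordered[weight_col].sum() / 2.0
    # First value whose cumulative weight reaches the cutoff.
    return ordered.loc[cumsum >= cutoff, value_col].iloc[0]


# Data copied from the question, including its original index labels.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

result = weighted_median(df, "impwealth", "indweight")
print(result)
```

With the question's data, the total weight is 437.837, so the cutoff is 218.9185. After sorting by `impwealth`, the cumulative weight first reaches the cutoff at the row with `impwealth` 925000.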

Regarding "Python: weighted median algorithm with pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26102867/
