Identify the closest value in a column for each filter using Pandas


Question

I have a data frame with categories and values. Within each category, I need to find the value closest to a given target value. I think I'm close, but I can't get the right output when I apply the results of argsort to the original dataframe.

For example, with the input defined in the code below, the output should contain only (a, 1, True), (b, 2, True), (c, 2, True); all other isClosest values should be False.

If several values are equally close, only the first one listed should be marked.

Here is the code I have, which works, but I can't get the result applied back to the dataframe correctly. I would love some pointers.

import pandas as pd

df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category']==c]
    # positions within this category, ordered by distance to 2.0
    sortargs = (filteredCategories['values']-2.0).abs().argsort()
    #how to use sortargs so that we set column in df isClosest=True if its the closest value in each category to 2.0?

Recommended answer

You can create a column of absolute differences from the target value (2 here):

df['dif'] = (df['values'] - 2).abs()

df
Out:
  category  values  dif
0        a       1    1
1        b       2    0
2        b       3    1
3        b       4    2
4        c       5    3
5        a       4    2
6        b       3    1
7        c       2    0
8        c       1    1
9        a       0    2

Then use groupby.transform to check whether each row's difference equals the minimum difference within its group:

df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']

df
Out:
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False
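
As a minimal sketch of the same idea without keeping the helper column, the check can be wrapped in a small function. The function name `flag_closest` and its parameters are hypothetical, chosen only for illustration; the pandas calls are the same `abs`, `groupby` and `transform('min')` used above.

```python
import pandas as pd


def flag_closest(df, target, group_col="category", value_col="values"):
    """Hypothetical helper: True where a row's distance to `target`
    equals the smallest distance within its group (ties all get True)."""
    dist = (df[value_col] - target).abs()
    # transform('min') broadcasts each group's minimum back to every row
    return dist == dist.groupby(df[group_col]).transform("min")


df = pd.DataFrame({
    "category": ["a", "b", "b", "b", "c", "a", "b", "c", "c", "a"],
    "values": [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
})
df["is_closest"] = flag_closest(df, 2)
```

With the sample data this marks rows 0, 1 and 7, the same rows as in the output above.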

df.groupby('category')['dif'].idxmin() also gives you the index label of the closest value in each category. You can use that for the mapping too.

To select:

df.loc[df.groupby('category')['dif'].idxmin()]
Out:
  category  values  dif
0        a       1    1
1        b       2    0
7        c       2    0

To assign:

df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False
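
To make the idxmin result more concrete, the sketch below inspects what it returns and builds the flag without storing a `dif` column. It assumes the dataframe has a unique index, since idxmin returns index labels; the variable names `dist` and `closest_idx` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "b", "b", "b", "c", "a", "b", "c", "c", "a"],
    "values": [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
})

# Distance of every row to the target value 2
dist = (df["values"] - 2).abs()

# One entry per category: the index label of the first row
# with the smallest distance in that category
closest_idx = dist.groupby(df["category"]).idxmin()

# Flag exactly those rows (assumes df.index has no duplicate labels)
df["is_closest"] = df.index.isin(closest_idx)
```

Here `closest_idx` is a Series indexed by category, so it can also be passed directly to `df.loc[...]` as in the selection example.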

The difference between the two approaches shows up when there are ties. If you check equality against the group minimum, every tied row is marked True. With idxmin, only the first occurrence is marked True (exactly one row per group), which matches the requirement in the question.
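
The sample data in the question happens to contain no ties at the minimum, so both approaches give the same result there. The following sketch uses a small made-up frame (not from the original post) in which both categories have two equally close values, to show where the results diverge.

```python
import pandas as pd

# Made-up data: in 'a', values 1 and 3 are both at distance 1 from 2;
# in 'b', both rows are exactly 2
ties = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "values": [1, 3, 5, 2, 2],
})

dist = (ties["values"] - 2).abs()
grouped = dist.groupby(ties["category"])

# Equality check: every row tied for the minimum is True
by_equality = dist == grouped.transform("min")

# idxmin: only the first row with the minimum in each group is True
by_idxmin = ties.index.isin(grouped.idxmin())

print(by_equality.tolist())  # [True, True, False, True, True]
print(by_idxmin.tolist())    # [True, False, False, True, False]
```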


09-06 05:47