Pandas: Comparing rows within a group

Problem description

I have a dataframe that is grouped by 'KEY'. I need to compare the rows within each group to decide whether to keep every row of the group or just one row of it.

The condition for keeping all rows of a group: if one row has color 'red', area 12, and shape 'circle', AND another row in the same group has color 'green', area 13, and shape 'square', then I want to keep all rows in that group. Otherwise, if no such pair of rows exists, I want to keep only the row of that group with the largest 'num' value.

import pandas as pd

df = pd.DataFrame({'KEY': ['100000009', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
              'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120306,20120405],
              'shape': ['circle', 'square', 'circle','circle','circle','circle','circle','circle'],
              'num': [3,4,5,6,7,8,9,10],
              'area': [12, 13, 12,12,12,12,12,12],
              'color': ['red', 'green', 'red','red','red','red','red','red']})


    Date1       KEY        area color   num shape
0   2012-05-06  100000009   12  red     3   circle
1   2012-05-06  100000009   13  green   4   square
2   2012-05-07  100000009   12  red     5   circle
3   2012-06-08  100000009   12  red     6   circle
4   2012-06-20  100000009   12  red     7   circle
5   2012-02-06  100000034   12  red     8   circle
6   2012-03-06  100000034   12  red     9   circle
7   2012-04-05  100000034   12  red     10  circle

Expected result:

    Date1       KEY        area color   num shape
0   2012-05-06  100000009   12  red     3   circle
1   2012-05-06  100000009   13  green   4   square
2   2012-05-07  100000009   12  red     5   circle
3   2012-06-08  100000009   12  red     6   circle
4   2012-06-20  100000009   12  red     7   circle
7   2012-04-05  100000034   12  red     10  circle

I am new to Python, and groupby is throwing me a curve ball. This is what I have tried so far:

# Rows holding the largest 'num' within each KEY group.
maxnum = df.groupby('KEY')['num'].transform('max')
max_rows = df.loc[df['num'] == maxnum]

# Row-wise boolean masks for the two patterns (not yet combined per group).
cond1 = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
cond2 = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')

Recommended answer

Define a custom function called function:

def function(x):
    i = x.query(
        'area == 12 and color == "red" and shape == "circle"'
    )
    j = x.query(
        'area == 13 and color == "green" and shape == "square"'
    )
    return x if not (i.empty or j.empty) else x[x.num == x.num.max()].head(1)

This function tests each group on the specified conditions and returns rows as appropriate. In particular, it queries on the conditions and tests for emptiness using df.empty.
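To make the emptiness test concrete, here is a minimal sketch of what query and .empty do on a single group. The frame `group` and the names `reds`, `greens`, `blues`, and `both_present` are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical one-group frame, used only to illustrate query() and .empty.
group = pd.DataFrame({
    'area': [12, 13, 12],
    'color': ['red', 'green', 'red'],
    'shape': ['circle', 'square', 'circle'],
    'num': [3, 4, 5],
})

# Each query returns the sub-frame of rows matching one pattern.
reds = group.query('area == 12 and color == "red" and shape == "circle"')
greens = group.query('area == 13 and color == "green" and shape == "square"')

# A pattern that matches nothing yields an empty frame.
blues = group.query('color == "blue"')

# The group keeps all its rows only when neither sub-frame is empty.
both_present = not (reds.empty or greens.empty)
print(len(reds), len(greens), blues.empty, both_present)
```

If either pattern is missing from the group, one of the sub-frames is empty and the function falls back to the row with the largest 'num'.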

Pass this to groupby + apply:

df.groupby('KEY', group_keys=False).apply(function)


      Date1        KEY  area  color  num   shape
0  20120506  100000009    12    red    3  circle
1  20120506  100000009    13  green    4  square
2  20120507  100000009    12    red    5  circle
3  20120608  100000009    12    red    6  circle
4  20120620  100000009    12    red    7  circle
7  20120405  100000034    12    red   10  circle
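As an alternative to apply, the same selection can probably be expressed with boolean masks and groupby transforms. This is only a sketch under the assumptions of the sample data in the question; the names `is_red`, `is_green`, `keep_all`, `top_rows`, and `result` are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['100000009'] * 5 + ['100000034'] * 3,
    'Date1': [20120506, 20120506, 20120507, 20120608,
              20120620, 20120206, 20120306, 20120405],
    'shape': ['circle', 'square'] + ['circle'] * 6,
    'num': [3, 4, 5, 6, 7, 8, 9, 10],
    'area': [12, 13, 12, 12, 12, 12, 12, 12],
    'color': ['red', 'green'] + ['red'] * 6,
})

# Row-wise masks for the two patterns described in the question.
is_red = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
is_green = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')

# True for every row of a group that contains at least one row of each pattern.
keep_all = (is_red.groupby(df['KEY']).transform('any')
            & is_green.groupby(df['KEY']).transform('any'))

# Index label of the first row with the largest 'num' in each group.
top_rows = df.groupby('KEY')['num'].idxmax()

# Keep whole qualifying groups, plus the top row of every other group.
result = df[keep_all | df.index.isin(top_rows)]
print(result)
```

On the sample data this keeps rows 0 to 4 and row 7, matching the expected result. Like the head(1) call in the apply version, idxmax picks the first row when several rows tie for the largest 'num'.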
