SQL-like window functions in pandas: row numbering in a Python pandas DataFrame

Question

I come from a SQL background and I frequently use the following data processing step:

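  1. Partition the table of data by one or more fields
  2. For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending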

Example:

import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})
df
     data1        data2     key1
0    1            1         a
1    2            10        a
2    2            2         a
3    3            3         b
4    3            30        a

I'm looking for the pandas equivalent of this SQL window function:

RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)


    data1        data2     key1    RN
0    1            1         a       1
1    2            10        a       2
2    2            2         a       3
3    3            3         b       1
4    3            30        a       4

I've tried the following, which works when there are no 'partitions':

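def row_number(frame, orderby_columns, orderby_direction, name):
    # Sort in place, then number the rows 0..n-1 in the sorted order.
    # sort_values replaces the removed sort_index(by=...); range replaces Python 2's xrange.
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))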

I tried to extend this idea to work with partitions (groups in pandas), but the following didn't work:

df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)

But I just get a lot of NaNs when I do this.
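A likely reason for the NaNs, offered as a guess rather than a diagnosis: sorting with inplace=True returns None, and nf has no return statement, so apply gets nothing back to align with the original rows. Below is a minimal sketch of the same per-group idea that returns values instead. It assumes a pandas version with sort_values (the replacement for sort_index(by=...)), and the names parts, part and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

parts = []
for _, part in df.groupby('key1'):
    # Sort each partition: data1 ascending, data2 descending.
    part = part.sort_values(['data1', 'data2'], ascending=[True, False])
    # Number the sorted rows 1..n and keep the new frame (nothing in place).
    parts.append(part.assign(rn=list(range(1, len(part) + 1))))

# Stitch the partitions back together in the original row order.
result = pd.concat(parts).sort_index()
print(result)
```

The explicit loop is verbose, but it makes each step of the partition, sort and number sequence visible.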

Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?

Recommended answer

You can do this by using groupby twice along with the rank method:

In [11]: g = df.groupby('key1')

Use the method='min' argument to give rows that share the same data1 value the same RN:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['RN'] = g['data1'].rank(method='min')
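To see what method='min' does compared with the other tie-breaking options, here is a small standalone sketch. The Series s and the ranks frame are made up for illustration and are not part of the answer.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

ranks = pd.DataFrame({
    'value': s,
    'average': s.rank(),              # default: tied values share the mean rank
    'min': s.rank(method='min'),      # tied values share the lowest rank, like SQL RANK()
    'dense': s.rank(method='dense'),  # like 'min' but with no gaps, like SQL DENSE_RANK()
    'first': s.rank(method='first'),  # ties broken by order of appearance
})
print(ranks)
```

With 'min', the two tied values both get rank 2 and the next value jumps to 4, which leaves room for the second ranking step to fill in the gap.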

Then group by these results and add the rank with respect to data2:

In [14]: g1 = df.groupby(['key1', 'RN'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
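The two ranking steps can be gathered into one helper. This is a sketch under stated assumptions, not the answerer's code: the function name row_number_by_rank is hypothetical, the column names are hard-coded to this example, and method='first' in the second rank is a swapped-in choice so that rows tying on both data1 and data2 still receive distinct integers (the default 'average' method would produce fractional values there).

```python
import pandas as pd

def row_number_by_rank(frame):
    """Hypothetical helper: two-step rank, columns hard-coded for this example."""
    out = frame.copy()
    # Step 1: rank data1 within each key1 partition; ties share the lowest rank.
    out['RN'] = out.groupby('key1')['data1'].rank(method='min')
    # Step 2: within each (key1, RN) tie group, rank data2 descending, 0-based.
    offset = out.groupby(['key1', 'RN'])['data2'].rank(ascending=False, method='first') - 1
    # Combine the two and cast the float ranks to integers.
    out['RN'] = (out['RN'] + offset).astype(int)
    return out

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})
numbered = row_number_by_rank(df)
print(numbered)
```

The helper works on a copy, so the caller's frame is left unchanged.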

It feels like there ought to be a native way to do this (there may well be!...).
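One candidate for a more direct route, assuming a pandas version that provides DataFrame.sort_values and GroupBy.cumcount: sort by the ORDER BY columns first, then count rows within each partition. This is a sketch of a commonly used idiom, not something the answer above confirms; the variable name ordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# ORDER BY data1 ASC, data2 DESC
ordered = df.sort_values(['data1', 'data2'], ascending=[True, False])

# PARTITION BY key1: cumcount numbers rows 0..n-1 within each group in the
# sorted order; the assignment aligns on the index, so df keeps its row order.
df['RN'] = ordered.groupby('key1').cumcount() + 1
print(df)
```

Because cumcount counts positions rather than ranking values, rows that tie on every sort column still get distinct integers, which matches how ROW_NUMBER() behaves.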

