Suppose I have the following pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],
'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
df['date'] = pd.to_datetime(df['date'])  # parse the strings into datetime values


DataFrame 1

            date            name
0   2015-01-31 07:14:39     Dave
1   2014-12-16 22:50:55     Lisa
2   2015-04-12 23:29:11     John
3   2015-04-08 17:57:29     Lisa
4   2015-01-30 03:51:12     Simon
5   2015-02-20 10:33:48     Simon
6   2014-12-15 23:54:03     Simon
7   2014-12-16 19:53:53     Simon
8   2014-12-18 00:15:02     Lisa
9   2015-04-01 21:36:55     Dave
10  2015-04-13 23:25:55     Dave
11  2015-02-18 14:10:40     John
12  2015-02-27 04:56:33     Lisa


DataFrame 2

    name           datemax
0   Dave    2015-04-13 23:25:55
1   John    2015-04-12 23:29:11
2   Lisa    2015-04-08 17:57:29
3   Simon   2015-02-20 10:33:48


The "date" and "datemax" columns contain datetime objects.
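Since the comparison below relies on both columns really being datetimes, here is a minimal sketch of how the two frames might be set up. It uses only a three-row subset of the sample data, and the name `df2` for DataFrame 2 is an assumption (the post never shows how that frame is built).

```python
import pandas as pd

# Small subset of the sample data, for illustration only.
df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John"],
    "date": ["2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11"],
})

# "df2" is an assumed name for DataFrame 2.
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa"],
    "datemax": ["2015-04-13 23:25:55", "2015-04-12 23:29:11", "2015-04-08 17:57:29"],
})

# Parse the strings so that "<" compares points in time, not text.
df["date"] = pd.to_datetime(df["date"])
df2["datemax"] = pd.to_datetime(df2["datemax"])

print(df.dtypes)
print(df2.dtypes)
```

If the columns are left as strings, the comparison may still appear to work for this fixed-width format, but parsing them is the safer choice.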

I need to group DataFrame 1 by "name" and randomly pick one date per group, but the picked date must be earlier than that name's "datemax" in the second DataFrame (DataFrame 2).

The real DataFrames I am working with are much larger than the ones in this example, so I need a fast way to do this.

Best answer

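First, I would filter out all the dates that do not meet the condition. (df2a is not defined in this snippet; the lookup df2a.loc[x.name, "datemax"] implies that it is DataFrame 2 indexed by "name".)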

In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0    2015-04-13 23:25:55
1    2015-04-08 17:57:29
2    2015-04-12 23:29:11
3    2015-04-08 17:57:29
4    2015-02-20 10:33:48
5    2015-02-20 10:33:48
6    2015-02-20 10:33:48
7    2015-02-20 10:33:48
8    2015-04-08 17:57:29
9    2015-04-13 23:25:55
10   2015-04-13 23:25:55
11   2015-04-12 23:29:11
12   2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]

In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0      True
1      True
2     False
3     False
4      True
5     False
6      True
7      True
8      True
9      True
10    False
11     True
12     True
Name: date, dtype: bool

In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]

In [14]: df_old
Out[14]:
                  date   name
0  2015-01-31 07:14:39   Dave
1  2014-12-16 22:50:55   Lisa
4  2015-01-30 03:51:12  Simon
6  2014-12-15 23:54:03  Simon
7  2014-12-16 19:53:53  Simon
8  2014-12-18 00:15:02   Lisa
9  2015-04-01 21:36:55   Dave
11 2015-02-18 14:10:40   John
12 2015-02-27 04:56:33   Lisa
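Because the question asks for speed, it may be worth noting that the per-group lambda above can probably be replaced by a plain lookup. The sketch below is an alternative to the answer's transform, not the answer's own method: it maps each name to its cutoff with `Series.map`. The names `df2`, `datemax_by_name` and `cutoff` are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})

# Assumed name for DataFrame 2.
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Series of cutoffs indexed by name, used as a lookup table.
datemax_by_name = df2.set_index("name")["datemax"]

# One cutoff per row of df, aligned with df's index; no Python-level lambda.
cutoff = df["name"].map(datemax_by_name)

# Keep only rows strictly earlier than their name's cutoff.
df_old = df[df["date"] < cutoff]
print(df_old)
```

A name missing from DataFrame 2 would map to NaT, and the comparison would then be False, so such rows are dropped rather than raising an error.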


Now it becomes a much easier problem: pick a random row by name.

import numpy as np
df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])

In [21]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[21]:
                     date
name
Dave  2015-04-01 21:36:55
John  2015-02-18 14:10:40
Lisa  2014-12-16 22:50:55
Simon 2014-12-15 23:54:03

In [22]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[22]:
                     date
name
Dave  2015-01-31 07:14:39
John  2015-02-18 14:10:40
Lisa  2014-12-18 00:15:02
Simon 2014-12-16 19:53:53
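As a possible alternative to the lambda-based agg, newer pandas versions (1.1 and later, as far as I know) provide a `sample` method on groupby objects. The sketch below assumes such a version is installed and rebuilds the filtered frame from the output shown earlier; `picked` is a hypothetical variable name, and `random_state` is set only to make the example repeatable.

```python
import pandas as pd

# The filtered frame from the previous step, with its original index labels.
df_old = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-01-30 03:51:12",
            "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
            "2015-04-01 21:36:55", "2015-02-18 14:10:40", "2015-02-27 04:56:33",
        ]),
        "name": ["Dave", "Lisa", "Simon", "Simon", "Simon",
                 "Lisa", "Dave", "John", "Lisa"],
    },
    index=[0, 1, 4, 6, 7, 8, 9, 11, 12],
)

# Draw one random row per name; drop random_state for a fresh draw each time.
picked = df_old.groupby("name").sample(n=1, random_state=0)
print(picked)
```

Unlike the agg version, the result keeps the original row labels, which makes it easy to trace each pick back to DataFrame 1.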

Regarding python - random row selection per group in a pandas DataFrame with a boolean condition, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34299672/
