Below is a small sample of my very large DataFrame:
In [38]: df
Out[38]:
Send_Customer Pay_Customer Send_Time
0 1000000000284044644 1000000000251680999 2016-08-01 09:55:48
1 2000000000223021617 1000000000190078650 2016-08-01 02:44:23
2 2000000000289301033 1000000000309048473 2016-08-01 09:20:14
3 1000000000333893941 1000000000333956151 2016-08-01 09:20:14
4 1000000000340371553 2000000000103942022 2016-08-01 09:20:14
5 2000000000098132192 2000000000089264458 2016-08-01 09:21:27
6 1000000000007716594 2000000000144437513 2016-08-01 09:20:54
7 1000000000135884145 1000000000278399847 2016-08-01 09:21:43
8 2000000000141318366 2000000000151080468 2016-08-01 09:20:46
9 1000000000056842546 2000000000139908360 2016-08-01 09:20:55
10 1000000000275051425 2000000000254558241 2016-08-01 09:20:17
11 1000000000162362467 1000000000340653197 2016-08-01 09:23:45
12 1000000000039529533 1000000000072903285 2016-08-01 09:22:56
13 1000000000034147075 2000000000079408765 2016-08-01 09:20:17
14 1000000000319501203 1000000000337830072 2016-08-01 09:20:20
15 1000000000025289495 2000000000287368163 2016-08-01 09:20:31
16 1000000000043110429 1000000000209850047 2016-08-01 09:22:33
I need to find out how many non-unique or unique Send_Customer values there are for Pay_Customers within 10 hours. This is the approach I am using:
In [39]: df['time_diff'] = df.groupby('Send_Customer')['Send_Time'].apply(lambda x : x.diff().abs())
In [41]: df[df['time_diff']<=dt.timedelta(seconds=36000)]
Out[41]:
Send_Customer Pay_Customer Send_Time \
4361 1000000000284044644 1000000000326834813 2016-08-01 14:32:17
7530 2000000000223021617 1000000000340199555 2016-08-01 04:49:41
10937 2000000000148219588 1000000000312697109 2016-08-01 04:49:40
12876 1000000000339947901 2000000000218218239 2016-08-01 14:51:51
13553 1000000000248905073 1000000000248729812 2016-08-01 16:44:35
14281 2000000000270573223 1000000000341120021 2016-08-01 09:35:11
time_diff
4361 00:10:37
7530 00:17:06
10937 01:09:45
12876 00:53:59
13553 01:12:17
14281 05:19:34
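This approach only partly works, because applying .diff() to ['Send_Time'] drops the first row of each group, the row the differences are measured from. Any ideas on how to keep those rows?
Best answer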
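If I understand correctly: after diff, the first row of each group is NaT, and a NaT never satisfies the <= comparison, so those rows are filtered out. To keep them, you can replace the NaT values with a value that your condition will not filter out, such as 0.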
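A minimal sketch of why the first row disappears, using a made-up two-row series rather than the asker's data: `diff()` has nothing to subtract from the first timestamp, so it yields NaT, and comparing NaT against a timedelta evaluates to False.

```python
import pandas as pd

# Hypothetical two-row example; the timestamps are illustrative only.
times = pd.Series(pd.to_datetime(["2016-08-01 09:00:00",
                                  "2016-08-01 09:30:00"]))

# The first element has no predecessor, so its difference is NaT.
gaps = times.diff().abs()

# NaT compares as False, so a boolean filter silently drops that row.
mask = gaps <= pd.Timedelta(hours=10)
```

Here `mask` is False for the first row and True for the second, which matches the behaviour described in the question.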
Here, I simply append .fillna(0) to the end of the first statement:
df['time_diff'] = df.groupby('Send_Customer')['Send_Time'].apply(
    lambda x: x.diff().abs()
).fillna(0)
df[df['time_diff'] <= dt.timedelta(seconds=36000)]
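The same idea can be sketched in a way that may be more robust on recent pandas releases, where filling a timedelta column with a bare integer, or assigning the result of `groupby(...).apply(...)` back to a column, might not behave as it did when this answer was written. The frame below is a made-up miniature with shortened IDs, not the asker's data, and `within_10h` is a name introduced here for illustration. It computes the gap with `groupby(...).shift()` and fills NaT with an explicit `pd.Timedelta(0)`.

```python
import pandas as pd

# Hypothetical miniature of the question's frame (IDs shortened).
df = pd.DataFrame({
    "Send_Customer": [1, 2, 1, 2, 3],
    "Pay_Customer": [10, 20, 30, 40, 50],
    "Send_Time": pd.to_datetime([
        "2016-08-01 09:55:48",
        "2016-08-01 02:44:23",
        "2016-08-01 14:32:17",
        "2016-08-01 23:00:00",
        "2016-08-01 09:20:14",
    ]),
})

# Previous timestamp of the same sender; NaT on each sender's first row.
prev_time = df.groupby("Send_Customer")["Send_Time"].shift()

# Absolute gap to the previous row of the same sender.
time_diff = (df["Send_Time"] - prev_time).abs()

# Fill NaT with a zero Timedelta so first rows pass the <= 10 hours filter.
df["time_diff"] = time_diff.fillna(pd.Timedelta(0))

within_10h = df[df["time_diff"] <= pd.Timedelta(hours=10)]
```

In this sketch the first row of every sender is kept with a gap of zero, the second row of sender 1 is kept (gap of 04:36:29), and the second row of sender 2 is dropped because its gap exceeds 10 hours.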
Regarding python - Complex time manipulation in pandas, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39602785/