Problem description
I have a data frame like this:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted within each block, and the InvInstance variable gives the number of rows to the end of the current Time block, counting the row itself. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
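import pandas as pd
import numpy as np

df = pd.read_csv("test.csv")
df["10mMark"] = 0
mark_col = df.columns.get_loc("10mMark")
for i in range(1, len(df)):
    r = int(df.InvInstance.iloc[i])        # rows left in the block, counting row i
    rprev = int(df.InvInstance.iloc[i-1])
    # time from this row (and from the previous row) to the end of its block
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    # positional assignment instead of the chained df["10mMark"].iloc[i] = ...
    df.iloc[i, mark_col] = int((m < 10) and (mprev >= 10))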
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was at least 10 in the previous row. Is it possible to do this without a loop, for example with shift(), so that it runs much faster?
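For anyone who wants to reproduce this without test.csv, here is a minimal sketch that builds the sample above inline and wraps the loop in a function. The name mark_loop is hypothetical (it is not from the question), and the marks are collected in a plain list rather than written through chained indexing.

```python
import pandas as pd

# Sample data copied from the table above (stands in for test.csv)
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

def mark_loop(frame):
    """Reference loop: 1 where the time to the block end drops below 10 at this row."""
    time = frame["Time"].tolist()
    inv = frame["InvInstance"].tolist()
    marks = [0] * len(frame)
    for i in range(1, len(frame)):
        m = time[i + inv[i] - 1] - time[i]                  # this row to its block end
        mprev = time[i - 1 + inv[i - 1] - 1] - time[i - 1]  # previous row to its block end
        marks[i] = int(m < 10 and mprev >= 10)
    return marks

df["10mMark"] = mark_loop(df)
print(df)
```

This can serve as a baseline when checking that a vectorized version gives the same marks.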
Actually, your m is the time delta between the time of a row and the time at the end of its 'block', and mprev is the same thing for the previous row (so it is just m shifted by one row). My idea is to create a column containing the time at the end of the block: first identify each block, then use groupby on block to get the last time and merge it back. Then compute the difference to create a column 'm', and use np.where and shift to finally fill the column 10mMark.
# number the block-end rows (InvInstance == 1) incrementally; other rows get NaN
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so that every row of a block carries the same block number
df['block'] = df['block'].bfill()
# merge in a column Time_last holding the time at the end of each block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# m is simply the time remaining until the end of the block
df['m'] = df['Time_last'] - df['Time']
# mark rows where m is below 10 while the previous row's m was still >= 10
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns (the positional axis argument, drop([...], 1), no longer works in pandas 2.x)
df = df.drop(columns=['block', 'Time_last', 'm'])
Before dropping the helper columns, the intermediate result (to see what has been created) looks like this:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result.
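As a possible variation (not what the answer above does), the merge can be skipped by broadcasting each block's last Time with groupby(...).transform("last"). This is only a sketch on the sample data; the block numbering here starts at 0 and is built with eq/cumsum/shift, which is my own choice rather than something from the answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

# Block id: number of block-end rows (InvInstance == 1) seen strictly before each row
block = df["InvInstance"].eq(1).cumsum().shift(fill_value=0)

# Broadcast each block's last Time to all of its rows, then take the difference
m = df.groupby(block)["Time"].transform("last") - df["Time"]

# Same crossover test as above; the NaN produced by shift() in row 0 compares as False
df["10mMark"] = np.where((m < 10) & (m.shift() >= 10), 1, 0)
print(df)
```

No helper columns are added to df this way, so there is nothing to drop afterwards.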
It is not as efficient as @MaxU's solution with Numba, but with a df of 8000 rows like the one he used, I get a speed-up factor of about 350.
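Since InvInstance already encodes the distance to the block end, another option is to index the Time values directly with NumPy and skip groupby altogether. The sketch below is not from either answer and has not been benchmarked here; it assumes InvInstance is accurate for every row and that rows are addressed by position.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

time = df["Time"].to_numpy()
inv = df["InvInstance"].to_numpy()

# Position of the last row of each row's block: i + InvInstance[i] - 1
end_pos = np.arange(len(df)) + inv - 1
m = time[end_pos] - time  # time remaining until the block end

# Row 0 has no previous row, so it stays 0, as in the original loop
marks = np.zeros(len(df), dtype=int)
marks[1:] = (m[1:] < 10) & (m[:-1] >= 10)

df["10mMark"] = marks
print(df)
```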