Comparing values in a Pandas DataFrame column using offset values from another column

Problem description

I have a data frame like this:

Time    InvInstance
5          5
8          4
9          3
19         2
20         1
3          3
8          2
13         1

The Time column is sorted within each block, and InvInstance gives the number of rows left to the end of the current Time block, counting the current row. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:

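import pandas as pd
import numpy as np

df = pd.read_csv("test.csv")

df["10mMark"] = 0
mark_col = df.columns.get_loc("10mMark")
for i in range(1, len(df)):
    # rows left to the end of the block, for this row and the previous one
    r = int(df["InvInstance"].iloc[i])
    rprev = int(df["InvInstance"].iloc[i - 1])
    # time from each row to the last row of its block
    m = df["Time"].iloc[i + r - 1] - df["Time"].iloc[i]
    mprev = df["Time"].iloc[i - 1 + rprev - 1] - df["Time"].iloc[i - 1]
    # one positional assignment instead of the chained df[...].iloc[i] = ...
    df.iloc[i, mark_col] = int((m < 10) and (mprev >= 10))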

And the desired output is:

Time  InvInstance  10mMark
 5            5        0
 8            4        0
 9            3        0
19            2        1
20            1        0
 3            3        0
 8            2        1
13            1        0

To be more specific: there are two sorted time blocks in the Time column, and for each row the value of InvInstance tells us the distance (in rows) to the end of its block. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was at least 10 minutes for the previous row. Is it possible to do this without a loop, for example with shift(), so that it runs much faster?
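
For anyone who wants to reproduce the rule without test.csv, here is a minimal sketch that builds the sample data inline and applies the same row-by-row logic. The helper name mark_crossover_loop and its threshold parameter are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Sample data from the question, built inline instead of reading test.csv.
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})


def mark_crossover_loop(frame, threshold=10):
    """Row-by-row reference version of the 10mMark rule (hypothetical helper)."""
    time = frame["Time"].to_numpy()
    inv = frame["InvInstance"].to_numpy()
    marks = [0] * len(frame)
    for i in range(1, len(frame)):
        # Time remaining to the block end for this row and for the previous row.
        m = time[i + inv[i] - 1] - time[i]
        mprev = time[i - 1 + inv[i - 1] - 1] - time[i - 1]
        marks[i] = int(m < threshold and mprev >= threshold)
    return pd.Series(marks, index=frame.index, name="10mMark")


df["10mMark"] = mark_crossover_loop(df)
print(df)
```

Working on plain NumPy arrays inside the loop avoids repeated .iloc lookups, but it is still one Python iteration per row.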

Solution

Your m is the time delta between a row's time and the time at the end of its block, and mprev is the same quantity for the previous row, so it is simply m shifted by one row. My idea is to create a column holding the time at the end of the block: first identify each block, then group by block, take the last Time, and merge it back. After that, compute the difference as a column m, and use np.where with shift to fill the 10mMark column.

# number each block end (InvInstance == 1) incrementally; other rows get NaN
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so that every row of a block carries the same block number
df['block'] = df['block'].bfill()
# merge in the last Time of each block as the column Time_last
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# m is the time remaining until the end of the block
df['m'] = df['Time_last'] - df['Time']
# mark rows where m is below 10 and the previous row's m was at least 10
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns (axis must be given by keyword in recent pandas)
df = df.drop(columns=['block', 'Time_last', 'm'])

Before dropping the helper columns, the intermediate result looks like this:

   Time  InvInstance  block  Time_last   m  10mMark
0     5            5    1.0         20  15        0
1     8            4    1.0         20  12        0
2     9            3    1.0         20  11        0
3    19            2    1.0         20   1        1
4    20            1    1.0         20   0        0
5     3            3    2.0         13  10        0
6     8            2    2.0         13   5        1
7    13            1    2.0         13   0        0

The 10mMark column contains the expected result.
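
As a possible variation on the same idea (not the code from the answer), the merge can be replaced with groupby(...).transform("last"), which returns the block's last Time aligned to every row. This is only a sketch; the block labels here come from a shifted cumsum rather than cumsum plus bfill, and the intermediate names block and m are kept as local variables instead of columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

# A new block starts on the row after each block end (InvInstance == 1),
# so shifting the block-end mask by one row and cumulating labels the blocks.
block = df["InvInstance"].eq(1).shift(fill_value=False).cumsum().rename("block")

# Time remaining until the last row of each block, aligned to every row.
m = df.groupby(block)["Time"].transform("last") - df["Time"]

# Crossover: below 10 on this row, at least 10 on the previous row.
df["10mMark"] = ((m < 10) & (m.shift() >= 10)).astype(int)
print(df)
```

Because nothing is merged, the original frame keeps its index and no helper columns need to be dropped afterwards.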

It is not as efficient as @MaxU's Numba solution, but on a data frame of 8000 rows like the one he used, I get a speed-up factor of about 350.
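
To get a rough feel for the difference on your own machine, a sketch along these lines may help. The 8000-row data frame here is synthetic (1000 blocks of 8 rows with random times), not the data used in the thread, and the function names are invented for this example, so the measured ratio will not necessarily match the figure quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def make_blocks(n_blocks=1000, block_len=8, seed=0):
    """Synthetic data: each block holds sorted times and a countdown to its end."""
    rng = np.random.default_rng(seed)
    times = np.sort(rng.integers(0, 60, size=(n_blocks, block_len)), axis=1)
    inv = np.tile(np.arange(block_len, 0, -1), n_blocks)
    return pd.DataFrame({"Time": times.ravel(), "InvInstance": inv})


def mark_loop(frame):
    """Row-by-row version using positional pandas access, as in the question."""
    time, inv = frame["Time"], frame["InvInstance"]
    marks = np.zeros(len(frame), dtype=int)
    for i in range(1, len(frame)):
        m = time.iloc[i + inv.iloc[i] - 1] - time.iloc[i]
        mprev = time.iloc[i - 1 + inv.iloc[i - 1] - 1] - time.iloc[i - 1]
        marks[i] = int(m < 10 and mprev >= 10)
    return marks


def mark_vectorized(frame):
    """Vectorized version: label blocks, take each block's last time, then shift."""
    ends = frame["InvInstance"].eq(1)
    block = ends.shift(fill_value=False).cumsum().rename("block")
    m = frame.groupby(block)["Time"].transform("last") - frame["Time"]
    return ((m < 10) & (m.shift() >= 10)).astype(int).to_numpy()


data = make_blocks()  # 1000 blocks x 8 rows = 8000 rows
same = np.array_equal(mark_loop(data), mark_vectorized(data))
print("results match:", same)

t_loop = timeit.timeit(lambda: mark_loop(data), number=1)
t_vec = timeit.timeit(lambda: mark_vectorized(data), number=10) / 10
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s, ratio: {t_loop / t_vec:.0f}x")
```

Checking that both versions agree before timing them guards against comparing the speed of two functions that compute different things.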

07-26 11:32