我有一个由时间戳索引的大数据帧,在该数据帧中,我希望根据时间范围将行分配给组。
例如,在下面的数据中,我已经在组中第一个条目的1毫秒内对行进行了分组。
groupid
1999-12-31 23:59:59.000107 1
1999-12-31 23:59:59.000385 1
1999-12-31 23:59:59.000404 1
1999-12-31 23:59:59.000704 1
1999-12-31 23:59:59.001281 2
1999-12-31 23:59:59.002211 2
1999-12-31 23:59:59.002367 3
我有工作代码,通过迭代行并使用当前行分割数据帧来完成此操作:
dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, dts)
print df.head(20)
groupid = 1
for dt, row in df.iterrows():
if df.loc[row.name].groupid:
continue
end = dt + timedelta(milliseconds=1)
group = df.loc[dt:end]
df.loc[group.index, 'groupid'] = groupid
groupid += 1
print df.head(20)
但是,与iterows一样,在大型数据帧上的操作也很慢。我已经尝试过应用一个函数和使用groupby,但是没有成功。使用iTurtuples是我为提高性能所能做的最好的事情吗(我现在要尝试一下)?有人能给点建议吗?
最佳答案
好的,我认为下面是您想要的,这将通过用第一个值减去所有值,从索引中构造一个TimeDelta。然后访问微秒组件并除以1000,然后将Series dtype转换为int:
In [86]:
df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
groupid
1999-12-31 23:59:59.000133 0
1999-12-31 23:59:59.000584 0
1999-12-31 23:59:59.003544 3
1999-12-31 23:59:59.009193 9
1999-12-31 23:59:59.010220 10
1999-12-31 23:59:59.010632 10
1999-12-31 23:59:59.010716 10
1999-12-31 23:59:59.011387 11
1999-12-31 23:59:59.011837 11
1999-12-31 23:59:59.013277 13
1999-12-31 23:59:59.013305 13
1999-12-31 23:59:59.014754 14
1999-12-31 23:59:59.016015 15
1999-12-31 23:59:59.016067 15
1999-12-31 23:59:59.017788 17
1999-12-31 23:59:59.018236 18
1999-12-31 23:59:59.021281 21
1999-12-31 23:59:59.021772 21
1999-12-31 23:59:59.021927 21
1999-12-31 23:59:59.022200 22
1999-12-31 23:59:59.023104 22
1999-12-31 23:59:59.023375 23
1999-12-31 23:59:59.023688 23
1999-12-31 23:59:59.023726 23
1999-12-31 23:59:59.025397 25
1999-12-31 23:59:59.026407 26
1999-12-31 23:59:59.026480 26
1999-12-31 23:59:59.027825 27
1999-12-31 23:59:59.028793 28
1999-12-31 23:59:59.030716 30
... ...
1999-12-31 23:59:59.975432 975
1999-12-31 23:59:59.976699 976
1999-12-31 23:59:59.977177 977
1999-12-31 23:59:59.979475 979
1999-12-31 23:59:59.980282 980
1999-12-31 23:59:59.980672 980
1999-12-31 23:59:59.983202 983
1999-12-31 23:59:59.984214 984
1999-12-31 23:59:59.984674 984
1999-12-31 23:59:59.984933 984
1999-12-31 23:59:59.985664 985
1999-12-31 23:59:59.985779 985
1999-12-31 23:59:59.988812 988
1999-12-31 23:59:59.989324 989
1999-12-31 23:59:59.990386 990
1999-12-31 23:59:59.990485 990
1999-12-31 23:59:59.990969 990
1999-12-31 23:59:59.991255 991
1999-12-31 23:59:59.991739 991
1999-12-31 23:59:59.993979 993
1999-12-31 23:59:59.994705 994
1999-12-31 23:59:59.994874 994
1999-12-31 23:59:59.995397 995
1999-12-31 23:59:59.995753 995
1999-12-31 23:59:59.995863 995
1999-12-31 23:59:59.996574 996
1999-12-31 23:59:59.998139 998
1999-12-31 23:59:59.998533 998
1999-12-31 23:59:59.998778 998
1999-12-31 23:59:59.999915 999
感谢@Jeff指出了更干净的方法:
In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df
Out[96]:
groupid
1999-12-31 23:59:59.000884 0
1999-12-31 23:59:59.001175 0
1999-12-31 23:59:59.001262 0
1999-12-31 23:59:59.001540 0
1999-12-31 23:59:59.001769 0
1999-12-31 23:59:59.002478 1
1999-12-31 23:59:59.005001 4
1999-12-31 23:59:59.005497 4
1999-12-31 23:59:59.006908 6
1999-12-31 23:59:59.008860 7
1999-12-31 23:59:59.009257 8
1999-12-31 23:59:59.010012 9
1999-12-31 23:59:59.011451 10
1999-12-31 23:59:59.013177 12
1999-12-31 23:59:59.014138 13
1999-12-31 23:59:59.015795 14
1999-12-31 23:59:59.015865 14
1999-12-31 23:59:59.016069 15
1999-12-31 23:59:59.016666 15
1999-12-31 23:59:59.016718 15
1999-12-31 23:59:59.019058 18
1999-12-31 23:59:59.019675 18
1999-12-31 23:59:59.020747 19
1999-12-31 23:59:59.021856 20
1999-12-31 23:59:59.022959 22
1999-12-31 23:59:59.023812 22
1999-12-31 23:59:59.023938 23
1999-12-31 23:59:59.024122 23
1999-12-31 23:59:59.025332 24
1999-12-31 23:59:59.025397 24
... ...
1999-12-31 23:59:59.959725 958
1999-12-31 23:59:59.959742 958
1999-12-31 23:59:59.959892 959
1999-12-31 23:59:59.960345 959
1999-12-31 23:59:59.960800 959
1999-12-31 23:59:59.961054 960
1999-12-31 23:59:59.962749 961
1999-12-31 23:59:59.965681 964
1999-12-31 23:59:59.966409 965
1999-12-31 23:59:59.966558 965
1999-12-31 23:59:59.967357 966
1999-12-31 23:59:59.967842 966
1999-12-31 23:59:59.970465 969
1999-12-31 23:59:59.974022 973
1999-12-31 23:59:59.974734 973
1999-12-31 23:59:59.975879 974
1999-12-31 23:59:59.978291 977
1999-12-31 23:59:59.980483 979
1999-12-31 23:59:59.980868 979
1999-12-31 23:59:59.981417 980
1999-12-31 23:59:59.984208 983
1999-12-31 23:59:59.984639 983
1999-12-31 23:59:59.985533 984
1999-12-31 23:59:59.986785 985
1999-12-31 23:59:59.987502 986
1999-12-31 23:59:59.987914 987
1999-12-31 23:59:59.988406 987
1999-12-31 23:59:59.989436 988
1999-12-31 23:59:59.994449 993
1999-12-31 23:59:59.996657 995
关于python - 在Pandas数据框中按时间范围对行进行分组,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/32089394/