Problem description
I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time-stamped data points (the value column).
import pandas as pd
import numpy as np

data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})
The other has time interval information (start_time, end_time, and an associated interval_id).
intervals = pd.DataFrame({'interval_id': np.arange(9),
                          'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
                          'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})
I'd like to merge these two dataframes more efficiently than the for loop below:
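data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    # Rows whose time falls inside the current interval (both ends inclusive).
    in_interval = ((data['time'] >= ser['start_time']) &
                   (data['time'] <= ser['end_time']))
    data.loc[in_interval, 'interval_id'] = ser['interval_id']
# sort_values replaces DataFrame.sort, which newer pandas versions no longer provide.
result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)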
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more Pythonic (pandas-y?) than the above.
Example result:
time value interval_id start_time end_time
0 0.575976 0.022727 NaN NaN NaN
1 4.607545 0.222568 0 3.618715 8.294847
2 5.179350 0.438052 0 3.618715 8.294847
3 11.069956 0.641269 1 10.301728 19.870283
4 12.387854 0.344192 1 10.301728 19.870283
5 18.889691 0.582946 1 10.301728 19.870283
6 20.850469 -0.027436 NaN NaN NaN
7 23.199618 0.731316 2 21.488868 28.968338
8 26.631284 0.570647 2 21.488868 28.968338
9 26.996397 0.597035 2 21.488868 28.968338
10 28.601867 -0.131712 2 21.488868 28.968338
11 28.660986 0.710856 2 21.488868 28.968338
12 28.875395 -0.355208 2 21.488868 28.968338
13 28.959320 -0.430759 2 21.488868 28.968338
14 29.702800 -0.554742 NaN NaN NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds long). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
Recommended answer
You could use np.searchsorted to find the indices at which each value in data['time'] would be inserted into intervals['start_time']. Then you could call np.searchsorted again to find the indices at which each value in data['time'] would be inserted into intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.
For each corresponding location in the arrays, where these two indices are equal (after subtracting 1 from the start_time index), the value of data['time'] lies between intervals['start_time'] and intervals['end_time'] for that interval. Note that this relies on the intervals being disjoint.
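The index comparison can be traced by hand on a small case. The sketch below uses made-up toy arrays (starts, ends, and times are hypothetical names, not part of the question's data) with three disjoint, sorted intervals of irregular length. It passes side='right' for the start lookup so that a time exactly equal to a start counts as inside; that is an assumption made here to mirror the inclusive bounds of the loop.

```python
import numpy as np

# Hypothetical toy data: three disjoint, sorted intervals of irregular length.
starts = np.array([0.0, 10.0, 12.0])
ends = np.array([5.0, 11.0, 50.0])
times = np.array([-1.0, 3.0, 7.0, 10.5, 30.0, 60.0])

# Position of the last interval whose start is <= t (-1 if there is none).
start_idx = np.searchsorted(starts, times, side='right') - 1
# Position of the first interval whose end is >= t (len(ends) if there is none).
end_idx = np.searchsorted(ends, times, side='left')

# A time lies inside an interval exactly when both lookups agree.
inside = start_idx == end_idx
interval_of = np.where(inside, start_idx, -1)  # -1 marks "in no interval"

print(start_idx)    # [-1  0  0  1  2  2]
print(end_idx)      # [0 0 1 1 2 3]
print(interval_of)  # [-1  0 -1  1  2 -1]
```

The time 7.0 shows why both lookups are needed: the last start before it belongs to interval 0, but the first end after it belongs to interval 1, so it falls in a gap.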
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np

np.random.seed(1)
data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})
intervals = pd.DataFrame(
    {'interval_id': np.arange(9),
     'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
     'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = ((data['time'] >= ser['start_time']) &
                       (data['time'] <= ser['end_time']))
        data.loc[in_interval, 'interval_id'] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
    return result

def using_searchsorted():
    # Position of the last start_time before each time (-1 if there is none).
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values) - 1
    # Position of the first end_time at or after each time.
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
    # Equal positions mean the time falls inside that interval.
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    # In this example interval_id equals the row position in intervals.
    result.loc[mask, 'interval_id'] = start_idx[mask]
    result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
    result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
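For comparison, newer pandas versions provide pd.merge_asof, which can express a similar lookup without manual index arithmetic. This is a different technique from the searchsorted approach above, shown only as a rough sketch on made-up toy frames; it assumes both frames are sorted by their key and that the intervals are disjoint. interval_id is stored as float here so that unmatched rows can hold NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, both sorted by their time key.
data = pd.DataFrame({'time': [1.0, 4.0, 7.0, 10.5, 30.0, 60.0],
                     'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
intervals = pd.DataFrame({'interval_id': [0.0, 1.0, 2.0],
                          'start_time': [3.0, 10.0, 12.0],
                          'end_time': [5.0, 11.0, 50.0]})

# For each time, attach the interval with the latest start_time <= time.
merged = pd.merge_asof(data, intervals,
                       left_on='time', right_on='start_time',
                       direction='backward')

# Blank out rows that fall after the matched interval's end
# (or that had no earlier start at all, where end_time is NaN).
outside = ~(merged['time'] <= merged['end_time'])
merged.loc[outside, ['interval_id', 'start_time', 'end_time']] = np.nan

print(merged)
```

Whether this is faster than searchsorted on a given dataset would need to be timed; no timing is claimed here.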