Problem description
I am trying to do the following, but it seems that vectorized operations are not supported in this mode.
import pandas as pd
df=pd.DataFrame([[2017,1,15,1],
[2017,1,15,2],
[2017,1,15,3],
[2017,1,15,4],
[2017,1,15,5],
[2017,1,15,6],
[2017,1,15,7]],
columns=['year','month','day','month_offset'])
# Build the date column from the year/month/day columns
df['date']=pd.to_datetime(df[['year','month','day']])
df['offset']=df.apply(lambda g: pd.offsets.MonthEnd(g.month_offset),axis=1)
df['date_offset']=df.date+df.offset
The last statement in the code snippet returns a warning that the operation is not vectorized.
I would like this to work as a vectorized operation because of the performance benefits.
Thanks.
To finish, here is a comparison of the methods, following on from @john-zwinck's answer:
import time
import pandas as pd
import numpy as np
df=pd.DataFrame([[2017,1,1,1],
[2017,1,1,2],
[2017,1,1,3],
[2017,1,1,4],
[2017,1,1,5],
[2017,1,1,6],
[2017,1,1,7]],
columns=['year','month','day','month_offset'])
# Build the date column from the year/month/day columns
df['mydate']=pd.to_datetime(df[['year','month','day']])
start_time=time.time()
df['pandas_offset']=df.apply(lambda g: g.mydate +
pd.offsets.MonthEnd(g.month_offset),axis=1)
end_time=time.time()
print('Method 1 {} seconds'.format(end_time-start_time))
start_time=time.time()
# Truncate to months, add the offsets, then step back one day to the month end
df['numpy_offset']=((df.mydate.values.astype('M8[M]')+
df.month_offset.values * np.timedelta64(1, 'M')).astype('M8[D]') -
np.timedelta64(1, 'D'))
end_time=time.time()
print('Method 2 with numpy vectorization {} seconds'.format(end_time-start_time))
Results:
index  year  month  day  month_offset  mydate      offset1     final
0      2017  1      1    1             2017-01-01  2017-01-31  2017-01-31
1      2017  1      1    2             2017-01-01  2017-02-28  2017-02-28
2      2017  1      1    3             2017-01-01  2017-03-31  2017-03-31
3      2017  1      1    4             2017-01-01  2017-04-30  2017-04-30
4      2017  1      1    5             2017-01-01  2017-05-31  2017-05-31
5      2017  1      1    6             2017-01-01  2017-06-30  2017-06-30
6      2017  1      1    7             2017-01-01  2017-07-31  2017-07-31
runfile('C:/bitbucket/test/vector_dates.py', wdir='C:/bitbucket/test')
Method 1 0.003999948501586914 seconds
Method 2 with numpy vectorization 0.0009999275207519531 seconds
NumPy is clearly much faster: about 0.001 seconds versus 0.004 seconds for apply, even on these seven rows.
Recommended answer
A truly vectorized way to do this is to truncate the dates to month precision, construct an array of numpy.timedelta64 from month_offset, add it to the array of dates, convert the result back to day precision, and then subtract numpy.timedelta64(1, 'D') to go back to the last day of the previous month.
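The steps above can be sketched as a small NumPy-only function. This is a minimal sketch, not the answerer's exact code: the name month_end_offset is hypothetical, and the inputs are assumed to be a datetime64 array and an array of positive integer month counts.

```python
import numpy as np

def month_end_offset(dates, month_offsets):
    """Hypothetical helper: month end reached after moving forward by month_offsets."""
    # Truncate each date to month precision, e.g. 2017-01-15 -> 2017-01.
    months = np.asarray(dates).astype("datetime64[M]")
    # Turn the integer offsets into month-sized timedeltas, elementwise.
    steps = np.asarray(month_offsets).astype("int64") * np.timedelta64(1, "M")
    # Move forward and take the first day of the resulting month...
    first_days = (months + steps).astype("datetime64[D]")
    # ...then step back one day to the last day of the month before it.
    return first_days - np.timedelta64(1, "D")

dates = np.array(["2017-01-15", "2017-01-15", "2017-01-15"], dtype="datetime64[D]")
offsets = np.array([1, 2, 3])
print(month_end_offset(dates, offsets))
```

With a DataFrame like the one in the question, the inputs would be df.mydate.values and df.month_offset.values. The sketch is only meant for positive offsets on dates that are not already month ends. A date that already falls on a month end would give a different result from pd.offsets.MonthEnd(1), which moves to the following month end, so other cases are worth checking against pandas.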
Solutions using apply(lambda) are likely to be much slower. As the warning says, some pandas date offset operations are not vectorized, so if your data is large it is better to avoid them. NumPy facilities such as busday_offset() and timedelta64 are fully vectorized and perform well.
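As an illustration of the busday_offset() facility mentioned above, here is a short sketch that shifts an array of dates by a different number of business days per element. The sample dates and offsets are made up for the example, and the default Monday-to-Friday week with no holidays is assumed.

```python
import numpy as np

# 2017-01-13 is a Friday, 2017-01-14 a Saturday, 2017-01-16 a Monday.
dates = np.array(["2017-01-13", "2017-01-14", "2017-01-16"], dtype="datetime64[D]")
offsets = np.array([1, 1, 5])

# roll="forward" first moves a non-business day to the next business day,
# then the per-element offset is applied in a single vectorized call.
shifted = np.busday_offset(dates, offsets, roll="forward")
print(shifted)
```

Each element gets its own offset without a Python-level loop, which is the same property that makes the timedelta64 approach fast.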