Question
I'm new to pandas.
I have a very simple dataframe named dlf, with an index and two columns, containing 40k rows. It is loaded like so:
d = pd.DataFrame.from_csv(csvsLocation + 'name.csv', index_col='ID', infer_datetime_format=True)
d['LAST'] = pd.to_datetime(d['LAST'], format = '%d-%b-%y')
d['FIRST'] = pd.to_datetime(d['FIRST'], format = '%d-%b-%y')
dlf = d[['LAST', 'FIRST']]
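(Note that pd.DataFrame.from_csv has since been deprecated and removed from pandas, so on a current version the same load goes through pd.read_csv. A rough equivalent sketch, keeping the column names and date format from the snippet above:)

import pandas as pd

# pd.DataFrame.from_csv was deprecated and later removed; pd.read_csv is the
# modern entry point. Column names and the date format follow the snippet above.
d = pd.read_csv(csvsLocation + 'name.csv', index_col='ID')
d['LAST'] = pd.to_datetime(d['LAST'], format='%d-%b-%y')
d['FIRST'] = pd.to_datetime(d['FIRST'], format='%d-%b-%y')
dlf = d[['LAST', 'FIRST']]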
It looks like this:
LAST FIRST
ID
1 1997-04-17 1991-10-04
3 2009-02-13 1988-07-07
5 2009-10-24 1995-12-06
6 1996-04-31 1989-03-14
Running this apply call takes 5 seconds:
year = 1997
dlf[str(year)] = dlf.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
I need this sped up because I intend to run it hundreds of times.
I suspect the issue is in using lambda.
What have I done wrong, and/or how can I speed it up?
Answer
Solution
You can access the year directly via dt.year on both date columns:
year = 1999
df[str(year)] = 1 * ((df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year))
print(df)
Output:
LAST FIRST 1999
ID
1 1997-04-17 1991-10-14 0
3 2009-02-13 1988-07-07 1
5 2009-10-24 1995-10-06 1
6 1996-04-30 1969-03-14 0
You can also keep the boolean values as the result:
df[str(year)] = (df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year)
print(df)
Output:
LAST FIRST 1999
ID
1 1997-04-17 1991-10-14 False
3 2009-02-13 1988-07-07 True
5 2009-10-24 1995-10-06 True
6 1996-04-30 1969-03-14 False
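Since the question mentions running this hundreds of times (once per year), the same vectorized comparison can simply be wrapped in a loop over years. A minimal sketch, assuming the df from the examples above and an illustrative year range, which also computes dt.year only once up front:

# Illustrative year range; adjust to your data. Hoisting the .dt.year
# conversions out of the loop avoids repeating that work on every iteration.
first_year = df['FIRST'].dt.year
last_year = df['LAST'].dt.year
for year in range(1950, 2016):
    df[str(year)] = ((first_year <= year) & (last_year >= year)).astype(int)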
Performance
Measuring performance is always fun, but it can be tricky. If we use only our tiny example dataframe with 4 rows, things actually get a bit slower, because the fixed per-call overhead of the vectorized operations dominates at this size:
%timeit dlf[str(year)] = dlf.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
1000 loops, best of 3: 1.27 ms per loop
%timeit df[str(year)] = 1 * ((df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year))
100 loops, best of 3: 1.7 ms per loop
But let's have a look at 40k rows:
big = pd.concat([df] * 10000)
>>> big.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 1 to 6
Data columns (total 4 columns):
LAST 40000 non-null datetime64[ns]
FIRST 40000 non-null datetime64[ns]
1999 40000 non-null bool
1997 40000 non-null int64
dtypes: bool(1), datetime64[ns](2), int64(1)
memory usage: 1.3 MB
Now we can see a significant speedup:
%timeit big[str(year)] = big.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
1 loops, best of 3: 6.51 s per loop
%timeit big[str(year)] = 1 * ((big['FIRST'].dt.year <= year) & (big['LAST'].dt.year >= year))
100 loops, best of 3: 8.33 ms per loop
That is roughly 780 times faster.
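For reference, the same kind of measurement can be reproduced outside IPython with the standard timeit module. A minimal, self-contained sketch that rebuilds the example data from the tables above and times the vectorized expression:

import timeit

import pandas as pd

# Rebuild the 4-row example and tile it to 40,000 rows, as in the answer.
df = pd.DataFrame(
    {'LAST': pd.to_datetime(['1997-04-17', '2009-02-13', '2009-10-24', '1996-04-30']),
     'FIRST': pd.to_datetime(['1991-10-14', '1988-07-07', '1995-10-06', '1969-03-14'])},
    index=pd.Index([1, 3, 5, 6], name='ID'))
big = pd.concat([df] * 10000)
year = 1999

# Average seconds per call for the vectorized expression over 100 runs.
seconds = timeit.timeit(
    lambda: 1 * ((big['FIRST'].dt.year <= year) & (big['LAST'].dt.year >= year)),
    number=100) / 100
print(seconds)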