Question
I've recently stumbled upon pendulum, an awesome new library that makes working with datetimes easier.
In pandas, there is the handy to_datetime() method, which converts Series and other objects to datetimes:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
What would be the canonical way to create a custom to_<something> method? In this case, that would be a to_pendulum() method able to convert a Series of date strings directly to Pendulum objects.
This could give a Series various interesting capabilities, for instance converting a series of date strings into a series of "offsets from now" (human-readable datetime diffs).
Recommended answer
After looking through the API a bit, I must say I'm impressed with what they've done. Unfortunately, I don't think Pendulum and pandas can work together (at least not with the current latest version, v0.21).
The most important reason is that pandas does not natively support Pendulum as a datatype. The natively supported datatypes (np.int, np.float and np.datetime64) all support vectorisation in some form. With Pendulum objects, you are not going to get a shred of performance improvement from using a dataframe over, say, a vanilla loop and list. If anything, calling apply on a Series of Pendulum objects is going to be slower (because of all the API overheads).
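To make the vectorisation point concrete, here is a minimal sketch that uses only pandas. It runs the same operation on a native datetime64 column and on an object column holding the same values. The sample data and variable names are made up for illustration, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# Made-up sample data: 2,000 date strings.
strings = pd.Series(["2017-11-09 18:43:45", "2017-11-10 00:09:40"] * 1000)

# Native datetime64 column: operations run in vectorised, compiled code.
native = pd.to_datetime(strings)

# The same values boxed as Python objects: pandas has to visit them one by one.
boxed = native.astype(object)

vectorised_hours = native.dt.hour
looped_hours = boxed.apply(lambda ts: ts.hour)

# Both approaches produce the same values; only the speed differs.
print((vectorised_hours == looped_hours).all())

# Rough timings, printed rather than quoted because they depend on the machine.
print("vectorised:", timeit.timeit(lambda: native.dt.hour, number=20))
print("apply loop:", timeit.timeit(lambda: boxed.apply(lambda ts: ts.hour), number=20))
```

A column of Pendulum objects would sit in the same position as boxed here: an object column whose elements are handled one at a time.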
Another reason is that Pendulum is a subclass of datetime:
import pendulum
from datetime import datetime
isinstance(pendulum.now(), datetime)
True
This is important because, as mentioned above, datetime is a supported datatype, so pandas will attempt to coerce datetime objects to its native datetime format, Timestamp. Here's an example.
print(s)
0 2017-11-09 18:43:45
1 2017-11-09 20:15:27
2 2017-11-09 22:29:00
3 2017-11-09 23:42:34
4 2017-11-10 00:09:40
5 2017-11-10 00:23:14
6 2017-11-10 03:32:17
7 2017-11-10 10:59:24
8 2017-11-10 11:12:59
9 2017-11-10 13:49:09
s = s.apply(pendulum.parse)
s
0 2017-11-09 18:43:45+00:00
1 2017-11-09 20:15:27+00:00
2 2017-11-09 22:29:00+00:00
3 2017-11-09 23:42:34+00:00
4 2017-11-10 00:09:40+00:00
5 2017-11-10 00:23:14+00:00
6 2017-11-10 03:32:17+00:00
7 2017-11-10 10:59:24+00:00
8 2017-11-10 11:12:59+00:00
9 2017-11-10 13:49:09+00:00
Name: timestamp, dtype: datetime64[ns, <TimezoneInfo [UTC, GMT, +00:00:00, STD]>]
s[0]
Timestamp('2017-11-09 18:43:45+0000', tz='<TimezoneInfo [UTC, GMT, +00:00:00, STD]>')
type(s[0])
pandas._libs.tslib.Timestamp
So, with some difficulty (involving dtype=object), you could load Pendulum objects into dataframes. Here's how you'd do that:
# s is assumed to be the original Series of date strings here
v = np.vectorize(pendulum.parse)
s = pd.Series(v(s), dtype=object)
s
0 2017-11-09T18:43:45+00:00
1 2017-11-09T20:15:27+00:00
2 2017-11-09T22:29:00+00:00
3 2017-11-09T23:42:34+00:00
4 2017-11-10T00:09:40+00:00
5 2017-11-10T00:23:14+00:00
6 2017-11-10T03:32:17+00:00
7 2017-11-10T10:59:24+00:00
8 2017-11-10T11:12:59+00:00
9 2017-11-10T13:49:09+00:00
s[0]
<Pendulum [2017-11-09T18:43:45+00:00]>
However, this is essentially useless, because calling any pendulum method (via apply) will not only be super slow, but the result will also be coerced to Timestamp again, which makes the whole thing an exercise in futility.
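The round trip can be reproduced without pendulum installed. In the sketch below, Stamp is a hypothetical stand-in for a Pendulum object (it assumes any datetime subclass is treated the same way), and map_as_objects is a made-up helper name, not a pandas or pendulum API. With pendulum available, passing pendulum.parse as the function would presumably give something close to the to_pendulum() the question asks for.

```python
from datetime import datetime, timedelta

import pandas as pd


class Stamp(datetime):
    """Hypothetical stand-in for Pendulum: a plain datetime subclass."""


def map_as_objects(series, func):
    """Apply `func` to each element and keep the raw Python objects.

    dtype=object stops pandas from converting the results to Timestamp.
    """
    results = [func(value) for value in series]
    return pd.Series(results, index=series.index, name=series.name, dtype=object)


raw = pd.Series(["2017-11-09T18:43:45", "2017-11-10T00:09:40"], name="timestamp")

# Step 1: the object column really does hold the custom type.
converted = map_as_objects(raw, Stamp.fromisoformat)
print(converted.dtype, type(converted.iloc[0]))

# Step 2: apply() returns datetime-like values, so pandas infers a datetime
# dtype for the result and the elements come back as Timestamp.
shifted = converted.apply(lambda d: d + timedelta(days=1))
print(shifted.dtype, type(shifted.iloc[0]))

# Step 3: keeping the custom type means re-wrapping every result by hand.
kept = map_as_objects(converted, lambda d: d + timedelta(days=1))
print(kept.dtype, type(kept.iloc[0]))
```

Every operation that should preserve the custom type has to go through a wrapper like this, so the column gains nothing from pandas beyond its index.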