Problem description
I have a time-series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems to be inelegant enough to make me think I'm doing something wrong.
I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.
In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print(df['2008-02-27':'2008-03-02'])
Out[2]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-02-30 0
2008-02-31 0
2008-03-01 0
2008-03-02 17
In [3]: import datetime
   ...:
   ...: def clean_timestamps(df):
   ...:     # remove invalid dates like '2008-02-30' and '2009-04-31'
   ...:     to_drop = list()
   ...:     for d in df.index:
   ...:         try:
   ...:             datetime.date(int(d[0:4]),int(d[5:7]),int(d[8:10]))
   ...:         except ValueError:
   ...:             to_drop.append(d)
   ...:     df2 = df.drop(to_drop,axis=0)
   ...:     return df2
In [4]: df2 = clean_timestamps(df)
In [5]: print(df2['2008-02-27':'2008-03-02'])
Out[5]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
This new index is still recognized only as an 'object' dtype rather than a DatetimeIndex.
In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
2012-11-29], dtype=object)
Reindexing against a date range produces NaNs because the two indexes have different dtypes: the string labels in the old index never match the Timestamp labels in the new one.
In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
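The mismatch can be reproduced without the original file. The following is a minimal sketch, assuming a made-up three-row frame in place of data.csv (the values and the name all_missing are illustrative only):

```python
import pandas as pd

# Hypothetical three-row frame indexed by date strings (not the real data.csv).
df2 = pd.DataFrame({'count': [20, 0, 27]},
                   index=['2008-02-27', '2008-02-28', '2008-02-29'])

# A DatetimeIndex covering the same three days.
i = pd.date_range(start='2008-02-27', end='2008-02-29')

# No string label equals any Timestamp label, so nothing lines up.
df3 = df2.reindex(index=i, columns=['count'])
all_missing = df3['count'].isna().all()
print(all_missing)
```

Every row of df3 is missing here even though the dates look identical when printed, which is the same symptom as in Out[9].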
I create a fresh dataframe with the appropriate index, drop the data to a dictionary, then populate the new dataframe based on the dictionary values (skipping missing values).
In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
    ...:     try:
    ...:         df3.set_value(index=d,col='count',value=values[d.isoformat()[0:10]])
    ...:     except KeyError:
    ...:         pass
In [13]: print(df3['2008-02-27':'2008-03-02'])
Out[13]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None
This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.
Recommended answer
You could use pd.to_datetime:
In [1]: import pandas as pd
In [2]: pd.to_datetime('2008-02-27')
Out[2]: datetime.datetime(2008, 2, 27, 0, 0)
This allows you to "clean" the index (or similarly a column) by applying it to the Series:
df.index = pd.to_datetime(df.index)
or
df['date_col'] = df['date_col'].apply(pd.to_datetime)
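One caveat for data like the question's: by default pd.to_datetime raises on impossible dates such as '2008-02-30'. Passing errors='coerce' turns them into NaT instead, so they can be dropped in the same step. The following is a rough sketch under that assumption, using a made-up frame in place of data.csv (the names parsed, valid, clean and full are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for data.csv: string index with two impossible dates.
df = pd.DataFrame(
    {'count': [20, 0, 27, 0, 0, 0, 17]},
    index=['2008-02-27', '2008-02-28', '2008-02-29', '2008-02-30',
           '2008-02-31', '2008-03-01', '2008-03-02'],
)

# errors='coerce' yields NaT for labels that are not real dates.
parsed = pd.to_datetime(df.index, format='%Y-%m-%d', errors='coerce')

# Keep only rows whose label parsed, then install the DatetimeIndex.
valid = ~parsed.isna()
clean = df[valid].copy()
clean.index = parsed[valid]

# Reindex onto a complete daily range; any day absent from the data
# would show up as NaN (this small sample happens to have no gaps).
full = clean.reindex(pd.date_range(clean.index.min(), clean.index.max(), freq='D'))
print(full)
```

Because both indexes are now datetime-typed, the reindex lines up on real dates, and neither the hand-written validation loop nor the dictionary lookup from the question should be needed.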