I have two datasets that look like this:



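What I want to do is filter the non-trading days out of the "data" DataFrame. I assume this means comparing each row's data.index.date against the dates in trading_days: if there is a match, the row is kept; if there is no match, it is not a trading day and the row is dropped. That effectively filters the dataset down to trading days only.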

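However, checking row by row whether the two dates are equal (using apply() to return the row) seems inefficient. I suspect there is a more efficient way, and it matters because I will be doing this on a DataFrame with 180M rows.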

Is there some kind of "merge" or "join", for example:

data.join(trading_days)


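that would keep only the rows where data.index.date matches a trading day? I need to keep every minute-level record (as shown in the "data" DataFrame) and only filter out the non-trading dates. Thanks for your help!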

Updated to include the values (please let me know if there is a better way to paste them):

In[5]: data.head(30).values
Out[6]:
array([[ 438.9,  438.9,  438.9,  438.9,    0. ],
       [ 438.9,  438.9,  438.7,  438.7,   31. ],
       [ 438.6,  438.6,  438.6,  438.6,    7. ],
       [ 438.4,  438.7,  438.4,  438.4,    4. ],
       [ 438.4,  438.4,  438.3,  438.3,    4. ],
       [ 438.2,  438.2,  438.2,  438.2,    1. ],
       [ 438.2,  438.2,  438.2,  438.2,    0. ],
       [ 438.2,  438.2,  438.2,  438.2,    1. ],
       [ 438.2,  438.2,  438.2,  438.2,    0. ],
       [ 438.1,  438.1,  438.1,  438.1,    3. ],
       [ 438. ,  438. ,  437.9,  438. ,    6. ],
       [ 438. ,  438.2,  438. ,  438. ,    8. ],
       [ 438.2,  438.2,  438.1,  438.1,    6. ],
       [ 438.1,  438.1,  438.1,  438.1,    4. ],
       [ 438.1,  438.1,  438.1,  438.1,    0. ],
       [ 438.3,  438.3,  438.3,  438.3,    1. ],
       [ 438.3,  438.3,  438.3,  438.3,    0. ],
       [ 438.3,  438.3,  438.3,  438.3,    0. ],
       [ 438.1,  438.1,  438.1,  438.1,    1. ],
       [ 438. ,  438. ,  437.9,  437.9,   54. ],
       [ 437.8,  437.8,  437.8,  437.8,   10. ],
       [ 437.8,  437.8,  437.8,  437.8,    1. ],
       [ 437.8,  437.8,  437.8,  437.8,    6. ],
       [ 437.8,  437.8,  437.8,  437.8,    0. ],
       [ 437.9,  438. ,  437.9,  438. ,   12. ],
       [ 437.9,  438. ,  437.9,  438. ,    0. ],
       [ 437.9,  438. ,  437.9,  438. ,    0. ],
       [ 437.9,  438. ,  437.9,  438. ,    0. ],
       [ 437.9,  437.9,  437.9,  437.9,    1. ],
       [ 437.9,  437.9,  437.8,  437.8,    4. ]])


Here are the timestamps:

In[10]: data.head(30).index.values
Out[11]:
array(['2005-01-02T13:59:00.000000000-0500',
       '2005-01-02T14:00:00.000000000-0500',
       '2005-01-02T14:01:00.000000000-0500',
       '2005-01-02T14:02:00.000000000-0500',
       '2005-01-02T14:03:00.000000000-0500',
       '2005-01-02T14:04:00.000000000-0500',
       '2005-01-02T14:05:00.000000000-0500',
       '2005-01-02T14:06:00.000000000-0500',
       '2005-01-02T14:07:00.000000000-0500',
       '2005-01-02T14:08:00.000000000-0500',
       '2005-01-02T14:09:00.000000000-0500',
       '2005-01-02T14:10:00.000000000-0500',
       '2005-01-02T14:11:00.000000000-0500',
       '2005-01-02T14:12:00.000000000-0500',
       '2005-01-02T14:13:00.000000000-0500',
       '2005-01-02T14:14:00.000000000-0500',
       '2005-01-02T14:15:00.000000000-0500',
       '2005-01-02T14:16:00.000000000-0500',
       '2005-01-02T14:17:00.000000000-0500',
       '2005-01-02T14:18:00.000000000-0500',
       '2005-01-02T14:19:00.000000000-0500',
       '2005-01-02T14:20:00.000000000-0500',
       '2005-01-02T14:21:00.000000000-0500',
       '2005-01-02T14:22:00.000000000-0500',
       '2005-01-02T14:23:00.000000000-0500',
       '2005-01-02T14:24:00.000000000-0500',
       '2005-01-02T14:25:00.000000000-0500',
       '2005-01-02T14:26:00.000000000-0500',
       '2005-01-02T14:27:00.000000000-0500',
       '2005-01-02T14:28:00.000000000-0500'], dtype='datetime64[ns]')


And trading_days is loaded with read_csv from the file here: http://pastebin.com/5N01Gi5V

Second update:

Best answer

You can do this with a join, in two steps:


Add a days column to data that contains the date of each index entry, then merge on it:
pd.merge(days, data, on='days')


By default this performs an inner join, so the result contains only the rows of data whose date also appears in days.
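The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `data` and `trading_days` frames, the `close` column, the `timestamp` index name, and the `America/New_York` time zone are all made up for the example, and the trading-days frame is assumed to hold its dates in a column named `days`. The boolean-mask variant at the end is an alternative, not part of the answer above; it avoids building a Python date object per row, which may matter at 180M rows.

```python
import pandas as pd

# Hypothetical minute-level data (stand-in for the real `data` frame).
idx = pd.to_datetime([
    "2005-01-02 13:59", "2005-01-02 14:00",  # Sunday, not a trading day
    "2005-01-03 09:30", "2005-01-03 09:31",  # Monday
    "2005-01-04 09:30",                      # Tuesday
]).tz_localize("America/New_York")           # assumed time zone
data = pd.DataFrame({"close": [438.9, 438.7, 438.6, 438.4, 438.3]}, index=idx)
data.index.name = "timestamp"                # assumed index name

# Hypothetical trading-days frame with a `days` column of date objects.
trading_days = pd.DataFrame(
    {"days": pd.to_datetime(["2005-01-03", "2005-01-04"]).date}
)

# Step 1: add a `days` column holding the (local) date of each index entry.
data_with_days = data.assign(days=data.index.date).reset_index()

# Step 2: inner merge (the default), which keeps only matching dates.
merged = pd.merge(trading_days, data_with_days, on="days")
filtered = merged.set_index("timestamp").sort_index().drop(columns="days")

# Alternative (not from the answer): a vectorized mask on normalized timestamps.
day_index = pd.DatetimeIndex(pd.to_datetime(trading_days["days"]))
mask = data.index.tz_localize(None).normalize().isin(day_index)
filtered_alt = data[mask]

print(filtered)
```

Both `filtered` and `filtered_alt` keep every minute-level row that falls on a listed trading day and drop the rest.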

Regarding "python - Filtering a time series to trading days only with Python Pandas", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27138198/
