I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to use apply after a df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.
Here is what I found:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)
What do the objects look like:
print "
df =
", df
print "
X =
", X
print "
df.shape =", df.shape, ", X.shape =", X.shape
df =
A B
0 0.44 0.41
1 0.46 0.47
2 0.46 0.02
3 0.85 0.82
4 0.78 0.76
X =
[[ 0.93]
[ 0.83]]
df.shape = (5, 2) , X.shape = (2L, 1L)
Matrix multiplication behaves normally:
df.values.dot(X)
array([[ 0.7495],
[ 0.8179],
[ 0.4444],
[ 1.4711],
[ 1.3562]])
Using apply to perform a row-by-row dot product behaves as expected:
df.apply(lambda x: x.values.dot(X)[0], axis=1)
0 0.7495
1 0.8179
2 0.4444
3 1.4711
4 1.3562
dtype: float64
Groupby -> Apply behaves as I'd expect:
df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])
0 0.7495
1 0.8179
2 0.4444
3 1.4711
4 1.3562
dtype: float64
But when I run:
df.rolling(1).apply(lambda x: x.values.dot(X))
I get:
AttributeError: 'numpy.ndarray' object has no attribute 'values'
Ok, so pandas is using straight ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:
df.rolling(1).apply(lambda x: x.dot(X))
ValueError: shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)
Wait! What?!
So I created a custom function to look at what rolling is doing.
def print_type_sum(x):
print type(x), x.shape
return x.sum()
Then ran:
print df.rolling(1).apply(print_type_sum)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
A B
0 0.44 0.41
1 0.46 0.47
2 0.46 0.02
3 0.85 0.82
4 0.78 0.76
My resulting pd.DataFrame is the same, that's good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?
print df.rolling(2).apply(print_type_sum)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
A B
0 NaN NaN
1 0.90 0.88
2 0.92 0.49
3 1.31 0.84
4 1.63 1.58
Same thing, expected output, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).
The question is: why?
I now don't have a way to easily run a rolling multi-factor regression.
Using the strided views concept on the dataframe, here's a vectorized approach -
get_sliding_window(df, 2).dot(X) # window size = 2
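The get_sliding_window helper isn't defined in this excerpt. A minimal sketch consistent with how it's used here, assuming it builds a zero-copy strided view of shape (num_windows, window, n_cols) with NumPy's as_strided:

import numpy as np

def get_sliding_window(df, W):
    # View of df's values with shape (num_windows, W, n_cols): block i holds
    # rows i .. i+W-1 of the frame. No data is copied; only strides change.
    a = df.values
    s0, s1 = a.strides
    m, n = a.shape
    return np.lib.stride_tricks.as_strided(
        a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

With X of shape (2,), get_sliding_window(df, 2).dot(X) contracts over the column axis and returns, for every window, the dot product of each of its rows with X, which is what the verification output below shows.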
Runtime test -
In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
In [102]: X = np.array([2, 3])
In [103]: rolled_df = roll(df, 2)
In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop
In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop
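The roll helper used in In [103] isn't shown either. A sketch that matches how it's consumed above, assuming it returns a groupby whose groups are (window, n_cols) DataFrames keyed by the label at the end of each window:

def roll(df, w):
    # Stack every length-w block of rows under the label of the block's last
    # row, then group on that label so apply() sees one (w, n_cols) DataFrame
    # per window.
    pieces = {df.index[i + w - 1]: df.iloc[i:i + w]
              for i in range(len(df) - w + 1)}
    return pd.concat(pieces).groupby(level=0)

This is only a plausible reconstruction for the timing comparison; the implementation timed in the original answer may differ.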
Verify results -
In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]:
0 1
1 2.70 4.09
2 4.09 2.52
3 2.52 1.78
4 1.78 3.50
In [107]: get_sliding_window(df, 2).dot(X)
Out[107]:
array([[ 2.7 , 4.09],
[ 4.09, 2.52],
[ 2.52, 1.78],
[ 1.78, 3.5 ]])
Huge improvement there, which I hope will stay noticeable on larger arrays!
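As a side note (not part of the original answer): NumPy 1.20+ ships numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of windowed view without manual stride arithmetic. It appends the window axis last, so one moveaxis lines it up with the layout used above:

# Same (num_windows, window, n_cols) blocks via the built-in helper
# (NumPy >= 1.20); then contract over the column axis as before.
windows = np.moveaxis(
    np.lib.stride_tricks.sliding_window_view(df.values, 2, axis=0), -1, 1)
windows.dot(X)   # identical result to get_sliding_window(df, 2).dot(X)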