Problem description
I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to call apply after df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.
Here is what I found:
import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)
What do objects look like:
print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape

df =
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X =
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)
Matrix multiplication behaves normally:
df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])
Using apply to perform row by row dot product behaves as expected:
df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64
Groupby -> Apply behaves as I'd expect:
df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64
But when I run:
df.rolling(1).apply(lambda x: x.values.dot(X))
I get:
Ok, so pandas is using straight ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:
df.rolling(1).apply(lambda x: x.dot(X))
Wait! What?!
So I created a custom function to look at what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()
Then ran:
print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76
My resulting pd.DataFrame is the same, that's good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?
print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58
Same thing: the expected output, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).
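For readers on Python 3 and a recent pandas, the same observation can be reproduced with the sketch below. It is not part of the original session: it assumes a pandas version where rolling(...).apply accepts raw=True (so each window arrives as a NumPy array), and the name record_and_sum is made up for illustration. The data frame is rebuilt from the values printed above rather than from the random seed.

```python
import numpy as np
import pandas as pd

# Same values as the df printed earlier in the post.
df = pd.DataFrame(
    {"A": [0.44, 0.46, 0.46, 0.85, 0.78],
     "B": [0.41, 0.47, 0.02, 0.82, 0.76]}
)

seen = []  # (type name, shape) of every window handed to the callback


def record_and_sum(window):
    # Hypothetical helper: remember what rolling passed in, then reduce it.
    seen.append((type(window).__name__, window.shape))
    return window.sum()


# raw=True (an assumption about the installed pandas version) asks pandas
# to pass each window as a plain NumPy array instead of a Series.
result = df.rolling(2).apply(record_and_sum, raw=True)

for kind, shape in seen:
    print(kind, shape)
print(result)
```

Every recorded shape is one-dimensional with length equal to the window, which matches the point made above: the callback is invoked column by column and never sees a (window, len(df.columns)) block.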
The question is: why?
I now don't have a way to easily run a rolling multi-factor regression.
Using the strided-views concept on a dataframe, here's a vectorized approach -
get_sliding_window(df, 2).dot(X) # window size = 2
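The definition of get_sliding_window is not included in this post. As a rough sketch of what such a helper might look like, the version below is a hypothetical stand-in rather than the answerer's own code. It assumes NumPy 1.20 or newer for numpy.lib.stride_tricks.sliding_window_view and returns a read-only view of shape (n_rows - window + 1, window, n_cols), so that .dot(X) contracts over the column axis.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def get_sliding_window(df, window):
    """Hypothetical stand-in for the helper used in the answer.

    Returns a read-only view of shape (n_rows - window + 1, window, n_cols).
    """
    values = df.to_numpy()
    # sliding_window_view puts the window axis last, giving
    # (n_windows, n_cols, window); swap the last two axes so each
    # window is a (window, n_cols) block of consecutive rows.
    return sliding_window_view(values, window, axis=0).swapaxes(1, 2)


# Same values as the df and X printed in the question.
df = pd.DataFrame(
    {"A": [0.44, 0.46, 0.46, 0.85, 0.78],
     "B": [0.41, 0.47, 0.02, 0.82, 0.76]}
)
X = np.array([[0.93], [0.83]])

windows = get_sliding_window(df, 2)
print(windows.shape)           # (4, 2, 2)
print(windows.dot(X)[..., 0])  # one row of dot products per window
```

Each row of the printed result holds the dot products of the rows inside one window, so the values match the df.values.dot(X) output from the question, paired up window by window. Because the windows are views, no per-window copy of the data is made before the multiplication.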
Runtime test -
In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

In [102]: X = np.array([2, 3])

In [103]: rolled_df = roll(df, 2)

In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop

In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop
Verify results -
In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]:
      0     1
1  2.70  4.09
2  4.09  2.52
3  2.52  1.78
4  1.78  3.50

In [107]: get_sliding_window(df, 2).dot(X)
Out[107]:
array([[ 2.7 ,  4.09],
       [ 4.09,  2.52],
       [ 2.52,  1.78],
       [ 1.78,  3.5 ]])
Huge improvement there, which I am hoping would stay noticeable on larger arrays!