Why does pandas rolling use a single-dimension ndarray


Problem description


I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected to be able to use apply after a df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.
Here is what I found:
    import pandas as pd
    import numpy as np

    np.random.seed([3,1415])
    df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
    X = np.random.rand(2, 1).round(2)
What the objects look like:
    print "\ndf = \n", df
    print "\nX = \n", X
    print "\ndf.shape =", df.shape, ", X.shape =", X.shape

    df =
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76

    X =
    [[ 0.93]
     [ 0.83]]

    df.shape = (5, 2) , X.shape = (2L, 1L)
Matrix multiplication behaves normally:
    df.values.dot(X)

    array([[ 0.7495],
           [ 0.8179],
           [ 0.4444],
           [ 1.4711],
           [ 1.3562]])
Using apply to perform row by row dot product behaves as expected:
    df.apply(lambda x: x.values.dot(X)[0], axis=1)

    0    0.7495
    1    0.8179
    2    0.4444
    3    1.4711
    4    1.3562
    dtype: float64
Groupby -> Apply behaves as I'd expect:
    df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

    0    0.7495
    1    0.8179
    2    0.4444
    3    1.4711
    4    1.3562
    dtype: float64
But when I run:
    df.rolling(1).apply(lambda x: x.values.dot(X))
I get:
OK, so pandas is using a plain ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:
    df.rolling(1).apply(lambda x: x.dot(X))
Wait! What?!
So I created a custom function to look at what rolling is doing.
    def print_type_sum(x):
        # Show what rolling hands to the function, then return a scalar
        print type(x), x.shape
        return x.sum()
Then ran:
    print df.rolling(1).apply(print_type_sum)

    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76
My resulting pd.DataFrame is the same, which is good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?
    print df.rolling(2).apply(print_type_sum)

    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
          A     B
    0   NaN   NaN
    1  0.90  0.88
    2  0.92  0.49
    3  1.31  0.84
    4  1.63  1.58
Same thing: the output is what I expected, but it printed 8 ndarray objects. rolling produces a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).
The question is: why?
I now don't have a way to easily run a rolling multi-factor regression.
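The per-column behaviour described in the question can be reproduced in Python 3 with a recent pandas. The sketch below is an illustration, not part of the original post: it assumes a pandas version in which rolling(...).apply accepts raw=True (which passes each window to the function as an ndarray), and the helper name record_and_sum is made up for this example. The frame is rebuilt from the values printed in the question.

```python
import numpy as np
import pandas as pd

# Same values as the df printed in the question
df = pd.DataFrame({
    "A": [0.44, 0.46, 0.46, 0.85, 0.78],
    "B": [0.41, 0.47, 0.02, 0.82, 0.76],
})

seen = []  # (type name, shape) of every object handed to the function


def record_and_sum(x):
    # Hypothetical helper: remember what rolling passed in, return a scalar
    seen.append((type(x).__name__, x.shape))
    return x.sum()


# raw=True asks pandas to pass plain ndarrays rather than Series
out = df.rolling(2).apply(record_and_sum, raw=True)

for type_name, shape in seen:
    print(type_name, shape)
print(out)
```

Every recorded shape is one-dimensional with length equal to the window, because the function is applied to one column at a time and must return a single scalar per window.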
Solution
Using the strided-views concept on a dataframe, here's a vectorized approach -
    get_sliding_window(df, 2).dot(X) # window size = 2
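get_sliding_window is not defined in this text. As a minimal sketch of what such a helper might look like, assuming NumPy 1.20 or later (for sliding_window_view), the following builds a 3-D view of shape (rows - window + 1, window, columns). This is one possible implementation and not necessarily the one the answer used; the sample frame and X are taken from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def get_sliding_window(df, window):
    """Assumed implementation: return all full windows as a 3-D array."""
    a = df.to_numpy()
    # sliding_window_view puts the window axis last:
    # shape (rows - window + 1, columns, window)
    view = sliding_window_view(a, window_shape=window, axis=0)
    # Reorder to (rows - window + 1, window, columns) so that each
    # slice view[i] looks like df.iloc[i:i + window].to_numpy()
    return view.transpose(0, 2, 1)


# Values printed in the question
df = pd.DataFrame({
    "A": [0.44, 0.46, 0.46, 0.85, 0.78],
    "B": [0.41, 0.47, 0.02, 0.82, 0.76],
})
X = np.array([[0.93], [0.83]])

windows = get_sliding_window(df, 2)
result = windows.dot(X)  # one (window, 1) block per window position
print(windows.shape, result.shape)
```

The view shares memory with the underlying array instead of copying each window, which is where most of the speed-up over a Python-level apply would be expected to come from.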
Runtime test -
    In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

    In [102]: X = np.array([2, 3])

    In [103]: rolled_df = roll(df, 2)

    In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
    100 loops, best of 3: 5.51 ms per loop

    In [105]: %timeit get_sliding_window(df, 2).dot(X)
    10000 loops, best of 3: 43.7 µs per loop
Verify results -
    In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
    Out[106]:
          0     1
    1  2.70  4.09
    2  4.09  2.52
    3  2.52  1.78
    4  1.78  3.50

    In [107]: get_sliding_window(df, 2).dot(X)
    Out[107]:
    array([[ 2.7 ,  4.09],
           [ 4.09,  2.52],
           [ 2.52,  1.78],
           [ 1.78,  3.5 ]])
Huge improvement there, which I am hoping would stay noticeable on larger arrays!
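For comparison, here is an alternative sketch that stays within pandas. It assumes pandas 1.1 or later, where a Rolling object can be iterated and yields each window as a DataFrame; the helper name rolling_dot is hypothetical. It is likely slower than the strided view, but each window arrives as a full two-dimensional block, which is the shape the question was looking for.

```python
import numpy as np
import pandas as pd


def rolling_dot(df, X, window):
    """Hypothetical helper: multiply every full window of df by X."""
    labels, rows = [], []
    for w in df.rolling(window):
        # Iteration also yields the shorter windows at the start; skip them
        if len(w) < window:
            continue
        labels.append(w.index[-1])              # label of the window's last row
        rows.append(w.to_numpy().dot(X).ravel())
    return pd.DataFrame(np.vstack(rows), index=labels)


# Values printed in the question
df = pd.DataFrame({
    "A": [0.44, 0.46, 0.46, 0.85, 0.78],
    "B": [0.41, 0.47, 0.02, 0.82, 0.76],
})
X = np.array([[0.93], [0.83]])

result = rolling_dot(df, X, 2)
print(result)
```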
