python - Creating feature (row) vectors for an SVM with the pandas GroupBy function (and suggested alternatives)

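I am trying to create stacked feature vectors for an SVM classifier. All of my data sits in one large matrix. This is a multi-class classification problem, so I need to group using a multi-index.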

Here is a toy example of what I am trying to achieve.

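import string

import numpy as np
import pandas as pd

N = 4
col_ids = string.ascii_uppercase[:N]  # 'ABCD'; string.letters only exists in Python 2
df = pd.DataFrame(
      np.random.randint(10, size=(16,N)),       # random integers in [0, 10)
      columns=['col_{}'.format(letter) for letter in col_ids])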

test_cols = ['test1','test1','test1','test1','test1','test1','test1','test1','test2','test2','test2','test2','test2','test2','test2','test2']
test_iter = [1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2]

df.insert(0, 'Activity', test_cols)
df.insert(1, 'Iteration', test_iter)

Output:

   Activity  Iteration  col_A  col_B  col_C  col_D
0     test1          1      7      2      9      7
1     test1          1      9      7      2      7
2     test1          1      4      4      2      2
3     test1          1      0      1      0      6
4     test1          2      3      5      3      3
5     test1          2      9      5      7      6
6     test1          2      9      5      8      6
7     test1          2      9      7      9      1
8     test2          1      3      2      5      5
9     test2          1      8      5      9      0
10    test2          1      8      6      3      9
11    test2          1      3      9      2      5
12    test2          2      0      4      4      1
13    test2          2      7      0      4      6
14    test2          2      5      4      0      9
15    test2          2      0      0      5      0

I use the following groupby to get the groups that suit my application:

g = df.groupby(["Activity", "Iteration"])

                      Activity  Iteration  col_A  col_B  col_C  col_D
Activity   Iteration
test1    1         0     test1          1      7      2      9      7
                   1     test1          1      9      7      2      7
                   2     test1          1      4      4      2      2
                   3     test1          1      0      1      0      6
         2         4     test1          2      3      5      3      3
                   5     test1          2      9      5      7      6
                   6     test1          2      9      5      8      6
                   7     test1          2      9      7      9      1
test2    1         8     test2          1      3      2      5      5
                   9     test2          1      8      5      9      0
                   10    test2          1      8      6      3      9
                   11    test2          1      3      9      2      5
         2         12    test2          2      0      4      4      1
                   13    test2          2      7      0      4      6
                   14    test2          2      5      4      0      9
                   15    test2          2      0      0      5      0
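
A GroupBy object is lazy, so printing `g` directly does not display a table like the one above. The following is a minimal sketch of one way to inspect the groups. It rebuilds a toy frame with the same layout as `df`; the seeded generator and the names `group_shapes` and `first` are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same layout as the question's df.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(16, 4)),
                  columns=['col_A', 'col_B', 'col_C', 'col_D'])
df.insert(0, 'Activity', ['test1'] * 8 + ['test2'] * 8)
df.insert(1, 'Iteration', [1, 1, 1, 1, 2, 2, 2, 2] * 2)

g = df.groupby(['Activity', 'Iteration'])

# Iterating yields (key, sub-frame) pairs, one per (Activity, Iteration).
group_shapes = {key: group.shape for key, group in g}
for key, shape in group_shapes.items():
    print(key, shape)

# A single group can be pulled out by its key tuple.
first = g.get_group(('test1', 1))
print(first)
```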

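Now I want to create feature vectors and store them in a new DataFrame, but using only two rows to build each feature vector. In the toy example, the test1 activity is performed twice, and all rows of one iteration share the same label, so here it has two labels: 1 and 2. Two rows at a time from each label should be stacked to create the desired output.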

I want to create four row vectors from test1, so that the complete output would (ideally) look like this:

test1 test1 ... test2
    7     4         5
    2     4         4
    9     2         0
    7     2         9
    9     0         0
    7     1         0
    2     0         5
    7     6         9

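I have not written the whole thing out, but I hope it is clear what I want to achieve. Basically, two rows become one stacked row vector (with the label on top), and that vector is one feature vector. Because I have multiple activities, I need several feature vectors per activity to train the SVM. For this example I would ideally get a pd.DataFrame with eight feature row vectors, so that the data goes from shape (16, 4) to (8, 8).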

I realise this is not explained wonderfully, so if you need further details, please let me know.

Thanks.

Best answer

You need to pass groupby a function that prepares the data for the final output, and then relabel the columns, like this:

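def f(x):
    # Flatten the group's feature values row by row into one vector.
    return pd.Series(x.to_numpy().ravel())

# Select the feature columns explicitly so that f never sees the grouping columns.
feature_cols = [c for c in df.columns if c.startswith('col_')]
res = df.groupby(["Activity", "Iteration"])[feature_cols].apply(f)
# One column per group; keep only the Activity name as the column label.
res = res.T
res.columns = res.columns.get_level_values("Activity")
print(df)
print(res)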

In my test this outputs (note that the data is random!):

   Activity  Iteration  col_A  col_B  col_C  col_D
0     test1          1      4      6      5      7
1     test1          1      5      9      5      4
2     test1          1      1      8      7      9
3     test1          1      4      8      1      9
4     test1          2      4      5      5      6
5     test1          2      6      3      8      6
6     test1          2      8      1      1      2
7     test1          2      5      1      8      1
8     test2          1      6      3      9      9
9     test2          1      4      9      9      7
10    test2          1      5      0      1      3
11    test2          1      5      8      9      5
12    test2          2      4      8      3      2
13    test2          2      8      9      9      4
14    test2          2      6      1      1      8
15    test2          2      6      4      4      8
    test1  test1  test2  test2
0       4      4      6      4
1       6      5      3      8
2       5      5      9      3
3       7      6      9      2
4       5      6      4      8
5       9      3      9      9
6       5      8      9      9
7       4      6      7      4
8       1      8      5      6
9       8      1      0      1
10      7      1      1      1
11      9      2      3      8
12      4      5      5      6
13      8      1      8      4
14      1      8      9      4
15      9      1      5      8
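
Most classifier APIs expect one sample per row rather than per column, so the transpose can be skipped when the vectors are meant for training. The following is a minimal sketch under that assumption. The seeded toy frame and the names `wide`, `X` and `y` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same layout as the question's df.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(16, 4)),
                  columns=['col_A', 'col_B', 'col_C', 'col_D'])
df.insert(0, 'Activity', ['test1'] * 8 + ['test2'] * 8)
df.insert(1, 'Iteration', [1, 1, 1, 1, 2, 2, 2, 2] * 2)

feature_cols = ['col_A', 'col_B', 'col_C', 'col_D']

# One flattened vector per (Activity, Iteration) group, one group per row.
wide = df.groupby(['Activity', 'Iteration'])[feature_cols].apply(
    lambda x: pd.Series(x.to_numpy().ravel()))

X = wide.to_numpy()                                      # feature matrix
y = wide.index.get_level_values('Activity').to_numpy()   # one label per row
print(X.shape, list(y))
```

Here `X` holds one 16-element vector per group and `y` holds the matching activity label, which is the shape most SVM implementations take as input.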

Getting 2 columns of 8 elements for each test is trickier, but you can use the same approach:

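def g(x):
    # Flatten the group, then split the values into two equal halves.
    values = x.to_numpy().ravel()
    half = len(values) // 2   # integer division; N/2 is a float in Python 3
    return pd.DataFrame({1: values[:half], 2: values[half:]})

# Select the feature columns explicitly so that g never sees the grouping columns.
feature_cols = [c for c in df.columns if c.startswith('col_')]
res = df.groupby(["Activity", "Iteration"])[feature_cols].apply(g).unstack()
r1 = res[1].T
r2 = res[2].T
# Temporary unique names such as 'test111' keep the halves in order when sorting.
r1.columns = [t + str(i) + "1" for t, i in r1.columns]
r2.columns = [t + str(i) + "2" for t, i in r2.columns]
res = pd.concat([r1, r2], axis=1).sort_index(axis=1)   # DataFrame.sort was removed
res = res.rename(columns={t: t[:-2] for t in res.columns})

print(df)
print(res)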

which outputs:

   Activity  Iteration  col_A  col_B  col_C  col_D
0     test1          1      0      8      1      7
1     test1          1      2      0      5      0
2     test1          1      2      6      6      6
3     test1          1      5      0      1      4
4     test1          2      4      5      6      8
5     test1          2      8      0      1      6
6     test1          2      6      7      2      4
7     test1          2      3      2      2      3
8     test2          1      5      2      1      9
9     test2          1      8      3      5      9
10    test2          1      3      7      7      1
11    test2          1      7      4      5      1
12    test2          2      9      2      4      0
13    test2          2      3      1      8      7
14    test2          2      1      2      7      8
15    test2          2      4      9      7      0
   test1  test1  test1  test1  test2  test2  test2  test2
0      0      2      4      6      5      3      9      1
1      8      6      5      7      2      7      2      2
2      1      6      6      2      1      7      4      7
3      7      6      8      4      9      1      0      8
4      2      5      8      3      8      7      3      4
5      0      0      0      2      3      4      1      9
6      5      1      1      2      5      5      8      7
7      0      4      6      3      9      1      7      0
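
The same idea can be generalised to any number of rows per vector. The following is a rough sketch, not the answer's own method: the helper `stack_rows` and its parameters are hypothetical names, and it assumes that every group's row count is divisible by `rows_per_vector`. It numbers the rows inside each group with `cumcount`, groups them into chunks, and flattens each chunk into one row.

```python
import numpy as np
import pandas as pd

def stack_rows(df, keys, feature_cols, rows_per_vector):
    """Flatten every `rows_per_vector` consecutive rows of each group into one row."""
    # Position of each row inside its group, turned into a chunk number.
    chunk = (df.groupby(keys).cumcount() // rows_per_vector).rename('chunk')
    grouped = df.groupby(keys + [chunk])[feature_cols]
    return grouped.apply(lambda x: pd.Series(x.to_numpy().ravel()))

# Hypothetical toy frame with the same layout as the question's df.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(16, 4)),
                  columns=['col_A', 'col_B', 'col_C', 'col_D'])
df.insert(0, 'Activity', ['test1'] * 8 + ['test2'] * 8)
df.insert(1, 'Iteration', [1, 1, 1, 1, 2, 2, 2, 2] * 2)

feature_cols = ['col_A', 'col_B', 'col_C', 'col_D']
vectors = stack_rows(df, ['Activity', 'Iteration'], feature_cols, rows_per_vector=2)
print(vectors)

# When the rows are already ordered by group, a plain reshape gives the same matrix.
reshaped = df[feature_cols].to_numpy().reshape(-1, 2 * len(feature_cols))
print(np.array_equal(vectors.to_numpy(), reshaped))
```

The result has one feature vector per row, indexed by (Activity, Iteration, chunk), so the labels stay attached to the vectors instead of being encoded in duplicate column names.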

Hope this helps.

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/21168193/