Problem description
This is my first question on Stack Overflow.
I have a pandas DataFrame like this:
       a  b  c  d
one    0  1  2  3
two    4  5  6  7
three  8  9  0  1
four   2  1  1  5
five   1  1  8  9
For each row, I want to extract the (column name, value) pairs whose value is 1, with each row's pairs kept in a separate list:
[ [(b,1.0)], [(d,1.0)], [(b,1.0),(c,1.0)], [(a,1.0),(b,1.0)] ]
I want to use gensim, a Python library that requires its corpus in this form.
Is there a smart way to do this, or to apply gensim to pandas data directly?
Recommended answer
Many gensim functions accept numpy arrays, so there may be a better way...
In [11]: is_one = np.where(df == 1)
In [12]: is_one
Out[12]: (array([0, 2, 3, 3, 4, 4]), array([1, 3, 1, 2, 0, 1]))
In [13]: df.index[is_one[0]], df.columns[is_one[1]]
Out[13]:
(Index([u'one', u'three', u'four', u'four', u'five', u'five'], dtype='object'),
Index([u'b', u'd', u'b', u'c', u'a', u'b'], dtype='object'))
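The session above comes from Python 2 (note the u'' prefixes). As a rough, self-contained sketch of the same step under Python 3, the DataFrame from the question is rebuilt here by hand; the name hits is only illustrative and is not part of the original answer:

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1], [2, 1, 1, 5], [1, 1, 8, 9]],
    index=["one", "two", "three", "four", "five"],
    columns=list("abcd"),
)

# np.where on a 2-D boolean array returns (row positions, column positions).
row_pos, col_pos = np.where(df.values == 1)

# Map the positions back to labels and pair them up (illustrative name: hits).
hits = list(zip(df.index[row_pos], df.columns[col_pos]))
print(hits)
```

Each entry of hits is a (row label, column label) pair for one cell equal to 1, in row-major order.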
To group the pairs by row, you could use iterrows:
from itertools import repeat
In [21]: [list(zip(df.columns[np.where(row == 1)], repeat(1.0)))
          for label, row in df.iterrows()
          if 1 in row.values]  # if you don't want empty [] for rows without 1
Out[21]:
[[('b', 1.0)],
[('d', 1.0)],
[('b', 1.0), ('c', 1.0)],
[('a', 1.0), ('b', 1.0)]]
In Python 2 the list call is not needed, because zip already returns a list.
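For reference, the row-wise grouping can also be written as a small standalone Python 3 script. This is only a sketch: the helper name rows_to_pairs and its target and weight parameters are made up for illustration, and a boolean mask is used in place of np.where, which should select the same columns.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1], [2, 1, 1, 5], [1, 1, 8, 9]],
    index=["one", "two", "three", "four", "five"],
    columns=list("abcd"),
)


def rows_to_pairs(frame, target=1, weight=1.0):
    """Hypothetical helper: one list of (column, weight) pairs per matching row."""
    corpus = []
    for _, row in frame.iterrows():
        # Boolean mask over the columns of this row.
        cols = frame.columns[(row == target).values]
        if len(cols):  # skip rows that contain no `target` at all
            corpus.append([(col, weight) for col in cols])
    return corpus


corpus = rows_to_pairs(df)
print(corpus)
```

If the downstream gensim code expects integer ids rather than column names, which is usually the case for its bag-of-words corpora, replacing col with frame.columns.get_loc(col) inside the comprehension may be one way to get them.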