Problem description
I have the following pandas structure:
col1 col2 col3 text
1 1 0 meaningful text
5 9 7 trees
7 8 2 text
I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via (mysparsematrix).toarray(). However, how can I add this information, with labels, to my original df? The target would look like:
col1 col2 col3 meaningful text trees
1 1 0 1 1 0
5 9 7 0 0 1
7 8 2 0 1 0
Update:
The solution makes the concatenation wrong even when I rename the original columns. Dropping columns with at least one NaN results in only 7 rows left, even though I use fillna(0) before starting to work with it.
Recommended answer
You can proceed as follows:
Load the data into a DataFrame:
import pandas as pd
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)
Output:
col1 col2 col3 text
0 1 1 0 meaningful text
1 5 9 7 trees
2 7 8 2 text
Tokenize the text column using sklearn.feature_extraction.text.TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
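Convert the tokenized data into a DataFrame:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())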
print(df1)
Output:
meaningful text trees
0 0.795961 0.605349 0.0
1 0.000000 0.000000 1.0
2 0.000000 1.000000 0.0
Concatenate the tokenized DataFrame to the original DataFrame:
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 text meaningful text trees
0 1 1 0 meaningful text 0.795961 0.605349 0.0
1 5 9 7 trees 0.000000 0.000000 1.0
2 7 8 2 text 0.000000 1.000000 0.0
If you want to drop the column text, you need to do that before the concatenation:
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)
Output:
col1 col2 col3 meaningful text trees
0 1 1 0 0.795961 0.605349 0.0
1 5 9 7 0.000000 0.000000 1.0
2 7 8 2 0.000000 1.000000 0.0
Complete code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_table("/tmp/test.csv", sep=r"\s+")
v = TfidfVectorizer()
x = v.fit_transform(df['text'])
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)