pandas - Using FeatureUnion in scikit-learn to combine two pandas columns for tfidf

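While using this as a model for spam classification, I would like to add the Subject as an additional feature alongside the body.

I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].

I receive the following error:
TypeError: 'FeatureUnion' object is not iterable

How can I use both df['Subject'] and df['body_text'] as features while running them through the pipeline function?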

from sklearn.pipeline import FeatureUnion
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
('count_vectorizer',  CountVectorizer(ngram_range=(1, 2))),
('tfidf_transformer',  TfidfTransformer()),
('classifier',  MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[test_indices]['ham/spam'].values

    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

Best answer

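FeatureUnion was not meant to be used that way. Instead, it takes two (or more) feature extractors / vectorizers and applies each of them to the input. It does not take data in its constructor the way the question shows.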
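To illustrate the point, here is a minimal sketch of what the FeatureUnion constructor expects: a list of (name, transformer) pairs. The toy corpus and the step names "words" and "chars" are made up for this example and do not come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, for illustration only.
texts = ["free money now", "meeting at noon", "free lunch at noon"]

# The constructor receives (name, transformer) pairs, not data.
union = FeatureUnion([
    ("words", CountVectorizer(analyzer="word")),
    ("chars", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# Every transformer sees the same input; the outputs are stacked column-wise.
features = union.fit_transform(texts)

# After fitting, transformer_list holds the fitted vectorizers.
n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)
print(features.shape, n_word, n_char)
```

The data only enters at fit_transform time, which is why passing the dataframe values to the constructor fails.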

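CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the strings together. This passes the text from both columns to the same CountVectorizer, so subject and body share one vocabulary.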

combined_2 = df['Subject'] + ' '  + df['body_text']
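As a rough sketch of how the concatenated column could feed the pipeline from the question, assuming a small made-up dataframe with the same column names. The fillna("") calls are an addition of this sketch: a missing value in either column would otherwise turn the concatenated string into NaN.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win a prize", "Lunch tomorrow", "Cheap pills", "Project update"],
    "body_text": [
        "Click here to claim your prize",
        "Are you free at noon?",
        "Buy cheap pills online now",
        "The report is attached",
    ],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# One string per row: subject, a space, then the body.
combined = df["Subject"].fillna("") + " " + df["body_text"].fillna("")

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# The pipeline receives a pandas Series of strings, which CountVectorizer accepts.
pipeline.fit(combined, df["ham/spam"])
predictions = pipeline.predict(combined)
print(predictions)
```

Because combined is a pandas Series, it can still be sliced with .iloc inside a cross-validation loop.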

An alternative is to run CountVectorizer, and optionally TfidfTransformer, on each column separately and then stack the results.

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
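One detail the stacking approach leaves implicit is how to handle held-out data. A possible sketch, assuming hypothetical train and test dataframes: fit each vectorizer on the training rows only, then reuse the fitted vectorizers with transform on the test rows so both matrices have the same columns.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train/test split, for illustration only.
train = pd.DataFrame({
    "Subject": ["Win a prize", "Lunch tomorrow"],
    "body_text": ["Click here to claim your prize", "Are you free at noon?"],
})
test = pd.DataFrame({
    "Subject": ["Prize inside"],
    "body_text": ["Claim your free lunch"],
})

subject_vectorizer = CountVectorizer()
body_vectorizer = CountVectorizer()

# Learn each vocabulary from the training rows, one vectorizer per column.
train_X = sp.hstack([
    subject_vectorizer.fit_transform(train["Subject"]),
    body_vectorizer.fit_transform(train["body_text"]),
], format="csr")

# Reuse the fitted vocabularies; calling fit_transform here would change the columns.
test_X = sp.hstack([
    subject_vectorizer.transform(test["Subject"]),
    body_vectorizer.transform(test["body_text"]),
], format="csr")

print(train_X.shape, test_X.shape)
```

Unlike the concatenation approach, the word "prize" in a subject and "prize" in a body end up in different columns here.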

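A third option is to implement your own transformer that extracts a dataframe column.

from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator is added so the transformer also supports get_params and cloning.
class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):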

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]

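In that case, you can use FeatureUnion on two pipelines, each containing your custom transformer followed by a CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(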
       DataFrameColumnExtracter('Subject'),
       CountVectorizer()
)

body_pipe = make_pipeline(
       DataFrameColumnExtracter('body_text'),
       CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)

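This feature union of pipelines takes the dataframe, and each pipeline processes its own column. The result is the concatenation of the term count matrices from the two given columns.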

sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be added as the first step in a larger pipeline.
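A sketch of what such a larger pipeline might look like, assuming a made-up six-row dataframe. The class name ColumnSelector, the toy data, and the use of cross_val_predict with cv=3 are choices made for this illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: return a single dataframe column as a Series."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.column]


# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win a prize", "Lunch tomorrow", "Cheap pills",
                "Project update", "Free money", "Meeting notes"],
    "body_text": ["Click here to claim your prize", "Are you free at noon?",
                  "Buy cheap pills online now", "The report is attached",
                  "Claim your free money today", "Notes from the meeting"],
    "ham/spam": ["spam", "ham", "spam", "ham", "spam", "ham"],
})

model = make_pipeline(
    # Step 1: one count matrix per column, stacked side by side.
    make_union(
        make_pipeline(ColumnSelector("Subject"), CountVectorizer(ngram_range=(1, 2))),
        make_pipeline(ColumnSelector("body_text"), CountVectorizer(ngram_range=(1, 2))),
    ),
    # Step 2: tf-idf weighting applied to the stacked counts.
    TfidfTransformer(),
    # Step 3: the classifier from the question.
    MultinomialNB(),
)

# The whole dataframe is passed in; each fold refits the vectorizers on its training rows.
predictions = cross_val_predict(model, df, df["ham/spam"], cv=3)
print(predictions)
```

Since the column extraction happens inside the pipeline, the cross-validation helper can slice the dataframe directly and there is no need to build combined_2 by hand.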

Regarding "pandas - Using FeatureUnion in scikit-learn to combine two pandas columns for tfidf", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34710281/