python - Adding an additional feature to a CountVectorizer matrix

I have a problem: I need to add one more feature (average word length) to the matrix of token counts created by scikit-learn's CountVectorizer. Say I have the following code:

from sklearn.feature_extraction.text import CountVectorizer

# list of tweets
texts = [(list of tweets)]

# list of the average word length of every tweet
# (word_length is my own helper, not shown here)
average_lengths = word_length(texts)

# tokenizer
count_vect = CountVectorizer(analyzer='word', ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)

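For every instance, the format should be (tokens, average word length). My first idea was to simply combine the two lists with the zip function, like this: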

x = zip(x_counts, average_lengths)

But when I try to fit my model, I get an error:

ValueError: setting an array element with a sequence.

Does anyone know how to solve this?

Best answer

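You can write your own transformer, as in this article, that gives you the average word length of each tweet, and then combine it with the CountVectorizer using FeatureUnion: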

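from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# AverageLenVectizer is the custom transformer you write yourself
vectorizer = FeatureUnion([
        ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
        ('av_len', AverageLenVectizer(...))
    ])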
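
To make the idea concrete, here is a minimal sketch of what such a transformer could look like. The class name AverageWordLength, the whitespace tokenization, and the two sample texts are assumptions made for illustration; they are not taken from the answer or the article it links to, and the answer's own AverageLenVectizer may be implemented differently.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class AverageWordLength(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: one column with each text's mean word length."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists only to satisfy the transformer interface.
        return self

    def transform(self, X):
        rows = []
        for text in X:
            words = text.split()  # assumption: split on whitespace
            mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
            rows.append([mean_len])
        # 2-D array of shape (n_samples, 1), one row per input text
        return np.array(rows, dtype=float)


# made-up sample data standing in for the list of tweets
texts = ["short tweet here", "another slightly longer example tweet"]

vectorizer = FeatureUnion([
    ("cv", CountVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("av_len", AverageWordLength()),
])

# token counts and the average-length column, stacked side by side
X = vectorizer.fit_transform(texts)
print(X.shape)
```

Returning a 2-D array with one row per text is what lets FeatureUnion concatenate the new column with the sparse count matrix, so the result can be passed to a model directly instead of going through zip.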

Regarding "python - Adding an additional feature to a CountVectorizer matrix", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34397613/
