I have a training dataset of 144 feedback entries: 72 positive and 72 negative. The two target labels are positive and negative. Consider the following code:
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
   data                                             target
0  facilitates good student teacher communication.  positive
1  lectures are very lengthy.                       negative
2  the teacher is very good at interaction.         positive
3  good at clearing the concepts.                   positive
4  good at clearing the concepts.                   positive
5  good at teaching.                                positive
6  does not shows test copies.                      negative
7  good subjective knowledge.                       positive
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
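I don't understand what the problem is. Please help.
Best answer
You are not using CountVectorizer correctly. This is what you have right now (here df is the DataFrame loaded from the CSV, feedback_data in your code):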
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
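So you can see that you are not getting what you want. Each row is not being transformed, and the CountVectorizer is not even fitted correctly, because you passed the whole DataFrame instead of just the corpus of comments. Iterating over a DataFrame yields its column names, so the vectorizer only ever sees two "documents", which is why X has 2 rows while target has 144 entries.
To fix this, we first need to make sure the counting itself works.
If you fit on the right corpus: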
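The effect of passing the whole DataFrame can be checked directly. The sketch below assumes a small hypothetical `df` that stands in for the CSV (three rows copied from the sample output in the question); the names `docs_seen`, `X_wrong` and `split_error` are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV: a few rows from the question's output.
df = pd.DataFrame({
    "data": [
        "lectures are very lengthy.",
        "good at teaching.",
        "good subjective knowledge.",
    ],
    "target": ["negative", "positive", "positive"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(df)

# So the vectorizer learns its vocabulary from the strings "data" and "target".
cv = CountVectorizer(binary=True)
X_wrong = cv.fit_transform(df)
vocab = sorted(cv.vocabulary_)

# X_wrong has 2 rows but there are 3 labels, so the split is rejected.
labels = [0, 1, 1]
try:
    train_test_split(X_wrong, labels, train_size=0.5)
    split_error = None
except ValueError as exc:
    split_error = str(exc)

print(docs_seen, X_wrong.shape, vocab)
print(split_error)
```

With the full 144-row dataset the same mismatch shows up as 2 rows in X against 144 labels.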
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
with 0 stored elements in Compressed Sparse Row format>
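You can see that we are getting closer to what we want: the vocabulary now has 23 terms, but transform(df) still only looks at the two column names, so nothing is stored. We just need to transform the data properly (transform each row):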
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
...
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
Now we have a more suitable X! We just need to check that it can be split:
target = [1 if i<72 else 0 for i in range(8)] # The dataset here has only 8 rows, so every label is 1
# the line below no longer raises an error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
And it works!
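As an alternative to the per-row apply, the text column can be handed to the vectorizer in one call, since it accepts any iterable of strings and returns a single sparse matrix with one row per document. This is a minimal sketch and not the answer's own method; it assumes a hypothetical `df` built from the eight rows printed in the question, reads the labels from the `target` column instead of hard-coding them by position, and uses `random_state=0` only to make the split repeatable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV: the eight rows printed in the question.
df = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

cv = CountVectorizer(binary=True)
# Fit and transform on the text column only: one row per feedback entry.
X = cv.fit_transform(df["data"])

# Derive 1/0 labels from the target column (1 = positive, 0 = negative).
target = (df["target"] == "positive").astype(int)

# X and target now have the same number of samples, so the split succeeds.
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size=0.50, random_state=0  # random_state is an assumption
)

print(X.shape)  # (8, 23)
```

Here X is a single 2-D matrix of shape (n_samples, n_features), which is the layout scikit-learn estimators expect as input.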
You need to make sure you understand how CountVectorizer works in order to use it correctly.
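One point worth checking is the split between fit and transform: fit learns the vocabulary, and transform maps new text onto that fixed vocabulary, ignoring words it has not seen. The sketch below uses made-up `train_docs` and `test_docs` (not from the question's data) to illustrate this, which is also what a call like cv.transform(feedback_data_test) relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example texts, for illustration only.
train_docs = ["good at teaching.", "lectures are very lengthy."]
test_docs = ["very good lectures, excellent teaching."]

cv = CountVectorizer(binary=True)
cv.fit(train_docs)  # learns the vocabulary from the training text only

# Terms in column order (the value in vocabulary_ is the column index).
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# New text is encoded against the learned vocabulary;
# "excellent" was never seen during fit, so it is dropped.
X_test = cv.transform(test_docs)

print(vocab)
print(X_test.toarray())
```

The test matrix has the same number of columns as the training vocabulary, so a model trained on the training matrix can score it directly.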
Regarding machine-learning - Found input variables with inconsistent numbers of samples: [2, 144], a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54863302/