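I am developing a text topic classifier that labels sentences or short questions.
So far it can label about 30 known topics.
It works well, but it has started to confuse similar questions with one another.
For example, take these 3 labels:
1) Label - backup_proxy_intranet:
How to set up a backup proxy for intranet app?
... and 140 similar questions containing "backup proxy for intranet app" ...
2) Label - Smartphone_intranet:
How to use intranet app in my smartphone?
... and 140 similar questions containing "intranet app in my smartphone" ...
3) Label - ticket_intranet:
How to link a ticket order with intranet app?
... and 140 similar questions containing "ticket order with intranet app" ...
After training, all 3 of these always return the label backup_proxy_intranet.
What can I do to separate them?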
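import pickle
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # assumed: SMOTE as a pipeline step needs imblearn's pipeline
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# `series` (with 'phrase' and 'target' columns), `stemmer` and `stops` are defined elsewhere (not shown).
series = series.dropna()
series = shuffle(series)
# Stem every token of every phrase.
X_stemmed = []
for x_t in series['phrase']:
    stemmed_text = [stemmer.stem(i) for i in word_tokenize(x_t)]
    X_stemmed.append(' '.join(stemmed_text))
# Remove stop words from the stemmed phrases.
x_normalized = []
for x_t in X_stemmed:
    temp_corpus = x_t.split(' ')
    corpus = [token for token in temp_corpus if token not in stops]
    x_normalized.append(' '.join(corpus))
X_train, X_test, y_train, y_test = train_test_split(x_normalized, series['target'], random_state=0, test_size=0.20)
# Bag of words over unigrams, bigrams and trigrams.
vect = CountVectorizer(ngram_range=(1, 3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
# Oversample the minority classes with SMOTE, then fit a logistic regression.
sampler = SMOTE()
model = make_pipeline(sampler, LogisticRegression())
print()
print("-->Model: ")
print(model)
print()
print("-->Training... ")
model.fit(X_train_vectorized, y_train)
# Persist the trained model and the fitted vectorizer.
filename = '/var/www/html/python/intraope_bot/lib/textTopicClassifier.model'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
filename2 = '/var/www/html/python/intraope_bot/lib/textTopicClassifier.vector'
with open(filename2, 'wb') as f:
    pickle.dump(vect, f)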
Best regards!
Best answer
I think you may want to use sklearn's TfidfVectorizer: it should help you improve your score!
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     "Label - backup_proxy_intranet: How to set up a backup proxy for intranet app? ... and 140 similar questions containing 'backup proxy for intranet app'",
...     "Label - smartphone_intranet: How to use intranet app in my smartphone? ... and 140 similar questions containing 'intranet app in my smartphone'",
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
>>> print(vectorizer.get_feature_names_out())
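To see why TF-IDF might help here, the sketch below computes the weights by hand, in plain Python, for three toy phrases modelled on the labels in the question. It roughly follows the smoothed IDF and L2 normalisation that TfidfVectorizer applies by default, but it uses a simple whitespace tokenizer, and the `docs` dictionary and the `tfidf` helper are illustrative assumptions rather than anything from the original post.

```python
import math
from collections import Counter

# Toy phrases modelled on the three labels from the question (illustrative only).
docs = {
    "backup_proxy_intranet": "backup proxy for intranet app",
    "smartphone_intranet": "intranet app in my smartphone",
    "ticket_intranet": "ticket order with intranet app",
}


def tfidf(docs):
    """Return {label: {term: weight}} using smoothed IDF and L2 normalisation."""
    tokenized = {label: text.lower().split() for label, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many phrases each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for label, tokens in tokenized.items():
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the term count.
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # Scale each phrase vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights[label] = {t: w / norm for t, w in raw.items()}
    return weights


weights = tfidf(docs)
for label, w in weights.items():
    # Show the three highest-weighted terms of each phrase.
    top = sorted(w.items(), key=lambda kv: -kv[1])[:3]
    print(label, [(t, round(v, 2)) for t, v in top])
```

Words shared by every label, such as "intranet" and "app", end up with a lower weight than words that occur under only one label, such as "backup" or "smartphone". Plain counts from CountVectorizer treat both kinds of word the same, which is one plausible reason the three labels are hard to tell apart.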
Regarding machine-learning - How can I improve my text topic classifier?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54660376/