Problem description
I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.
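from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Each line of the file is treated as one document
corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=30)
kmresult = km.fit(tfidf).predict(tfidf)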
But when I have new test content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new test content and make sure the result has the same array length as the training data.
Thanks in advance.
Update
I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.
Update
For example, I have the training data:
["a", "b", "c"]
["a", "b", "d"]
After TF-IDF, the result will contain 4 features (a, b, c, d).
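To make the feature count concrete, here is a minimal sketch of the training step on the two example documents. It assumes the token lists are joined into space-separated strings, and it uses `TfidfVectorizer` (which combines `CountVectorizer` and `TfidfTransformer`) for brevity. The `token_pattern` override is an assumption added for this illustration, because the default pattern skips single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The two training documents from the example, as space-separated strings
train_docs = ["a b c", "a b d"]

# Override token_pattern: the default only keeps tokens of 2+ characters
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
train_tfidf = vectorizer.fit_transform(train_docs)

# vocabulary_ maps each term to its column index; sort terms by that index
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
n_features = train_tfidf.shape[1]

print(feature_names, n_features)
```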
When I test:
["a", "c", "d"]
to see which cluster (already created by k-means) it belongs to, TF-IDF will only give a result with 3 features (a, c, d), so the k-means prediction will fail. (If I test ["a", "b", "e"], there may be other problems.)
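The mismatch described here can be reproduced in a few lines. This is a sketch under the same assumptions as the example data (space-separated strings, a `token_pattern` that keeps single-character tokens). It contrasts fitting a fresh vectorizer on the test document with calling `transform` on a vectorizer that was already fitted on the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

pattern = r"(?u)\b\w+\b"  # keep single-character tokens (assumption for this demo)

# Vectorizer fitted once on the training data
trained = TfidfVectorizer(token_pattern=pattern).fit(["a b c", "a b d"])

# Fitting a new vectorizer on the test document learns only its own terms
refit_shape = TfidfVectorizer(token_pattern=pattern).fit_transform(["a c d"]).shape

# Reusing the trained vectorizer keeps the training feature space
reused_shape = trained.transform(["a c d"]).shape

# A term never seen in training ("e") is silently ignored by transform
unseen = trained.transform(["a b e"])

print(refit_shape, reused_shape, unseen.shape)
```

With `transform`, the column count stays fixed at the training vocabulary size, and unknown terms contribute nothing instead of adding a column.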
So how can I store the feature list for use with the test data (ideally, in a file)?
Update
Solved, see the answers below.
Recommended answer
I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).
The code is as follows:
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)
# Load it later
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
# Note: fit_transform on the transformer recomputes the IDF weights from the new text only
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))
That works. tfidf will have the same feature length as the training data.
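The answer above keeps the feature length fixed, but the IDF weights are refitted on the new text. If the trained IDF values should be reused as well, as the question asks, one possible approach is to pickle the fitted objects themselves and call `transform` instead of `fit_transform` on new content. The sketch below is an illustration, not the answerer's method: the four-document corpus, the two-cluster setting, and the file name `tfidf_kmeans.pkl` are made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the real training data
corpus = ["aaa bbb ccc", "aaa bbb ddd", "eee fff ggg", "eee fff hhh"]

# Training: fit the vectorizer, the IDF weights, and the clusters
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
train_tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_tfidf)

# Save all three fitted objects together (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
with open(path, "wb") as f:
    pickle.dump((vectorizer, transformer, km), f)

# Later: load them back and only transform/predict, never refit
with open(path, "rb") as f:
    loaded_vec, loaded_tf, loaded_km = pickle.load(f)

new_tfidf = loaded_tf.transform(loaded_vec.transform(["aaa ccc zzz"]))
new_label = loaded_km.predict(new_tfidf)

print(new_tfidf.shape, new_label)
```

Because nothing is refitted, the new vector has the same number of columns as the training matrix and uses the trained `idf_` weights. Pickled scikit-learn objects are generally best loaded with the same library version that created them.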