Problem description
I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.
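from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Each line of the file is treated as one document
corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)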
But when I have new test content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new test content and make sure the result for the new content has the same array length.
Thanks.
Update
I may need to save the "transformer" or "tfidf" variable to a file (txt or some other format), if one of them contains the trained IDF result.
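For what it's worth, here is a quick way to check where the trained IDF values live. In scikit-learn, a fitted TfidfTransformer exposes the learned weights as its idf_ attribute, while tfidf is only the weighted matrix for the training documents. A minimal sketch, assuming a toy corpus made up for illustration rather than the original token file:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, invented for illustration (not the original token file).
corpus = ["aaa bbb ccc", "aaa bbb ddd"]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# The vocabulary learned by the vectorizer (term -> column index).
print(sorted(vectorizer.vocabulary_))
# The learned IDF weights: one value per vocabulary term.
print(transformer.idf_.shape)
# `tfidf` holds the weighted training rows, not the IDF table itself.
print(tfidf.shape)
```

So of the two variables, "transformer" is the one that carries the IDF weights, and the vectorizer is what fixes the column layout.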
Update
For example, I have the training data:
["a", "b", "c"]
["a", "b", "d"]
After running TF-IDF, the result will contain 4 features (a, b, c, d).
When I test:
["a", "c", "d"]
to see which cluster (already built by k-means) it belongs to, TF-IDF will only give a result with 3 features (a, c, d), so the k-means clustering will fail. (If I test ["a", "b", "e"], there may be other problems.)
So how can I store the feature list for use with test data (and, ideally, store it in a file)?
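The mismatch described above can be reproduced in a few lines. This is only an illustrative sketch: the tokens are lengthened to "aaa", "bbb" and so on because CountVectorizer's default token pattern drops single-character tokens, and the variable names are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["aaa bbb ccc", "aaa bbb ddd"]
test = ["aaa ccc ddd"]

train_vec = CountVectorizer()
train_counts = train_vec.fit_transform(train)

# Fitting a fresh vectorizer on the test text learns only the test terms.
refit_counts = CountVectorizer().fit_transform(test)

# Reusing the already fitted vectorizer keeps the training columns.
reused_counts = train_vec.transform(test)

# A term never seen in training ("eee") is simply ignored by transform.
unseen_counts = train_vec.transform(["aaa bbb eee"])

print(train_counts.shape[1], refit_counts.shape[1], reused_counts.shape[1])
print(unseen_counts.sum())
```

The refit matrix has a different width from the training matrix, which is why k-means cannot use it, whereas the matrix produced by the fitted vectorizer keeps the training layout.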
Update
Solved, see the answer below.
Recommended answer
I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).
The code is as follows:
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_ (term -> column index mapping)
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)
# Load it later and rebuild a vectorizer with the same fixed vocabulary
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))
That works. tfidf will have the same feature length as the training data.
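One caveat: the code above fits a new TfidfTransformer on the new text, so the IDF weights come from the test content rather than from training. If the goal is to reuse the trained IDF values as well, one possible variant is to pickle the fitted vectorizer, transformer and k-means model together and call only transform and predict afterwards. A minimal sketch under those assumptions; the toy corpus, the cluster count and the file name tfidf_kmeans.pkl are invented for illustration:

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training corpus, invented for illustration.
corpus = ["aaa bbb ccc", "aaa bbb ddd", "ccc ddd eee", "eee fff ggg"]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf_train = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf_train)

# Hypothetical file name; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
with open(path, "wb") as f:
    pickle.dump((vectorizer, transformer, km), f)

# Later: load the fitted objects and only call transform/predict (no refit).
with open(path, "rb") as f:
    loaded_vec, loaded_tf, loaded_km = pickle.load(f)

new_docs = ["aaa ccc zzz"]  # "zzz" was never seen in training and is ignored
tfidf_new = loaded_tf.transform(loaded_vec.transform(new_docs))
labels = loaded_km.predict(tfidf_new)
print(tfidf_new.shape, labels)
```

Because nothing is refit after loading, the new rows use the training vocabulary and the training IDF weights, so they are directly comparable with the cluster centers. Pickled scikit-learn objects are generally best loaded with the same library version that saved them.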