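I have a directory full of .txt files (documents). First I load the documents, strip some brackets, and remove some quotes, so that the documents look like this, for example: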

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods


So I load the files from the directory as follows:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")


Then I try to vectorize document1 and document2 to create a training matrix, like this:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()


This is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words


Given this, how can I create the vector representation? I assumed that documents held the loaded files, but apparently it does not.

Best answer

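What is the content of documents? It looks like it should be a list of file names or a list of strings containing the tokens. Also, you should call fit_transform on the object rather than as if it were a static method, i.e. vectorizer.fit_transform(documents).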

For example, this works here:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
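To connect this back to the question, here is a minimal sketch of building documents as a list with one string per file, instead of joining everything into a single string. The helper name load_documents, the "*.txt" pattern, and the UTF-8 encoding are assumptions made for illustration; the question's own load function is not shown, so this is not a reconstruction of it.

```python
from pathlib import Path


def load_documents(directory):
    """Return one cleaned string per .txt file in `directory`.

    Hypothetical helper for illustration; not the `load` used in the question.
    """
    documents = []
    # Sort the paths so the row order of the resulting matrix is reproducible.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")  # encoding is an assumption
        # Drop the same characters the question strips: quotes, commas, parentheses.
        cleaned = "".join(ch for ch in text if ch not in "',()")
        # Collapse runs of whitespace so each document is one line of tokens.
        documents.append(" ".join(cleaned.split()))
    return documents
```

The returned list can be passed directly to vectorizer.fit_transform(...), as in the example above, which should give one row per file.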

Regarding python - Problems with fitting a vocabulary in scikit-learn?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27631797/
