Problem description
I have a corpus like this:
X_train = [['this is an dummy example'],
           ['in reality this line is very long'],
           ...
           ['here is a last text in the training set']
          ]
and some labels:
y_train = [1, 5, ... , 3]
I would like to use Pipeline and GridSearch as follows:
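from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Counts -> tf-idf weights -> linear regressor
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])
# Keys follow the <step name>__<parameter> convention
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
grid_search.fit(X_train, y_train)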
When I run this, I get an error saying AttributeError: lower not found.
I searched and found a question about this error here, which led me to believe that the problem was my text not being tokenized (which sounded like it hit the nail on the head, since I was using a list of lists as input data, where each inner list contained one single unbroken string).
I cooked up a quick and dirty tokenizer to test this theory:
def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist
which does what it is supposed to, but when I use it in the arguments to the CountVectorizer:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),
...I still get the same error as if nothing happened.
I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline, which is strange. I didn't think you could use the TfidfTransformer() without first having a data structure to transform, in this case the matrix of counts.
Why do I keep getting this error? Actually, it would be nice to know what this error means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace.) Am I misusing the Pipeline, or is the problem really an issue with the arguments to the CountVectorizer alone?
Any advice would be greatly appreciated.
Recommended answer
This happens because your dataset is in the wrong format. You should pass "An iterable which yields either str, unicode or file objects" to CountVectorizer's fit function (or to the pipeline's; it doesn't matter), not an iterable over other iterables that contain the texts, as in your code. In your case the list is the iterable, so you should pass a flat list whose members are strings, not other lists.
That is, your dataset should look like:
X_train = ['this is an dummy example',
'in reality this line is very long',
...
'here is a last text in the training set'
]
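If the data already exists in the nested form from the question, one way to get to this shape is a list comprehension that pulls the single string out of each inner list. This is only a minimal sketch, assuming every inner list holds exactly one string; the three sample texts and the names X_nested and X_flat are for illustration only.

```python
# Nested corpus in the shape used in the question: one string per inner list.
X_nested = [
    ['this is an dummy example'],
    ['in reality this line is very long'],
    ['here is a last text in the training set'],
]

# Pull the single string out of each inner list to get a flat list of str.
X_flat = [doc[0] for doc in X_nested]

# Every element is now a plain string rather than a list.
assert all(isinstance(doc, str) for doc in X_flat)
print(X_flat)
```

X_flat can then be passed to grid_search.fit in place of the nested list, together with the unchanged y_train.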
Take a look at this example, which is very useful: Sample pipeline for text feature extraction and evaluation
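As for what the error means: by default CountVectorizer lowercases each document (its lowercase parameter defaults to True) before tokenizing it, so it calls lower() on whatever the iterable yields. With a list of lists that object is a list, which has no such method. Because the lowercasing step runs before the tokenizer, passing a custom tokenizer does not help. The sketch below imitates that order of operations in plain Python; the function analyze is a hypothetical stand-in for illustration, not scikit-learn's actual code.

```python
def analyze(doc, tokenizer=str.split):
    """Hypothetical stand-in for a vectorizer's analyzer: lowercase, then tokenize."""
    lowered = doc.lower()  # fails here if doc is a list instead of a str
    return tokenizer(lowered)


# A plain string is lowercased and split into tokens.
tokens = analyze('This is an dummy example')

# An inner list from the nested corpus fails before the tokenizer is reached.
try:
    analyze(['this is an dummy example'])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(tokens)
print(error_message)
```

The same reasoning suggests why my_tokenizer from the question would not fit even with flat input: the tokenizer argument is called on one document at a time, not on the whole corpus.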