Problem Description
I was trying to use an example from the NLTK book as the regex pattern inside scikit-learn's CountVectorizer. I see examples with simple regexes, but not with something like this:
pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*              # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?        # currency & percentages
  | \.\.\.                  # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'-ridden', u''),
(u'', u'', u''),
(u'', u'', u'')]
With nltk, I get something entirely different:
nltk.regexp_tokenize(text,pattern)
['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
Is there a way to get the skl CountVectorizer to output the same thing? I was hoping to use some of the other handy features that are incorporated in the same function call.
Recommended Answer
TL;DR
from functools import partial
from nltk.tokenize import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
Now for the actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does an re.findall with the pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]:
[('', '', ''),
('', '', ''),
('C.', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '-ridden', ''),
('', '', ''),
('', '', '')]
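The empty tuples come from re.findall's group semantics: when a pattern contains one or more capturing groups, findall returns the groups (as tuples, if there are several) rather than the full text of each match. A minimal illustration of the difference (ordinary Python, not from the original answer):

```python
import re

# With a capturing group, findall returns the group's content
# (the last repetition), not the full matched text:
print(re.findall(r'(ab)+', 'ababab'))    # ['ab']

# With a non-capturing group (?:...), findall returns the whole match:
print(re.findall(r'(?:ab)+', 'ababab'))  # ['ababab']
```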
You'll either have to rewrite this pattern to make it work in scikit-learn style, or plug the NLTK tokenizer into scikit-learn:
In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]:
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
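The other route mentioned above, rewriting the pattern in scikit-learn style, amounts to making every group non-capturing with (?:...) so that re.findall returns full matches instead of group tuples. A sketch of that rewrite (my own, not part of the original answer):

```python
import re

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'

# Same pattern as in the question, but with every group made
# non-capturing so re.findall yields whole tokens, not tuples:
pattern = r'''(?x)          # verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency & percentages
  | \.\.\.                  # ellipses
'''

print(re.findall(pattern, text))
# ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of',
#  'its', 'traffic-ridden', 'streets', '...']
```

Note that if you then pass this as token_pattern to CountVectorizer, the vectorizer lowercases the text by default, which would defeat the [A-Z]\. branch; pass lowercase=False to keep abbreviations like N.Y.C. intact.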