Implementing skip-grams with scikit-learn?

Question

Is there any way to implement skip-grams in the scikit-learn library? I have manually generated a list of k-skip-n-grams and passed it as the vocabulary argument of CountVectorizer().

Unfortunately, its prediction performance is very poor: only 63% accuracy. However, I get an accuracy of 77-80% with CountVectorizer() when I use ngram_range=(min, max) from the default code instead.

Is there a better way to implement skip-grams in scikit-learn?

Here is the relevant part of my code:

corpus = GetCorpus()  # reads the text from a file and returns it as a list

vocabulary = list(GetVocabulary(corpus, k, n))
# returns the k-skip-n-grams found in the corpus

vec = CountVectorizer(
          tokenizer=lambda x: x.split(),
          ngram_range=(2,2),
          stop_words=stopWords,
          vocabulary=vocabulary)

Recommended answer

To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. The vocabulary only selects which features are counted; the features themselves still come from the analyzer, which produces contiguous n-grams. You need to modify the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams:

from toolz import compose
from toolz.curried import map as cmap, sliding_window, pluck
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):
    def build_analyzer(self):
        # reuse CountVectorizer's own preprocessing, tokenizing and stop-word list
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        # decode -> preprocess -> tokenize, then turn the tokens into skip-grams
        return lambda doc: self._word_skip_grams(
                compose(tokenize, preprocess, self.decode)(doc),
                stop_words)

    def _word_skip_grams(self, tokens, stop_words=None):
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # slide a 3-token window and join its first and last token (middle one skipped)
        return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)
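The point about the vocabulary can be checked with a small experiment. The sketch below is only an illustration: the sentence and the vocabulary entries are made up for the demonstration, and it assumes the default word analyzer with ngram_range=(2, 2). Skip-gram strings passed as the vocabulary are never emitted by the analyzer, so their counts stay at zero, while contiguous bigrams are counted as usual.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the rain in spain falls mainly on the plain"]

# Skip-gram strings supplied only as a vocabulary (illustrative entries).
skip_vocab = ["the in", "rain spain", "in falls"]
vec = CountVectorizer(ngram_range=(2, 2), vocabulary=skip_vocab)
counts = vec.fit_transform(docs).toarray()
print(counts)  # every column is zero: the analyzer never emits these strings

# Contiguous bigrams, by contrast, are produced by the analyzer and counted.
contig_vocab = ["the rain", "rain in"]
contig_counts = CountVectorizer(
    ngram_range=(2, 2), vocabulary=contig_vocab
).fit_transform(docs).toarray()
print(contig_counts)
```

A feature matrix made up mostly of zero columns would be one plausible explanation for the low accuracy reported in the question.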

For instance, applying SkipGramVectorizer to this Wikipedia example sentence,

text = ['the rain in Spain falls mainly on the plain']

vect = SkipGramVectorizer()
vect.fit(text)
vect.get_feature_names_out()  # use get_feature_names() on scikit-learn versions before 1.0

this vectorizer would yield the following tokens,

['falls on',  'in falls',  'mainly the',  'on plain',
 'rain spain',  'spain mainly',  'the in']
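Note that the tokens above contain only pairs with exactly one word skipped. Under the usual definition, k-skip-n-grams also include the n-grams with fewer skips, so 1-skip-2-grams would contain the ordinary bigrams as well. The following is a minimal sketch of a general generator without toolz. The names skip_grams and skip_gram_analyzer are hypothetical helpers, not part of scikit-learn, and the regular expression is assumed to mirror CountVectorizer's default token pattern.

```python
import re
from itertools import combinations

# Assumed to match CountVectorizer's default token_pattern (words of 2+ characters).
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def skip_grams(tokens, n=2, k=1):
    """Return all n-grams over `tokens` that skip at most k tokens in total."""
    grams = []
    for start in range(len(tokens) - n + 1):
        # Candidates that may follow tokens[start] within a span of n + k tokens.
        window = tokens[start + 1 : start + n + k]
        for rest in combinations(window, n - 1):
            grams.append(" ".join((tokens[start],) + rest))
    return grams

def skip_gram_analyzer(doc, n=2, k=1):
    """Lowercase and tokenize a raw document, then emit its k-skip-n-grams."""
    tokens = TOKEN_RE.findall(doc.lower())
    return skip_grams(tokens, n=n, k=k)

example = skip_gram_analyzer("The rain in Spain")
print(example)
```

A function like this could be passed as CountVectorizer(analyzer=skip_gram_analyzer). When analyzer is a callable, scikit-learn hands it the raw document, so lowercasing and tokenizing have to happen inside the function, and the stop_words and ngram_range settings are not applied.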