
word2vec module: deep learning with word2vec

Produce word vectors with deep learning via word2vec's skip-gram and CBOW models, using either hierarchical softmax or negative sampling.

NOTE: word2vec is not the only way to obtain word vectors in gensim. See also FastText and the wrapped VarEmbed and WordRank models.

The algorithm has been ported from the original C implementation at https://code.google.com/p/word2vec/ and extended with additional functionality.

For a tutorial on gensim word2vec, with an interactive web app trained on GoogleNews, visit http://radimrehurek.com/2014/02/word2vec-tutorial/

Make sure you have a C compiler before installing gensim, so that the optimized word2vec routines can be compiled (roughly a 70x speedup over the plain NumPy implementation).

Initialize a model with, for example:

>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

Persist a model to disk with:

>>> model.save(fname)
>>> model = Word2Vec.load(fname)  # you can continue training with the loaded model!

The word vectors are stored in a KeyedVectors instance in model.wv. This separates the read-only word vector lookup operations in KeyedVectors from the training code in Word2Vec:

>>> model.wv['computer']  # numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance.
NOTE: It is impossible to continue training the vectors loaded from the C format because hidden weights, vocabulary frequency and the binary tree is missing:

>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format

You can perform various NLP word tasks with the model. Some of them are already built-in:

>>> model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

>>> model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[('queen', 0.71382287), ...]


>>> model.wv.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

>>> model.wv.similarity('woman', 'man')
0.73723527

Probability of a text under the model:

>>> model.score(["The fox jumped over a lazy dog".split()])
0.2158356

Correlation with human opinion on word similarity:

>>> model.wv.evaluate_word_pairs(os.path.join(module_path, 'test_data','wordsim353.tsv'))
0.51, 0.62, 0.13

And on analogies:

>>> model.wv.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))

and so on.
If you’re finished training a model (i.e. no more updates, only querying), then switch to the gensim.models.KeyedVectors instance in wv

>>> word_vectors = model.wv
>>> del model

to trim unneeded model memory = use much less RAM.

Note that there is a gensim.models.phrases module which lets you automatically detect phrases longer than one word. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:
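
A minimal sketch of that workflow, assuming gensim's Phrases model and a toy in-memory corpus (the sentences and thresholds below are illustrative only):

>>> from gensim.models import Phrases, Word2Vec
>>>
>>> sentences = [["new", "york", "times", "reports"], ["the", "financial", "crisis", "deepens"]]
>>> bigram = Phrases(sentences, min_count=1, threshold=1)   # detect frequent word pairs
>>> phrased = [bigram[sentence] for sentence in sentences]  # pairs above the threshold are joined with '_'
>>> model = Word2Vec(phrased, min_count=1)                  # "words" may now be multiword expressions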

The API reference follows:

class gensim.models.word2vec.BrownCorpus(dirname)
Bases: object

Iterate over sentences from the Brown corpus (part of NLTK data).
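
A minimal usage sketch; the path below is hypothetical and depends on where your NLTK data is installed:

>>> from gensim.models.word2vec import BrownCorpus
>>> from gensim.models import Word2Vec
>>>
>>> sentences = BrownCorpus('/path/to/nltk_data/corpora/brown')  # hypothetical path to the Brown corpus
>>> model = Word2Vec(sentences, size=100, min_count=5, workers=4)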
class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
Bases: object
Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).

Example:

sentences = LineSentence('myfile.txt')
Or for compressed files:

sentences = LineSentence('compressed_text.txt.bz2')
sentences = LineSentence('compressed_text.txt.gz')
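
The iterator can be passed straight to Word2Vec; a sketch, assuming 'myfile.txt' from the example above exists on disk:

>>> from gensim.models.word2vec import LineSentence
>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(LineSentence('myfile.txt'), size=100, window=5, min_count=5, workers=4)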
class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)
Bases: object
Works like word2vec.LineSentence, but will process all files in a directory in alphabetical order by filename.
The directory can only contain files that can be read by LineSentence: .bz2, .gz, and text files. Any file not ending with .bz2 or .gz is assumed to be a text file. Does not work with subdirectories.
The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace.
source should be a path to a directory (as a string) where all files can be opened by the LineSentence class. Each file will be read up to limit lines (or not clipped if limit is None, the default).
Example:

sentences = PathLineSentences(os.path.join(os.getcwd(), 'corpus'))
The files in the directory should be either text files, .bz2 files, or .gz files.

The following class reads a small test corpus:

class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000)
Bases: object
Iterate over sentences from the “text8” corpus, unzipped from http://mattmahoney.net/dc/text8.zip .
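
For example, assuming the unzipped text8 file has been downloaded to the working directory:

>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models import Word2Vec
>>>
>>> sentences = Text8Corpus('text8')  # path to the unzipped text8 file
>>> model = Word2Vec(sentences, size=100, workers=4)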
class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=())
Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel
Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/
If you’re finished training a model (=no more updates, only querying) then switch to the gensim.models.KeyedVectors instance in wv
The model can be stored/loaded via its save() and load() methods, or stored/loaded in a format compatible with the original word2vec implementation via wv.save_word2vec_format() and Word2VecKeyedVectors.load_word2vec_format().
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
Parameters:
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
sg (int {1, 0}) – Defines the training algorithm. If 1, skip-gram is employed; otherwise, CBOW is used.
size (int) – Dimensionality of the feature vectors.
window (int) – The maximum distance between the current and predicted word within a sentence.
alpha (float) – The initial learning rate.
min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
min_count (int) – Ignores all words with total frequency lower than this.
max_vocab_size (int) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
sample (float) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
workers (int) – Use these many worker threads to train the model (=faster training with multicore machines).
hs (int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
cbow_mean (int {1,0}) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
hashfxn (function) – Hash function to use to randomly initialize weights, for increased training reproducibility.
iter (int) – Number of iterations (epochs) over the corpus.
trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
sorted_vocab (int {1,0}) – If 1, sort the vocabulary by descending frequency before assigning word indexes.
batch_words (int) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
compute_loss (bool) – If True, computes and stores loss value which can be retrieved using model.get_latest_training_loss().
callbacks – List of callbacks that need to be executed/run at specific stages during training.
Examples
Initialize and train a Word2Vec model

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = Word2Vec(sentences, min_count=1)
>>> say_vector = model['say']  # get vector for word
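
The defaults above train a CBOW model with negative sampling. A sketch of overriding some of the parameters described earlier (the values are illustrative, not recommendations): sg=1 selects skip-gram, hs=0 together with negative=10 selects negative sampling with 10 noise words, and iter=5 runs five passes over the corpus.

>>> sg_model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1, hs=0, negative=10, iter=5, workers=4)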

accuracy(**kwargs)
build_vocab(sentences, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)
Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence is an iterable of iterables (can simply be a list of unicode strings too).
Parameters:
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
update (bool) – If true, the new words in sentences will be added to model’s vocab.
progress_per (int) – Indicates how many words to process before showing/updating the progress.
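
A self-contained sketch of incremental vocabulary building with update=True; more_sentences is a hypothetical second batch of training data:

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> more_sentences = [["dog", "say", "woof", "woof"]]  # hypothetical extra data
>>>
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)  # initial vocabulary
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
>>> model.build_vocab(more_sentences, update=True)  # add any new words to the vocabulary
>>> model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)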
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)
Build vocabulary from a dictionary of word frequencies. Build model vocabulary from a passed dictionary that contains (word, word count). Words must be of type unicode strings.
Parameters:
word_freq (dict) – Word,Word_Count dictionary.
keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
update (bool) – If true, the new provided words in word_freq dict will be added to model’s vocab.
Examples

>>> from gensim.models import Word2Vec
>>>
>>> model= Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})

clear_sims()
Removes all L2-normalized vectors for words from the model. You will have to recompute them using init_sims method.

cum_table
delete_temporary_training_data(replace_word_vectors_with_normalized=False)
Discard parameters that are used in training and score. Use if you’re sure you’re done training a model. If replace_word_vectors_with_normalized is set, forget the original vectors and only keep the normalized ones = saves lots of memory!
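
For example, once training is finished (a sketch; after this call the model can still be queried but not trained further):

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1)
>>> model.delete_temporary_training_data(replace_word_vectors_with_normalized=True)
>>> vector = model.wv['cat']  # querying still works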

doesnt_match(**kwargs)
Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)
Estimate required memory for a model using current settings and provided vocabulary size.
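
A sketch; the vocabulary size is illustrative and the method returns a dict of estimated byte counts per internal structure:

>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(size=100)  # no corpus supplied yet
>>> memory_report = model.estimate_memory(vocab_size=100000)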

evaluate_word_pairs(**kwargs)
Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.evaluate_word_pairs

get_latest_training_loss()
hashfxn
init_sims(replace=False)
init_sims() resides in KeyedVectors because it deals with syn0/vectors mainly, but because syn1 is not an attribute of KeyedVectors, it has to be deleted in this class, and the normalizing of syn0/vectors happens inside of KeyedVectors

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')
Merge the input-hidden weight matrix from the original C word2vec-tool format given, where it intersects with the current vocabulary. (No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)

Parameters:
fname (str) – Path to the file in word2vec C format from which to merge vectors.
binary (bool) – If True, the data will be read in binary word2vec format, else it will be read as plain text.
lockf (float) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.
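
A sketch, reusing the hypothetical '/tmp/vectors.bin' file from the earlier KeyedVectors example; lockf=1.0 lets the merged vectors keep updating during subsequent training:

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1)
>>> model.intersect_word2vec_format('/tmp/vectors.bin', lockf=1.0, binary=True)  # hypothetical file
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)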
iter
layer1_size
classmethod load(*args, **kwargs)
Loads a previously saved Word2Vec model. Also see save().

Parameters: fname (str) – Path to the saved file.
Returns:    The loaded model, as an instance of gensim.models.word2vec.Word2Vec.
Return type:    gensim.models.word2vec.Word2Vec
classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)
Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

static log_accuracy()
min_count
most_similar(**kwargs)
Deprecated. Use self.wv.most_similar() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar

most_similar_cosmul(**kwargs)
Deprecated. Use self.wv.most_similar_cosmul() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar_cosmul

n_similarity(**kwargs)
Deprecated. Use self.wv.n_similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.n_similarity

predict_output_word(context_words_list, topn=10)
Report the probability distribution of the center word given the context words as input to the trained model.

Parameters:
context_words_list – List of context words
topn (int) – Return topn words and their probabilities
Returns:
topn length list of tuples of (word, probability)

Return type:
list of (word, probability) tuples
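
For example (a sketch with a toy corpus; the result is a list of (word, probability) tuples, the exact values depend on the random seed, and the method relies on negative sampling, which is enabled by default):

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1)
>>> model.predict_output_word(['say', 'woof'], topn=3)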

reset_from(other_model)
Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

sample
save(*args, **kwargs)
Save the model. This saved model can be loaded again using load(), which supports online training and getting vectors for vocabulary words.

Parameters: fname (str) – Path to the file.
save_word2vec_format(fname, fvocab=None, binary=False)
Deprecated. Use model.wv.save_word2vec_format instead.

score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)
Score the log probability for a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings. This does not change the fitted model in any way (see Word2Vec.train() for that).

We have currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; we’ll run into problems if you ask to score more than this number of sentences but it is inefficient to set the value too high.

See the article by [4] and the gensim demo at [5] for examples of how to use such scores in document classification.

[4] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
[5] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
Parameters:
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
total_sentences (int) – Count of sentences.
chunksize (int) – Chunksize of jobs
queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
report_delay (float) – Seconds to wait before reporting progress.
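
A sketch of scoring, using a model trained with hierarchical softmax as required above (the toy corpus is illustrative):

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> hs_model = Word2Vec(sentences, hs=1, negative=0, min_count=1)  # hierarchical softmax, no negative sampling
>>> scores = hs_model.score([["cat", "say", "meow"]], total_sentences=1)  # log probability per input sentence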
similar_by_vector(**kwargs)
Deprecated. Use self.wv.similar_by_vector() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_vector

similar_by_word(**kwargs)
Deprecated. Use self.wv.similar_by_word() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_word

similarity(**kwargs)
Deprecated. Use self.wv.similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity

syn0_lockf
syn1
syn1neg
train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=())
Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). For Word2Vec, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided (if the corpus is the same as was provided to build_vocab(), the count of examples in that corpus will be available in the model’s corpus_count property).
To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.

Parameters:
sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
total_examples (int) – Count of sentences.
total_words (int) – Count of raw words in sentences.
epochs (int) – Number of iterations (epochs) over the corpus.
start_alpha (float) – Initial learning rate.
end_alpha (float) – Final learning rate. Drops linearly from start_alpha.
word_count (int) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.
queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
report_delay (float) – Seconds to wait before reporting progress.
compute_loss (bool) – If True, computes and stores loss value which can be retrieved using model.get_latest_training_loss().
callbacks – List of callbacks that need to be executed/run at specific stages during training.
Examples

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
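
To monitor the training loss mentioned under compute_loss, a sketch reusing the toy corpus above:

>>> loss_model = Word2Vec(sentences, min_count=1, compute_loss=True)
>>> training_loss = loss_model.get_latest_training_loss()  # cumulative loss from the most recent training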
wmdistance(**kwargs)
Deprecated. Use self.wv.wmdistance() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.wmdistance