This post covers how vocabulary keys are handled in Python Gensim word2vec; it should be a useful reference for anyone hitting the same problem.

Problem description

I want to build word2vec with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus,keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')

pprint.pprint(model.most_similar(u'너'))

Above is my source code. It seems to work well. However, there is a problem with the vocabulary keys. I want to build a Korean word2vec that uses unicode. For example, the word 사과, which means apology in English, has the unicode \xC0AC\xACFC. If I try to look up 사과 in word2vec, a KeyError occurs...
Instead of \xc0ac\xacfc, \xc0ac and \xacfc are stored separately. What is the reason and how can I solve it?

Recommended answer

Word2Vec requires text examples that are broken into word-tokens. It appears you are simply providing strings to Word2Vec, so when it iterates over them, it will only see single characters as words.

Does Korean use spaces to delimit words? If so, split your texts on spaces before handing each list-of-words to Word2Vec as a text example.

If not, you'll need to use an external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.
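That pipeline can be sketched with a pluggable tokenizer. The `tokenize` stand-in below just splits on whitespace for the demo; in a real setup for unsegmented Korean text it would be replaced by a call to an external morphological analyzer (KoNLPy's taggers are one commonly used option):

```python
# -*- coding: utf-8 -*-
def tokenize(line):
    # Stand-in tokenizer: replace this body with a call to a real
    # morphological analyzer when words are not space-delimited.
    return line.split()

raw_lines = [u'나는 사과 를 먹었다', u'사과 주스']  # hypothetical sample data
sentences = [tokenize(line) for line in raw_lines]

# 'sentences' now has the shape Word2Vec expects, so the original
# training code could consume it directly, e.g.:
#   model = Word2Vec(size=128, window=5, min_count=1, workers=4)
#   model.build_vocab(sentences)
```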

