Problem Description
I want to make a word2vec model with gensim. I heard that the vocabulary corpus should be Unicode, so I converted it to Unicode.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 hack: make UTF-8 the default encoding

from gensim.models import Word2Vec
import pprint

# Read the corpus: one sentence per line
with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus, keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')

pprint.pprint(model.most_similar(u'너'))
Above is my source code. It seems to work well. However, there is a problem with the vocabulary keys. I want to make a Korean word2vec model that uses Unicode. For example, the word 사과, which means apology in English, has the Unicode code points U+C0AC U+ACFC. But if I try to look up 사과 in the word2vec model, a KeyError occurs. Instead of the whole word 사과 (U+C0AC U+ACFC), the single characters 사 (U+C0AC) and 과 (U+ACFC) are stored separately. What is the reason, and how do I solve it?
Recommended Answer
Word2Vec requires text examples that are already broken into word-tokens. It appears you are simply providing whole strings to Word2Vec, so when it iterates over each "sentence", it sees only single characters as words.
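You can see the effect in isolation (a quick sketch using the example word from the question; the second phrase is just an illustrative example):

# Iterating over a unicode string yields single characters --
# this is why u'\uc0ac' and u'\uacfc' end up as separate vocabulary keys
print list(u'사과')        # [u'\uc0ac', u'\uacfc'] -- characters, not words
print u'나는 사과'.split()  # [u'\ub098\ub294', u'\uc0ac\uacfc'] -- whole words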
Does Korean use spaces to delimit words? If so, break your texts by spaces before handing the list-of-words as a text example to Word2Vec.
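Applied to the question's code, that means turning each line into a list of tokens instead of passing raw strings. A minimal sketch (the filename and model parameters are taken from the question; decoding each line explicitly also removes the need for the setdefaultencoding hack):

# -*- encoding: utf-8 -*-
from gensim.models import Word2Vec

# Each sentence must be a list of word-tokens, not a raw string
with open('parsed_data.txt', 'r') as f:
    corpus = [line.decode('utf-8').split() for line in f if line.strip()]

# Passing the sentences to the constructor builds the vocab and trains
model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
model.save('w2v')
print model.most_similar(u'사과')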
If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.
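For Korean, where whitespace alone does not always separate morphemes, one commonly used external tokenizer is KoNLPy. This is only a hedged illustration, assuming the konlpy package and its Okt tagger are installed; any morphological analyzer that returns token lists would work the same way:

# -*- encoding: utf-8 -*-
from konlpy.tag import Okt  # external Korean tokenizer: pip install konlpy

okt = Okt()
with open('parsed_data.txt', 'r') as f:
    # morphs() splits each line into a list of Korean morphemes
    corpus = [okt.morphs(line.decode('utf-8')) for line in f if line.strip()]
# corpus is now a list of token lists, ready to pass to Word2Vec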