Problem description
How do I get the document vectors of two text documents using Doc2Vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.
I am using gensim.
doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)
I get an error whenever I run this.
Recommended answer
If you want to train a Doc2Vec model, each item in your data set needs to contain a list of words (similar to the Word2Vec format) and a list of tags (document IDs). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data (you can add more data preprocessing steps)
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model (set min_count = 1 if you want the model to work with the provided example data set)
# Note: gensim 4.x uses vector_size and model.dv; older releases used size and model.docvecs
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 300, min_count = 1, workers = 4)
# Get the vectors
model.dv[0]
model.dv[1]
UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.