Question
I'm trying to learn and understand Doc2Vec by following a tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:
input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
documents = TaggedLineDocument(input)
model = doc2vec.Doc2Vec(documents,size = 50, window = 10, min_count = 2, workers=2)
But I am getting what looks like a unicode error (I tried googling it, with no luck):
TypeError('don\'t know how to handle uri %s' % repr(uri))
Can somebody please help me understand where I am going wrong? Thank you!
Recommended answer
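TaggedLineDocument should be instantiated with a file path, not with a Python list. Passing a list is what triggers the TypeError above, because the argument is treated as a URI to open. Make sure the file is set up so that one line equals one document.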
documents = TaggedLineDocument('myfile.txt')
documents = TaggedLineDocument('compressed_text.txt.gz')
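If the documents already exist in memory as lists of words, as in the question, one way to get them into the expected shape is to write them to a text file first. The following is a minimal sketch, not the library's prescribed method: the helper names `write_line_corpus` and `load_tagged_corpus`, the sample data, and the file name `corpus.txt` are all hypothetical, and it assumes no token contains whitespace or a newline.

```python
import os
import tempfile


def write_line_corpus(token_lists, path):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in token_lists:
            handle.write(" ".join(tokens) + "\n")
    return path


def load_tagged_corpus(path):
    """Wrap the file in TaggedLineDocument (requires gensim to be installed)."""
    # Imported lazily so the file-writing helper above works without gensim.
    from gensim.models.doc2vec import TaggedLineDocument
    return TaggedLineDocument(path)


# Hypothetical sample data standing in for the list of word lists in the question.
sample_docs = [
    ["human", "machine", "interface"],
    ["graph", "of", "trees"],
]

# Write the corpus into a fresh temporary directory.
corpus_path = write_line_corpus(
    sample_docs, os.path.join(tempfile.mkdtemp(), "corpus.txt")
)
```

With gensim installed, `documents = load_tagged_corpus(corpus_path)` could then be passed to `Doc2Vec` in place of the list. Alternatively, in-memory word lists can be wrapped one by one in gensim's `TaggedDocument(words, tags)` and handed to `Doc2Vec` directly, which avoids the intermediate file. Note also that newer gensim releases (4.0 and later) name the dimensionality parameter `vector_size` rather than `size`.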
From the source code:
The uri (the thing you are instantiating TaggedLineDocument with) can be either:
1. a URI for the local filesystem (compressed ``.gz`` or ``.bz2`` files handled automatically):
`./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
`s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
4. an instance of the boto.s3.key.Key class.