Problem Description
I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.
It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector.
That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec would represent them as a single vector. So far so good.
The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion using:
...
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    # each `text` is a list of lowercased tokens with punctuation stripped
    article = " ".join(text) + "\n"
    output.write(article)
...
This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line in the resulting output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, since, I guess, it relies on punctuation and capitalization for NER (?).
Does anyone know whether I can disable gensim's preprocessing, so that it doesn't remove punctuation etc. but still parses the Wikipedia articles to plain text directly from the compressed Wikipedia dump? Or does anyone know a better way to do this? Thanks in advance!
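A minimal sketch of one way to do this, assuming a newer gensim release whose WikiCorpus exposes the lower and tokenizer_func arguments (older versions do not); the dump file name is a placeholder:

from gensim.corpora import WikiCorpus

def raw_tokenizer(text, token_min_len, token_max_len, lower):
    # split on whitespace only, so punctuation, numbers and
    # capitalization survive for spaCy's NER
    return text.split()

wiki_bz2_file = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical path
wiki = WikiCorpus(wiki_bz2_file, dictionary={},
                  tokenizer_func=raw_tokenizer, lower=False)

with open("wiki_plain.txt", "w", encoding="utf-8") as output:
    for text in wiki.get_texts():
        output.write(" ".join(text) + "\n")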
Recommended Answer
You can use a gensim word2vec pretrained model in spaCy, but the problem here is the order of your processing pipeline:
- You pass the texts to gensim
- gensim parses and tokenizes the strings
- You normalize the tokens
- You pass the tokens back to spaCy
- You create a w2v corpus (with spaCy) (?)
That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU
What you probably want to do instead is (a code sketch follows the list):
- You pass the texts to spaCy
- spaCy parses them with NER
- spaCy tokenizes them accordingly, keeping entities as one token
- You load the gensim w2v model with spacy.load()
- You use the loaded model to create the w2v corpus in spaCy
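A minimal sketch of the spaCy side (the first three steps), assuming any English pipeline with an NER component; the retokenizer merges each recognized entity into a single token, and the underscore-joining mirrors the "north_carolina" example above:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any English model with NER

def entities_as_single_tokens(line):
    doc = nlp(line)
    # merge each recognized entity span into one token
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    # join multi-word entities with underscores, e.g. "North_Carolina"
    return [tok.text.replace(" ", "_") for tok in doc]

print(entities_as_single_tokens("I moved to North Carolina last year."))
# -> ['I', 'moved', 'to', 'North_Carolina', 'last', 'year', '.']

spaCy also ships a built-in merge_entities pipeline component that performs the same merge, if you prefer adding it via nlp.add_pipe.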
All you need to do is download the model from gensim and tell spaCy to look for it from the command line:
- wget [model URL]
- python -m spacy init-model [options] [the file you just downloaded]
Here is the command-line documentation for init-model: https://spacy.io/api/cli#init-model
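For example (the URL, file name and output directory below are placeholders; this is the spaCy v2.x command, which became "spacy init vectors" in v3):

wget https://example.com/word2vec.txt.gz
python -m spacy init-model en ./spacy_w2v_model --vectors-loc word2vec.txt.gz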
Then load it just like en_core_web_md. You can use .txt, .zip or .tgz models.
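A quick check that the merged-entity vectors are available, using the placeholder output directory from the init-model sketch above:

import spacy

nlp = spacy.load("./spacy_w2v_model")  # directory created by init-model
# one vector per merged entity token, e.g. "north_carolina"
print(nlp.vocab["north_carolina"].vector.shape)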