Problem description
I am going to use nltk.tokenize.word_tokenize on a cluster where my account is limited by a strict space quota. At home, I downloaded all the nltk resources via nltk.download(), but, as I found out, they take up ~2.5 GB.
This seems a bit overkill to me. Could you suggest the minimal (or nearly minimal) dependencies for nltk.tokenize.word_tokenize? So far I've seen nltk.download('punkt'), but I am not sure whether it is sufficient or how large it is. What exactly should I run to make it work?
Recommended answer
You are right. You need the Punkt Tokenizer Models. They take about 13 MB, and nltk.download('punkt') should do the trick.
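
A minimal sketch of what to run, assuming a standard NLTK installation (the sample sentence is illustrative):

    import nltk

    # Fetch only the Punkt tokenizer models (~13 MB) instead of the full
    # ~2.5 GB resource set; this is all word_tokenize needs.
    nltk.download('punkt')

    from nltk.tokenize import word_tokenize

    print(word_tokenize("This sentence only needs the punkt models."))
    # ['This', 'sentence', 'only', 'needs', 'the', 'punkt', 'models', '.']

On a quota-limited account you can also pass download_dir to nltk.download to control where the data lands, and make that location discoverable via the NLTK_DATA environment variable or by appending the directory to nltk.data.path.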