Question
I'm using spaCy (version 2.0.11) for lemmatization in the first step of my NLP pipeline, but unfortunately it's taking a very long time. It is clearly the slowest part of my processing pipeline, and I want to know whether there are improvements I could be making. I am invoking the pipeline as:
nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner'])
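For context, `docs_generator` here needs to be a lazy iterable of strings so the whole corpus never sits in memory. A minimal sketch of such a generator in plain Python (the helper name `docs_from_file` is illustrative, not part of spaCy's API) might look like:

```python
import io

def docs_from_file(handle):
    """Yield one document per line, lazily, so the full 2 GB corpus
    never has to be loaded at once (hypothetical helper, not spaCy API)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line

# Usage with an in-memory stand-in for the corpus file:
corpus = io.StringIO("first short text\n\nsecond short text\n")
docs = list(docs_from_file(corpus))
```

In a real run you would pass an open file handle instead of the `StringIO` stand-in and feed the generator straight into `nlp.pipe`.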
on an 8-core machine, and I have verified that the machine is using all of the cores.
On a corpus of about 3 million short texts totaling almost 2 GB, it takes close to 24 hours to lemmatize and write to disk. Is that reasonable?
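As a sanity check on those figures, 3 million documents in 24 hours works out to roughly 35 documents per second across the whole machine:

```python
# Rough throughput implied by the numbers above
n_docs = 3_000_000
seconds = 24 * 3600          # 24 hours in seconds
docs_per_sec = n_docs / seconds
print(round(docs_per_sec, 1))  # ≈ 34.7 documents per second
```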
I have tried disabling a couple of parts of the processing pipeline (the parser and the tagger) and found that doing so broke the lemmatization.
Besides named entity recognition, are there other parts of the default processing pipeline that are not needed for lemmatization?
Is there any other way to speed up the spaCy lemmatization process?
Aside:
It also appears that the documentation doesn't list all the operations in the processing pipeline. At the top of spaCy's Language class we have:
factories = {
    'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
    'tensorizer': lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
    'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
    'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
    'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
    'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
    'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
    'sbd': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'sentencizer': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'merge_noun_chunks': lambda nlp, **cfg: merge_noun_chunks,
    'merge_entities': lambda nlp, **cfg: merge_entities
}
which includes some items not covered in the pipeline documentation here: https://spacy.io/usage/processing-pipelines
Since they are not covered there, I don't really know which ones may be disabled, nor what their dependencies are.
Answer
I found out that you can disable the parser portion of the spaCy pipeline as well, as long as you add the sentence segmenter. It's not blazingly fast, but it is definitely an improvement: in tests, the run time looks to be about one third of what it was before (when I was only disabling 'ner'). Here is what I have now:
nlp = spacy.load('en', disable=['ner', 'parser'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))