Question
First I tokenize the file content into sentences and then call Stanford NER on each of the sentences. But this process is really slow. I know it would be faster if I called it on the whole file content, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
               'stanford-ner/stanford-ner.jar')

for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)   # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)          # tokenize sentence into words
        ne_tags = st.tag(words)              # get tagged NEs from Stanford NER
This is probably due to calling st.tag() for each sentence, but is there any way to make it run faster?
EDIT
The reason I want to tag sentences separately is that I want to write the sentences to a file (like sentence indexing), so that given the NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also doing lemmatizing here).
File format:
Answer
StanfordNERTagger has a tag_sents() function; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68
>>> st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
...                        'stanford-ner/stanford-ner.jar')
>>> # tag_sents expects a flat list of tokenized sentences, so batch per file:
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
>>> st.tag_sents(tokenized_sents)
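Batching does not prevent sentence indexing: since tag_sents() returns one tagged list per input sentence in the same order, you can zip the results back to the original sentences. Below is a minimal sketch of that index-preserving pattern; `fake_tag_sents` is a stand-in for the real st.tag_sents (an assumption, used here only because the real tagger needs the Stanford jar and model files on disk), and `str.split` stands in for word_tokenize.

```python
# Sketch: one batched tag_sents-style call, with an index mapping each
# tagged sentence back to its unprocessed source sentence.

def fake_tag_sents(tokenized_sents):
    # Stand-in for StanfordNERTagger.tag_sents: tags every token 'O',
    # i.e. behaves like an NER tagger that finds no entities.
    return [[(tok, 'O') for tok in sent] for sent in tokenized_sents]

def tag_with_index(sentences, tag_sents=fake_tag_sents):
    """Tag all sentences in ONE batched call and return
    (index, original_sentence, tagged_tokens) triples."""
    tokenized = [sent.split() for sent in sentences]  # word_tokenize stand-in
    tagged = tag_sents(tokenized)                     # single call for all sentences
    # tag_sents preserves order, so zip restores the sentence -> tags mapping
    return [(i, sent, tags)
            for i, (sent, tags) in enumerate(zip(sentences, tagged))]

rows = tag_with_index(["Barack Obama visited Paris .", "He met reporters ."])
for i, original, tags in rows:
    print(i, original, tags)
```

With the real tagger you would pass st.tag_sents (and word_tokenize) instead of the stand-ins; the (index, original sentence) pairs can then be written to the index file alongside the tagged output.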