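The task is to group expressions that are made up of multiple words (aka Multi-Word Expressions, MWEs).
Given a dictionary of MWEs, I need to add dashes to the input sentences wherever an MWE is detected, e.g.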
**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .
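Currently, I loop through the sorted dictionary, check whether each MWE appears in the sentence, and replace it when it does. But this wastes a lot of iterations.
Is there a better way to do this? One solution is to produce all possible n-grams first, i.e. chunker2():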
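import re
import time

# Load the MWE dictionary: one expression per line
with open("wn-mwe-en.dic", encoding="utf8") as f:
    mwe_list = set(line.strip() for line in f)

def chunker(sentence):
    # Loop over every dictionary entry, whether or not it occurs in the sentence
    for item in mwe_list:
        # Test both spellings explicitly; a bare `if item or ...` is always true
        if item in sentence or item.replace("-", " ") in sentence:
            mwe_item = '-'.join(item.split(" "))
            # Match the words separated by either a dash or a space
            r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
            sentence = re.sub(r, mwe_item, sentence)
    return sentence

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(len(tokens)):
        # j is an exclusive slice end, so it has to run up to len(tokens)
        for j in range(i + 1, len(tokens) + 1):
            nodes.append(" ".join(tokens[i:j]))
    # Keep only the n-grams made of at least two tokens
    n = set(i for i in nodes if i and len(i.split(" ")) > 1)
    intersect = mwe_list.intersection(n)
    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(" ", "-"))
    return sentence

s = "i have got an ace of diamonds in my wet suit ."

# time.clock() was removed in Python 3.8; time each call separately instead
start = time.perf_counter()
print(chunker(s))
print(time.perf_counter() - start)

start = time.perf_counter()
print(chunker2(s))
print(time.perf_counter() - start)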
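Best answer
I'd try doing it like this:
For each sentence, construct a set of n-grams up to a given length (that of the longest MWE in your list).
Now, just do mwe_nmgrams.intersection(sentence_ngrams)
and search/replace the matches.
That way you don't waste time iterating over every item in your original set.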
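The steps above could be sketched roughly as follows. This is only an illustration: the names `sentence_ngrams`, `chunk_capped`, `toy_mwes` and `max_len` are made up for the example, and the toy set stands in for the real dictionary file.

```python
def sentence_ngrams(tokens, max_len):
    """Return all n-grams of 2..max_len tokens, joined by single spaces."""
    ngrams = set()
    for i in range(len(tokens)):
        # j is an exclusive end index; the span is capped at max_len tokens
        for j in range(i + 2, min(i + max_len, len(tokens)) + 1):
            ngrams.add(" ".join(tokens[i:j]))
    return ngrams


def chunk_capped(sentence, mwes, max_len):
    """Dash-join every MWE from `mwes` that occurs in `sentence`."""
    found = mwes & sentence_ngrams(sentence.split(" "), max_len)
    # Replace longer MWEs first, so a shorter MWE nested inside a
    # longer one does not break the longer match
    for mwe in sorted(found, key=len, reverse=True):
        sentence = sentence.replace(mwe, mwe.replace(" ", "-"))
    return sentence


# Hypothetical toy dictionary (assumption: one space between words)
toy_mwes = {"ace of diamonds", "of diamonds", "wet suit"}
# Longest MWE measured in tokens; compute this once, not per sentence
max_len = max(len(m.split(" ")) for m in toy_mwes)

print(chunk_capped("i have got an ace of diamonds in my wet suit .",
                   toy_mwes, max_len))
```

With the cap in place, the number of n-grams per sentence grows with sentence length times `max_len` rather than with the square of the sentence length.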
Here's a slightly faster version of chunker2:
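def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    # Collect n-grams straight into a set, skipping the list/sort step
    nodes = set()
    for i in range(len_tokens):
        # j is an exclusive slice end, so it has to run up to len_tokens
        for j in range(i + 1, len_tokens + 1):
            chunks = tokens[i:j]
            if len(chunks) > 1:
                nodes.add(' '.join(chunks))
    # Intersect with the n-gram set built above (previously an undefined `n`)
    intersect = mwe_list.intersection(nodes)
    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(' ', '-'))
    return sentence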
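One thing to watch: `sentence.replace` works on raw substrings, so an entry such as "wet suit" would also be rewritten inside "wet suits". A possible token-level variant is sketched below. It is not part of the approach above, just one way it might be adapted; `chunk_tokens`, `toy_mwes` and `max_len` are hypothetical names, and the sketch assumes tokens are separated by single spaces.

```python
def chunk_tokens(sentence, mwes, max_len):
    """Dash-join MWEs, matching whole tokens and preferring the longest match."""
    tokens = sentence.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + n]) in mwes:
                out.append("-".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWE starts at this token: copy it through unchanged
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical toy dictionary standing in for the real MWE list
toy_mwes = {"ace of diamonds", "wet suit"}
max_len = max(len(m.split(" ")) for m in toy_mwes)

print(chunk_tokens("i have got an ace of diamonds in my wet suit .",
                   toy_mwes, max_len))
print(chunk_tokens("two wet suits", toy_mwes, max_len))
```

Each sentence is scanned once from left to right, and every lookup is a set membership test, so no pass over the whole dictionary is needed.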