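The task is to group expressions that consist of multiple words (aka Multi-Word Expressions, MWEs).
Given a dictionary of MWEs, I need to add dashes to the input sentences wherever an MWE is detected, e.g.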

**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .

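Currently, I loop through the sorted dictionary, check whether each MWE appears in the sentence, and replace it whenever it does. But there are a lot of wasted iterations.
Is there a better way of doing it? One solution is to produce all possible n-grams first, i.e. chunker2():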
import codecs, re, time

# Load the MWE dictionary, one entry per line, into a set for fast lookups.
mwe_list = set(line.strip() for line in
               codecs.open("wn-mwe-en.dic", "r", "utf8"))

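def chunker(sentence):
  for item in mwe_list:
    # Skip entries that occur in neither spelling (dashes or spaces).
    if item in sentence or item.replace("-", " ") in sentence:
      mwe_item = '-'.join(item.split(" "))
      # Match the MWE whether its words are joined by dashes or by spaces.
      r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
      sentence = re.sub(r, mwe_item, sentence)
  return sentence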

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(0, len(tokens)):
        # Slice ends are exclusive: start at i + 2 for n-grams of two or
        # more tokens and run to len(tokens) so the last token is included.
        for j in range(i + 2, len(tokens) + 1):
            nodes.append(" ".join(tokens[i:j]))
    n = set(nodes)

    # Keep only the n-grams that are also dictionary entries.
    intersect = mwe_list.intersection(n)

    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(" ", "-"))

    return sentence

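s = "i have got an ace of diamonds in my wet suit ."

# Time each function separately by taking the difference from its own start.
start = time.time()
print(chunker(s))
print(time.time() - start)

start = time.time()
print(chunker2(s))
print(time.time() - start)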
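
For comparing the two approaches, repeated runs of each call on its own tend to give more usable numbers than a single clock reading. Below is a minimal sketch using the standard-library `timeit` module, written for Python 3. The two-entry `mwe_set` and the functions `loop_over_dictionary` and `intersect_ngrams` are hypothetical, simplified stand-ins for the real dictionary, `chunker` and `chunker2`, not the originals.

```python
import timeit

# Toy stand-in for the dictionary loaded from wn-mwe-en.dic (assumption).
mwe_set = {"ace of diamonds", "wet suit"}
s = "i have got an ace of diamonds in my wet suit ."


def loop_over_dictionary(sentence):
    # Simplified chunker(): test every dictionary entry against the sentence.
    for mwe in mwe_set:
        if mwe in sentence:
            sentence = sentence.replace(mwe, mwe.replace(" ", "-"))
    return sentence


def intersect_ngrams(sentence):
    # Simplified chunker2(): build the sentence n-grams, then intersect.
    tokens = sentence.split(" ")
    ngrams = {" ".join(tokens[i:j])
              for i in range(len(tokens))
              for j in range(i + 2, len(tokens) + 1)}
    for mwe in mwe_set & ngrams:
        sentence = sentence.replace(mwe, mwe.replace(" ", "-"))
    return sentence


for func in (loop_over_dictionary, intersect_ngrams):
    # timeit accepts a zero-argument callable; number= sets the repetitions.
    seconds = timeit.timeit(lambda: func(s), number=1000)
    print(func.__name__, seconds)
```

With a two-entry toy set the dictionary loop will probably look cheap; the comparison only becomes meaningful once `mwe_set` holds the full dictionary.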

Best answer

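I would try it like this:
For each sentence, build the set of n-grams up to a given length (the length of the longest MWE in your list).
Now, just do mwe_nmgrams.intersection(sentence_ngrams) and search/replace the matches.
That way you do not waste time iterating over every item in your original set.
Here is a slightly faster version of chunker2: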

def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    nodes = set()

    # Collect every n-gram of two or more tokens. Slice ends are exclusive,
    # so j runs up to len_tokens to include the last token.
    for i in range(len_tokens):
        for j in range(i + 2, len_tokens + 1):
            nodes.add(' '.join(tokens[i:j]))

    # Only n-grams that are also dictionary entries need replacing.
    intersect = mwe_list.intersection(nodes)

    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(' ', '-'))

    return sentence
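
The length cap described above (n-grams no longer than the longest MWE) can be sketched as follows, in Python 3. This is only an illustration under stated assumptions: the three-entry `mwe_set` stands in for the real dictionary, entries are assumed to be space-separated, and the names `max_mwe_len` and `chunker_bounded` are made up for this example. Replacing longer matches first is an extra choice made here so that an MWE nested inside a longer one does not win.

```python
# Toy stand-in for the real dictionary (assumption: space-separated entries).
mwe_set = {"ace of diamonds", "wet suit", "ace of"}

# Longest MWE, measured in tokens; longer n-grams can never match.
max_mwe_len = max(len(mwe.split(" ")) for mwe in mwe_set)


def chunker_bounded(sentence, mwes=mwe_set, max_len=max_mwe_len):
    tokens = sentence.split(" ")
    ngrams = set()
    for i in range(len(tokens)):
        # Window sizes 2..max_len, clipped at the end of the sentence.
        for j in range(i + 2, min(i + max_len, len(tokens)) + 1):
            ngrams.add(" ".join(tokens[i:j]))

    # Replace longer matches first so "ace of diamonds" wins over "ace of".
    for mwe in sorted(mwes & ngrams, key=len, reverse=True):
        sentence = sentence.replace(mwe, mwe.replace(" ", "-"))
    return sentence


print(chunker_bounded("i have got an ace of diamonds in my wet suit ."))
# i have got an ace-of-diamonds in my wet-suit .
```

The number of n-grams per sentence then grows with sentence length times `max_mwe_len` rather than with the square of the sentence length. Like the versions above, it still relies on `str.replace`, which works on substrings rather than token boundaries.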
