问题描述
我正在寻找解决此问题的最节省内存的方法.
I'm looking for the most memory efficient way to solve this problem.
我有一个表示句子中部分字符串匹配的元组列表:
I have a list of tuples representing partial string matches in a sentence:
[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
每个元组的第一个值是匹配的开始位置,第二个值是长度.
The first value of each tuple is the start position for the match, the second value is the length.
这个想法是折叠列表,以便只报告最长的继续字符串匹配.在这种情况下,它将是:
The idea is to collapse the list so that only the longest continue string match is reported. In this case it would be:
[(0,4), (2,6), (22,6)]
我不想要最长的范围,比如找到最长非重叠序列的算法,但我想要所有范围折叠最长.
I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.
如果您想知道,我正在使用 Aho-Corasick 的纯 Python 实现将静态字典中的术语与给定的文本片段进行匹配.
In case your wondering, I am using a pure python implementation of the Aho-Corasick for matching terms in a static dictionary to the given text snippet.
由于这些元组列表的性质,重叠但不是独立的范围应该单独打印出来.例如,字典中有 betaz
和 zeta
两个词,betazeta
的匹配项是 [(0,5),(4,8)]
.由于这些范围重叠,但不包含在另一个范围内,因此答案应该是 [(0,5),(4,8)]
.我还修改了上面的输入数据集,以便涵盖这种情况.
Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz
and zeta
in the dictionary, the matches for betazeta
are [(0,5),(4,8)]
. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]
. I have also modified the input dataset above so that this case is covered.
谢谢!
推荐答案
import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
start, length = lst[i]
for j in xrange(i+1, len(lst)):
lstart, llength = lst[j]
if start >= lstart and start + length <= lstart + llength:
del lst[i]
break
print lst
#[(0, 4), (2, 6), (22, 6)]
这篇关于将范围元组列表折叠到重叠范围中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!