问题描述
给定以下内容,我可以找到最长的公共子串:
Given the following, i can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
print longest_common_substring(s1, s2)
[输出]:
foo bar
但是我如何确保最长的公共子串尊重英文单词边界并且不切割单词?例如下面的句子:
But how do i ensure that the longest common substring respect English word boundary and don't cut up a word? For example, the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
输出不需要的后续内容,因为它从 s2 中分解了单词 kappa
:
outputs the follow which is NOT desired since it breaks up the word kappa
from s2:
a foo bar
所需的输出仍然是:
foo bar
我也尝试过一种 ngram 方法来获取最长公共子串,但是否有其他方法可以在不计算 ngrams 的情况下处理字符串?(见答案)
I've tried also an ngram way of getting the longest common substring respecting word boundary but is there other way that deals with strings without calculating ngrams? (see answer)
推荐答案
这太简单了,看不懂.我使用您的代码完成了 75% 的工作.我首先将句子拆分为单词,然后将其传递给您的函数以获得最大的公共子字符串(在这种情况下它将是最长的连续单词),因此您的函数给了我 ['foo', 'bar'],我加入该数组的元素以产生所需的结果.
This is too simple to understand. I used your code to do 75% of the job.I first split the sentence into words, then pass it to your function to get the largest common substring(in this case it will be longest consecutive words), so your function gives me ['foo', 'bar'], I join the elements of that array to produce the desired result.
这里是在线工作副本,供您测试、验证和修改.
Here is the online working copy for you to test and verify and fiddle with it.
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
def longest_common_sentence(s1, s2):
s1_words = s1.split(' ')
s2_words = s2.split(' ')
return ' '.join(longest_common_substring(s1_words, s2_words))
s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'
边缘情况
'.'和 '?'如果最后一个单词和标点符号之间有空格,则在您的情况下也被视为有效单词.如果您不留下空格,它们将被视为最后一句话的一部分.在那种情况下,羊"和羊"?不再是相同的词了.在调用此类函数之前,由您决定如何处理此类字符.那样的话
'.' and '?' are also treated as valid words as in your case if there is a space between last word and the punctuation mark. If you don't leave a space they will be counted as part of last word. In that case 'sheep' and 'sheep?' would not be same words anymore. Its up to you decide what to do with such characters, before calling such function. In that case
重新导入
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)
然后像往常一样继续.
这篇关于最长公共子串不切字-python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!