python - Longest common substring without cutting a word

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  # m[x][y] = length of the common substring ending at s1[x - 1] and s2[y - 1]
  m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print(longest_common_substring(s1, s2))

[out]:

foo bar

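However, how can I make sure that the longest common substring respects English word boundaries and does not cut a word in half? For example, the following sentences: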

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_substring(s1, s2))

output the following, which is NOT desired, because it cuts the trailing "a" off the word kappa in s2:

a foo bar

The desired output is still:

foo bar

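I have also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way to handle the strings without computing ngrams? (see answer)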
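
For comparison, an ngram approach of the kind mentioned above might look like the sketch below. This is an assumption for illustration only: the ngram code actually tried is not shown in this post, and the names `word_ngrams` and `longest_common_ngram` are hypothetical. It builds word n-grams for both sentences, starting from the largest possible n, and returns the first size at which the two sets overlap.

```python
def word_ngrams(words, n):
    # All runs of n consecutive words, as hashable tuples.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def longest_common_ngram(s1, s2):
    w1, w2 = s1.split(), s2.split()
    # Try the longest possible n first and shrink until something is shared.
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = word_ngrams(w1, n) & word_ngrams(w2, n)
        if common:
            # min() only makes the choice deterministic when several n-grams tie.
            return " ".join(min(common))
    return ""

result = longest_common_ngram("this is a foo bar sentence .",
                              "what a kappa foo bar black sheep ?")
print(result)  # foo bar
```

This respects word boundaries but rebuilds the n-gram sets for every candidate length, which is the extra work the question hopes to avoid.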

Best answer

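This is simple enough to follow. I used your code to do 75% of the job.
I first split each sentence into words and then pass the word lists to your function to get the longest common sub-sequence of consecutive items (in this case, the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.
Here is an online working copy for you to test, verify, and fiddle with.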
http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  # Works on any sequence: strings compare characters, lists compare whole words
  m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print(common_sentence)
>> foo bar
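
The reason the function can be reused unchanged is that it only relies on len(), indexing, == and slicing, which strings and lists both support. The following Python 3 sketch makes that explicit by running one function at both granularities; the name `longest_common_run` is illustrative and not part of the code above.

```python
def longest_common_run(a, b):
    """Longest contiguous run shared by two sequences (str, list, tuple)."""
    # m[x][y] = length of the common run ending at a[x - 1] and b[y - 1]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return a[x_longest - longest:x_longest]

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"

chars = longest_common_run(s1, s2)                  # compares characters
words = longest_common_run(s1.split(), s2.split())  # compares whole words

print(repr(chars))  # 'a foo bar '  (cuts into "kappa")
print(words)        # ['foo', 'bar']
```

Because a list slice is returned for list input, the caller still has to join the words back into a string, as longest_common_sentence does.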

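Edge cases
'.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark, as in your examples. If you don't leave a space, they are counted as part of the last word. In that case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide what to do with such characters before calling this function. For example, to strip them: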
import re
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)
Then continue as usual.
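
Putting the edge-case handling together, one possible Python 3 sketch is shown below. It assumes that punctuation should simply be ignored and that a word is a run of letters, digits, underscores or apostrophes; the names `tokenize` and `longest_common_phrase` are illustrative and not part of the code above. Using re.findall instead of split(' ') also avoids the empty tokens that repeated or trailing spaces would otherwise produce.

```python
import re

def tokenize(sentence):
    # Assumption: anything that is not a word character or apostrophe
    # (spaces, '.', '?', ...) acts as a separator and is discarded.
    return re.findall(r"[\w']+", sentence)

def longest_common_phrase(s1, s2):
    a, b = tokenize(s1), tokenize(s2)
    # m[x][y] = length of the common word run ending at a[x - 1] and b[y - 1]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return " ".join(a[x_longest - longest:x_longest])

# 'sheep?' and 'sheep.' now both tokenize to 'sheep'
result = longest_common_phrase("a foo bar sheep?", "what a kappa foo bar sheep.")
print(result)  # foo bar sheep
```

Whether to drop punctuation, keep it as separate tokens, or lowercase the words first depends on what should count as "the same word" in your data.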