问题描述
我设计了一种算法来查找最长的公共子序列.这些是步骤:
I've designed an algorithm to find the longest common subsequence. these are steps:
-
选择第一个字符串中的第一个字母.
Pick the first letter in the first string.
在第二个字符串中查找它,如果找到它,则将该字母添加到common_subsequence
并将其位置存储在index
中,否则比较common_subsequence
的长度和lcs
的长度如果值更大,则将其值赋给lcs
.
Look for it in the second string and if its found, Add that letter tocommon_subsequence
and store its position in index
, Otherwisecompare the length of common_subsequence
with the length of lcs
and if its greater, asign its value to lcs
.
返回第一个字符串并选择下一个字母,然后重复再次执行上一步,但是这次从第index
个字母
Return to the first string and pick the next letter and repeat theprevious step again, But this time start searching from index
th letter
重复此过程,直到第一个字符串中没有字母为止挑选.最后,lcs
的值是最长公共子序列.
Repeat this process until there is no letter in the first string topick. At the end the value of lcs
is the Longest CommonSubsequence.
这是一个示例:
This is an example:
X=A, B, C, B, D, A, B
Y=B, D, C, A, B, A
在第一个字符串中选择A
.
在Y
中查找A
.
现在第二个字符串中有一个A
,将其附加到common_subsequence
.
返回第一个字符串并选择下一个字母B
.这次从A
的位置开始在第二个字符串中查找B
.
在A
之后有一个B
,因此将B附加到common_subsequence
.
现在,在第一个字符串C
中选择下一个字母.第二个字符串中的B
旁边没有C
.因此,将common_subsequence的值分配给lcs
,因为它的长度大于lcs
的长度.重复前面的步骤,直到到达第一个字符串的末尾.最后,lcs
的值是最长公共子序列.
Pick A
in the first string.
Look for A
in Y
.
Now that there is an A
in the second string, append it to common_subsequence
.
Return to the first string and pick the next letter that is B
.Look for B
in the second string this time starting from the position of A
.
There is a B
after A
so append B to common_subsequence
.
Now pick the next letter in the first string that is C
. There isn't a C
next to B
in the second string. So assign the value of common_subsequence to lcs
because its length is greater than the length of lcs
.repeat the previous steps until reaching the end of the first string. In the end the value of lcs
is the Longest Common Subsequence.
此算法的复杂度为theta(n * m).这是我的实现:
The complexity of this algorithm is theta(n*m).Here is my implementations:
第一个算法:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
使用哈希表的相同算法:
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
推荐答案
如果您的教授希望您发明自己的LCS算法,那么您就完成了.您的算法并不是有史以来最优化的算法,但是它属于正确的复杂性类,您可以清楚地理解它,并且您显然没有从互联网复制实现.您可能需要准备捍卫自己的算法,或讨论替代方法.如果我是您的教授,则在以下情况下给您A:
If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:
- 您打开了该程序.
- 您能够解释为什么没有可能的O(N)或O(N log M)替代方案.
- 您能够参加关于其他算法的合理讨论,这些算法可能具有更好的 lower 界限(或常数,等等),以及时间/空间折衷等,甚至如果您事先不知道讨论的结果.
- You turned in that program.
- You were able to explain why there's no possible O(N) or O(N log M) alternative.
- You were able to participate in a reasonable discussion about other algorithms that might have a better lower bound (or significantly lower constants, etc.), and the time/space tradeoffs, etc., even if you didn't know the outcome of that discussion in advance.
另一方面,如果您的教授希望您选择一种众所周知的算法并编写自己的实现,则您可能希望使用标准的LP算法.由于某种原因,这是一种标准算法-您可能需要继续阅读,直到了解为止. (即使不进行测试,您也要上这堂课来学习,而不仅仅是打动教授,对吧?)
On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard LP algorithm. It's a standard algorithm for a reason—which you probably want to read up on until you understand. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)
维基百科具有用于基本实现的伪代码,然后是常用优化的英语描述.我非常确定,基于该页面上的内容编写自己的Python代码不会被视为窃,甚至不会被视为琐碎的端口,特别是如果您可以证明自己了解代码在做什么,为什么,为什么以及为什么的话这是一个很好的算法.另外,您正在用Python编写该代码,该代码具有比该文章中演示的更好的方式来进行记忆,因此,如果您了解它的工作原理,那么您的代码实际上应该比Wikipedia所提供的更好.
Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.
无论哪种方式,正如我在评论中建议的那样,我都会阅读对最长的通用子序列算法的调查,作者是Bergroth,Hakonen和Raita,并在网上搜索类似的论文.
Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.
这篇关于这是可以接受的算法吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!