问题描述
我需要在字符串中找到最长的序列,但需要注意的是,该序列必须重复三次或更多次.因此,例如,如果我的字符串是:
I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:
fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld
然后我希望返回值helloworld".
我知道有几种方法可以做到这一点,但我面临的问题是实际的字符串大得离谱,所以我真的在寻找一种可以及时完成的方法.
I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.
推荐答案
这个问题是的变体最长重复子串问题,并且有一个 O(n) 时间算法来解决它,它使用 后缀树.这个想法(如维基百科所建议的)是构建一个后缀树(时间 O(n)),用后代的数量(时间 O(n) 使用 DFS)注释树中的所有节点,然后找到树中至少有三个后代的最深节点(使用 DFS 的时间为 O(n)).这个整体算法需要时间 O(n).
This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
也就是说,众所周知,后缀树很难构建,因此在尝试此实现之前,您可能希望找到一个为您实现后缀树的 Python 库.一个快速的谷歌搜索出现了这个库,虽然我不确定这是否是一个很好的实现.
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
另一种选择是将后缀数组与LCP 数组.您可以迭代 LCP 数组中的相邻元素对,取每对中的最小值,并存储您以这种方式找到的最大数字.这将对应于重复至少 3 次的最长字符串的长度,然后您可以从那里读出字符串本身.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
有几种简单的算法可用于构建后缀数组(Manber-Myers 算法在 O(n log n) 时间内运行并且不太难编码),而 Kasai 的算法在 O(n) 时间内构建 LCP 数组并且编码起来相当简单.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
希望这有帮助!
这篇关于找出字符串中最长的重复序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!