我可以使用什么算法来确定一组字符串中的公共字符?
为了使示例更简单,我只关心一行中的2+个字符,以及它是否出现在示例中的2个或更多字符中。例如:
0000ABCDe0000
0000ABCD00000
000ABC0000000个
00ABC000DE000型
我想知道:
00用于1,2,3,4
000用于1,2,3,4
0000用于1,2,3
00000用于2,3
ab用于1,2,3,4
abc用于1,2,3,4
abcd用于1,2
bc用于1,2,3,4
BCD用于1,2
CD用于1,2
de用于1,4

最佳答案

我想这不是家庭作业。(如果是的话,你就是你自己的抄袭者!;-)
下面是一个快速而肮脏的解决方案。时间复杂度其中O(m**2 * n)是平均字符串长度,m是字符串数组的大小。
n的实例保留包含给定字符串的索引集。Occurrence例程扫描字符串数组,为每个非空字符串调用commonOccurrences。对于给定的字符串的每个可能的子字符串,captureOccurrences例程将当前索引放入captureOccurrences中。最后,Occurrence通过只选取那些至少有两个索引的commonOccurrences来形成结果集。
请注意,您的示例数据有许多比您在问题中识别的更常见的子字符串例如,Occurrences出现在每个输入字符串中根据内容(例如,所有数字、所有字母等)选择感兴趣字符串的另一个过滤器是——正如他们所说——留给读者作为练习。;-)
快速而肮脏的JAVA源代码:

package com.stackoverflow.answers;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonSubstringFinder {

    public static final int MINIMUM_SUBSTRING_LENGTH = 2;

    public static class Occurrence implements Comparable<Occurrence> {
        private final String value;
        private final Set<Integer> indices;
        public Occurrence(String value) {
            this.value = value == null ? "" : value;
            indices = new TreeSet<Integer>();
        }
        public String getValue() {
            return value;
        }
        public Set<Integer> getIndices() {
            return Collections.unmodifiableSet(indices);
        }
        public void occur(int index) {
            indices.add(index);
        }
        public String toString() {
            StringBuilder result = new StringBuilder();
            result.append('"').append(value).append('"');
            String separator = ": ";
            for (Integer i : indices) {
                result.append(separator).append(i);
                separator = ",";
            }
            return result.toString();
        }
        public int compareTo(Occurrence that) {
            return this.value.compareTo(that.value);
        }
    }

    public static Set<Occurrence> commonOccurrences(String[] strings) {
        Map<String,Occurrence> work = new HashMap<String,Occurrence>();
        if (strings != null) {
            int index = 0;
            for (String string : strings) {
                if (string != null) {
                    captureOccurrences(index, work, string);
                }
                ++index;
            }
        }
        Set<Occurrence> result = new TreeSet<Occurrence>();
        for (Occurrence occurrence : work.values()) {
            if (occurrence.indices.size() > 1) {
                result.add(occurrence);
            }
        }
        return result;
    }

    private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
        final int maxLength = string.length();
        for (int i = 0; i < maxLength; ++i) {
            for (int j = i + MINIMUM_SUBSTRING_LENGTH; j < maxLength; ++j) {
                String partial = string.substring(i, j);
                Occurrence current = work.get(partial);
                if (current == null) {
                    current = new Occurrence(partial);
                    work.put(partial, current);
                }
                current.occur(index);
            }
        }
    }

    private static final String[] TEST_DATA = {
        "0000abcde0000",
        "0000abcd00000",
        "000abc0000000",
        "00abc000de000",
    };
    public static void main(String[] args) {
        Set<Occurrence> found = commonOccurrences(TEST_DATA);
        for (Occurrence occurrence : found) {
            System.out.println(occurrence);
        }
    }

}

示例输出:(注意,实际上每行只有一个匹配项;我似乎无法阻止blockquote标记合并行)
“00”:0,1,2,3
“000”:0,1,2,3
“0000”:0,1,2
“0000A”:0,1
“0000AB”:0,1
“0000ABC”:0,1
“0000ABCD”:0,1
“000A”:0,1,2
“000AB”:0,1,2
“000ABC”:0,1,2
“000ABCD”:0,1
“00a”:0,1,2,3
“00AB”:0,1,2,3
“00ABC”:0,1,2,3
“00ABC0”:2,3
“00abc00”:2,3
“00ABC000”:2,3
“00ABCD”:0,1
“0a”:0,1,2,3
“0ab”:0,1,2,3
“0ABC”:0,1,2,3
“0ABC0”:2,3
“0ABC00”:2,3
“0ABC000”:2,3
“0ABCD”:0,1
“AB”:0,1,2,3
“ABC”:0,1,2,3
“ABC0”:2,3
“abc00”:2,3
“ABC000”:2,3
“abcd”:0,1
“BC”:0,1,2,3
“BC0”:2,3
“BC00”:2,3
“bc000”:2,3
“bcd”:0,1
“c0”:2,3
“C00”:2,3
“C000”:2,3
“cd”:0,1
“de”:0,3
“de0”:0,3
“DE00”:0,3
“e0”:0,3
“e00”:0,3

09-07 00:03