java - PairWise匹配数百万条记录

我手头有一个算法问题。为了轻松解释问题，我将使用一个简单的类比。
我有一个输入文件

Country,Exports
Austrailia,Sheep
US, Apple
Austrialia,Beef

最终目标:
我必须找到两个国家之间的共同产品

{"Austrailia,New Zealand"}:{"apple","sheep}
{"Austrialia,US"}:{"apple"}
{"New Zealand","US"}:{"apple","milk"}

工艺流程:

我读了输入并将其存储在TreeMap> List中，由于重复很多，所以字符串被插入。
本质上，我按国家汇总。
其中键是国家/地区，值是其出口。

{"austrailia":{"apple","sheep","koalas"}}
{"new zealand":{"apple","sheep","milk"}}
{"US":{"apple","beef","milk"}}

我有大约1200个键(国家)，并且值(出口)的总数总计为8000万。
我对每个键的所有值进行排序:

{"austrailia":{"apple","sheep","koalas"}} -- > {"austrailia":{"apple","koalas","sheep"}}

这是快速的，因为只有1200个列表可以排序。

for(k1:keys)
   for(k2:keys)
        if(k1.compareTo(k2) <0){ //Dont want to double compare
    List<String> intersectList = intersectList_func(k1's exports,k2's exports);
        countriespair.put({k1,k2},intersectList)
}

这个代码块花了很长时间。我意识到它是O(n2)，并且大约是1200 * 1200比较。因此，到目前为止运行了将近3个小时。
有什么办法，我可以加快或优化它。
明智的算法是最好的选择，或者还有其他技术需要考虑。

编辑:
由于两个List都是预先排序的，所以intersectList是O(n)，其中n是地板的长度(listOne.length，listTwo.length)，而不是O(n2)，如下所述

private static List<String> intersectList(List<String> listOne,List<String> listTwo){
        int i=0,j=0;
        List<String> listResult = new LinkedList<String>();
        while(i!=listOne.size() && j!=listTwo.size()){
            int compareVal = listOne.get(i).compareTo(listTwo.get(j));
            if(compareVal==0){
                listResult.add(listOne.get(i));
                i++;j++;}               }
            else if(compareVal < 0) i++;
            else if (compareVal >0) j++;
        }
        return listResult;
    }

11月22日更新
我当前的实现仍在运行将近18个小时。 :|

更新11月25日
我已经按照Vikram和其他一些人的建议运行了新的实现。这个星期五一直在运行。
我的问题是，按出口分组而不是按国家分组如何节省计算复杂性。我发现复杂度是相同的。正如Groo所述，我发现第二部分的复杂度为O(E * C ^ 2)，其中E是出口，C是国家。

最佳答案

这可以使用SQL在一条语句中作为自联接完成:

测试数据。首先创建一个测试数据集:

Lines <- "Country,Exports
Austrailia,Sheep
Austrailia,Apple
New Zealand,Apple
New Zealand,Sheep
New Zealand,Milk
US,Apple
US,Milk
"
DF <- read.csv(text = Lines, as.is = TRUE)

sqldf 现在我们有了DF发出以下命令:

library(sqldf)
sqldf("select a.Country, b.Country, group_concat(Exports) Exports
   from DF a, DF b using (Exports)
   where a.Country < b.Country
   group by a.Country, b.Country
")

给出以下输出:

      Country     Country     Exports
1  Austrailia New Zealand Sheep,Apple
2  Austrailia          US       Apple
3 New Zealand          US  Apple,Milk

带有索引的如果索引太慢，请在“国家/地区”列中添加索引(并确保不要忘记main.部分:
sqldf(c("create index idx on DF(Country)", "select a.Country, b.Country, group_concat(Exports) Exports from main.DF a, main.DF b using (Exports) where a.Country < b.Country group by a.Country, b.Country "))

如果内存用完，则添加dbname = tempfile() sqldf参数，以便它使用磁盘。