用于查找具有高交点的集合的最快算法

本文介绍了用于查找具有高交点的集合的最快算法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有大量的用户ID（整数），可能有数百万。这些用户都属于各种组（整数集），因此有1000万个组的数量。

为了简化我的例子，并获得我们假设所有的组都包含20个用户ID。

我想找到所有成对的整数集，其交点为15或更大。

我应该比较每一套吗？（如果我保留一个数据结构来映射userID以设置成员身份，那么这不是必要的。）最快的方法是什么？也就是说，我的底层数据结构应该用于表示整数集？排序集，未排序---可以哈希有什么帮助？我应该使用什么算法来计算集合交集）？我喜欢有关C / C ++（特别是STL）的答案，但也欢迎任何更通用的算法洞察。

更新
另外，请注意，我将在共享内存环境中并行运行，所以优先考虑将并行扩展到并行解决方案。另外，请注意，绝大多数的set对都将具有0的交集大小，这意味着使用映射的数据结构可能是有利的用户ID设置以避免计算每对集合的交集。

解决方案

我将完全按照您的建议：地图用户到他们的团体。也就是说，我将保留每个用户的组ID列表。然后我将使用以下算法：

  foreach组：
 map = new Map< Group，int> //映射组计数
组中的foreach用户
 user.groups中的foreach userGroup：
 map [userGroup] ++ 
 if（map [userGroup] == 15& ;& userGroup.id> group.id）
 largeIntersection（group，userGroup）

假设您有 G 每个包含 U 用户的用户，并且考虑到这些用户属于 g 组，那么这将在 O（G * U * g）中运行。鉴于您的问题，可能比在 O（G * G * U）中运行的组的初始成对比较快得多。

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersection)? I prefer answers that relate C/C++ (especially STL), but also any more general, algorithmic insights are welcome.

UpdateAlso, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data-structure that mapped user IDs to sets to avoid calculating the intersection of every pair of sets.

解决方案

I would do exactly what you propose: map users to their group. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).

这篇关于用于查找具有高交点的集合的最快算法的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！