This article looks at how to pick unordered combinations from pools with overlap, and may be a useful reference if you're working on the same kind of problem.

Problem description

I have pools of values and I would like to generate every possible unordered combination by picking from certain pools.

For example, I wanted to pick from pool 0, pool 0, and pool 1:

>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4), (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4), (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

However order doesn't matter to me, so many of the combinations are actually duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.
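
As a quick illustration (mine, not part of the original question), sorting each tuple makes the order-only duplicates identical, so a naive fix is to canonicalize every product tuple and collect the results in a set:

>>> tuple(sorted((1, 2, 4))) == tuple(sorted((2, 1, 4)))
True
>>> len({tuple(sorted(t)) for t in product(*(pools[i] for i in part))})
15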

I came up with a simple method to mitigate this issue. For members picked from a single pool, I select without ordering using combinations_with_replacement. I count how many times I want to draw from each pool. The code looks like this:

from collections import Counter
from itertools import chain, combinations_with_replacement, product

def pick(pools, part):  # wrapped in a function here so the snippet runs standalone
    cnt = Counter(part)  # how many times to draw from each pool index
    blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
    return [list(chain(*combo)) for combo in product(*blocks)]

This reduces ordering duplicates if I happen to choose from the same pool multiple times. However, all the pools have lots of overlap, and using combinations_with_replacement on several pools merged together would generate some invalid combinations. Is there a more efficient method to generate unordered combinations?

Edit: Extra info about the inputs: the number of parts and pools is small (~5 and ~20), and for simplicity each element is an integer. I've already solved the actual problem, so this is just for academic interest. Let's say there are thousands of integers in each pool, but some pools are small and only have dozens. So some kind of union or intersection seems to be the way to go.

Recommended answer

This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs) # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t

I don't fear sorting here - it's memory-efficient, and for small tuples is likely faster than all the overheads involved in creating a Counter object.

But regardless of that, the point here is to emphasize the real value the OP got by reformulating the problem to use combinations_with_replacement (cwr). Consider these inputs:

N = 64
pools = [[0, 1]]
ixs = [0] * N

There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical

pools = [[0, 1]] * N
ixs = range(N)

also has the same 65 unique results, but essentially runs forever (as would, e.g., the other answers so far given), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.

So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of that can be won by doing what the OP already did.

It seems to me the most promising approach would be to write a custom generator (not one relying primarily on itertools functions) that generated all possibilities in lexicographic order to begin with (so, by construction, no duplicates would be created to begin with). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time to wrestle with now.

cwr 与@ user2357112的增量重复数据删除功能结合在一起,可以提供一种简短的算法,该算法可以在我拥有的所有测试用例。例如,上面的 [0,1] ** 64 例子的两个拼写基本上都是瞬时的,并在@Joseph Wood的答案结尾处以大约相同的速度运行该例子正如他说的那样,他的C ++代码运行了(在Python 3.7.0下,我的计算机上运行了0.35秒,是的,找到了162295个结果):

Combining cwr with @user2357112's incremental de-duplicating gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @Joseph Wood's answer approximately as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, found 162295 results):

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result
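
As a quick sanity check (mine, not part of the answer), both spellings of the [0, 1] ** 64 example really do give 65 results, and both are handled instantly by this version:

N = 64
assert len(gen([[0, 1]], [0] * N)) == 65         # one pool, drawn from 64 times
assert len(gen([[0, 1]] * N, range(N))) == 65    # 64 identical pools, one draw each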

To make it easier for other Pythonistas to try the last example, here's the input as executable Python:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5,  7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))

However, the OP later added that they typically have about 20 pools, each with some thousands of elements. 1000**20 = 1e60 is waaaaay out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It remains clear as mud how many they expect to be duplicates, though, so also clear as mud whether this kind of "incremental de-duplicating" is good enough to be practical.

Ideally we'd have a generator that yielded one result at a time, in lexicographic order.

Building on the incremental de-duplication, suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each, and sort each again. Then the derived sequence is still in strictly increasing order. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and sort again) 2 to each:

00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233

They started in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples, t1 < t2, and let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (what "lexicographic <" means). Then if you throw x into both tuples, proceed by cases: is x <= t1[i]? between t1[i] and t2[i]? is x >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less than the second derived tuple.)
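
A small check (mine, not from the answer) of that property on the range(4) example above:

from itertools import combinations_with_replacement as cwr

pairs = list(cwr(range(4), 2))                      # the 10 pairs, already increasing
triples = [tuple(sorted(p + (2,))) for p in pairs]  # append 2 to each and re-sort
assert triples == sorted(triples)                   # order is preserved
assert len(set(triples)) == len(triples)            # and still strictly increasing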

So supposing we have a sorted sequence result of all unique sorted tuples from some number of pools, what happens when we add elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also sorted, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed already-sorted sequences can be merged via heapq.merge(), and another generator (killdups() below) run on the merge result to weed out duplicates on the fly. There's no need to, e.g., keep a set of all tuples seen so far. Because the output of the merge is non-decreasing, it's sufficient just to check whether the next result is the same as the last result output.

Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item after all new pool elements have finished with it.

The function build1() (or some workalike) is needed to ensure that the right values are accessed at the right times. For example, if the body of build1() is pasted in inline where it's called, the code will fail spectacularly (the body would access the final values bound to rcopy and new instead of what they were bound to at the time the generator expression was created).
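
The pitfall is Python's late binding of free variables in generator expressions. A minimal illustration (mine, with a hypothetical make_gen() standing in for the role build1() plays below):

# Both generator expressions look up `factor` lazily, so by the time they are
# iterated they both see the last value the comprehension bound to it.
gens = [(x * factor for x in range(3)) for factor in (10, 100)]
print([list(g) for g in gens])   # [[0, 100, 200], [0, 100, 200]]

# Routing the loop variables through a helper freezes them per call.
def make_gen(rng, factor):
    return (x * factor for x in rng)

gens = [make_gen(range(3), factor) for factor in (10, 100)]
print([list(g) for g in gens])   # [[0, 10, 20], [0, 100, 200]]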

In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).

Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result
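
A small usage sketch (mine): the results stream lazily and arrive in lexicographic order, so a prefix can be pulled off cheaply with itertools.islice:

from itertools import islice

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(list(islice(gen(pools, (0, 0, 1)), 5)))
# [(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3)]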




2 inputs

Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Else the intersection c of x and y is non-empty, and the answer is the catenation of 4 cross-products obtained from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c). Proof is straightforward but tedious so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.

Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.

Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.
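
A quick check of that count (mine, not from the answer):

from itertools import product

ordered = list(product([1, 2], [2, 3], [1, 3]))
unordered = {tuple(sorted(t)) for t in ordered}
print(len(ordered), len(unordered))   # 8 7 -- only (1, 2, 3) shows up twice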

Here's an implementation:

def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement
    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)
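
One way (mine) to sanity-check pair() is to compare it against brute-force de-duplication of the full product:

from itertools import product

def brute(x, y):
    return {tuple(sorted(t)) for t in product(x, y)}

x, y = [1, 2, 3], [2, 3, 4]
assert {tuple(sorted(t)) for t in pair(x, y)} == brute(x, y)
assert len(list(pair(x, y))) == len(brute(x, y))   # and no duplicates are emitted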




A Proof-of-Concept Based on Maximal Matching

This blends ideas from @Leon's sketch and an approach @JosephWoods sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!

This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it wasn't immediately clear how to accomplish that efficiently. I did have code sitting around to find a match in a bipartite graph, which required only a few changes to use in this context.

So this tries plausible result prefixes in lexicographic order, as in @JosephWood's sketch, and for each sees whether it's actually possible to construct via checking whether a bipartite-graph match exists.

So while the details of @Leon's sketch are largely unimplemented here, the visible behaviors are much the same: it produces results in lexicographic order, it doesn't need to check for duplicates, it's a lazy generator, peak memory use is proportional to the sum of the lengths of the pools, it can obviously be parallelized in many ways (set different processes to work on different regions of the result space), and the key to making it majorly faster lies in reducing the massive amounts of redundant work done by the graph-matching function (a great deal of what it does on each call merely reproduces what it did on the previous call).

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import chain, repeat, islice

    elt2pools = defaultdict(set)
    npools = 0
    for i, count in Counter(ixs).items():
        set_indices = set(range(npools, npools + count))
        for elt in pools[i]:
            elt2pools[elt] |= set_indices
        npools += count
    elt2count = {elt : len(ps) for elt, ps in elt2pools.items()}

    cands = sorted(elt2pools.keys())
    ncands = len(cands)

    result = [None] * npools

    # Is it possible to match result[:n] + [elt]*count?
    # We already know it's possible to match result[:n], but
    # this code doesn't exploit that.
    def match(n, elt, count):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y in y2x:
                        if extend(y2x[y], seen):
                            y2x[y] = x
                            return True
                    else:
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in chain(islice(result, n), repeat(elt, count)):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)
        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return False
        return True

    def inner(i, j): # fill result[j:] with elts from cands[i:]
        if j >= npools:
            yield tuple(result)
            return
        for i in range(i, ncands):
            elt = cands[i]
            # Find the most times `elt` can be added.
            count = min(elt2count[elt], npools - j)
            while count:
                if match(j, elt, count):
                    break
                count -= 1
            # Since it can be added `count` times, it can also
            # be added any number of times less than `count`.
            for k in range(count):
                result[j + k] = elt
            while count:
                yield from inner(i + 1, j + count)
                count -= 1

    return inner(0, 0)
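
A tiny usage sketch (mine) on the small example from the question; the results come out in lexicographic order with no separate de-duplication pass:

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
for t in matchgen(pools, (0, 0, 1)):
    print(t)   # (1, 1, 2), (1, 1, 3), (1, 1, 4), ..., (3, 3, 4)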

Edit: note that there's a potential trap here, illustrated by the pair of pools range(10_000) and range(100_000). After producing (9999, 99999), the first position increments to 10000, and then it continues for a very long time deducing that there's no match for any of the possibilities in 10001 .. 99999 in the second position; and then for 10001 in the first position no match for any of the possibilities in 10002 .. 99999 in the second position; and so on. @Leon's scheme instead would have noted that, having picked 10000 for the first position, range(10_000) was the only free pool remaining, and noted at once that range(10_000) doesn't contain any values greater than 10000. It would apparently need to do that again for 10001, 10002, ..., 99999 in the first position. That's a linear-time rather than quadratic-time waste of cycles, but a waste all the same. Moral of the story: don't trust anything until you have actual code to try ;-)

Following is a more-or-less faithful implementation of @Leon's ideas. I like the code better than my "proof of concept" (POC) code just above, but was surprised to find that the new code runs significantly slower (a factor of 3 to 4 times slower on a variety of cases akin to @JosephWood's randomized example) relative to a comparably "optimized" variant of the POC code.

The primary reason appears to be more calls to the matching function. The POC code called that once per "plausible" prefix. The new code doesn't generate any impossible prefixes, but for each prefix it generates, it may need to make multiple match() calls to determine the possibly smaller set of free pools remaining. Perhaps there's a cleverer way to do that.

Note that I added one twist: if a free pool's elements are all smaller than the last element of the prefix so far, it remains "a free pool" with respect to the prefix, but it's useless because none of its elements can appear in the candidates. This doesn't matter to the outcome, but it means the pool remains in the set of free pools for all remaining recursive calls, which in turn means they can waste time determining that it's still a "free pool". So when a free pool can no longer be used for anything, this version removes it from the set of free pools. This gave a significant speedup.

Note: there are many ways to try matching, some of which have better theoretical O() worst-case behavior. In my experience, simple depth-first (as here) search runs faster in real life on typical cases. But it depends very much on characteristics of what "typical" graphs look like in the application at hand. I haven't tried other ways here.

Bottom lines, ignoring the "2 inputs" special-case code:


  • Nothing here beats incremental de-duplication for speed, if you have the RAM. But nothing is worse than that for peak memory burden.

  • Nothing beats the matching-based approaches for frugal memory burden. They're in an entirely different universe on that measure. They're also the slowest, although at least in the same universe ;-)

The code:

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(list)
    allpools = []
    npools = 0
    for i, count in Counter(ixs).items():
        indices = list(range(npools, npools + count))
        plist = sorted(pools[i])
        for elt in plist:
            elt2pools[elt].extend(indices)
        for i in range(count):
            allpools.append(plist)
        npools += count
    pools = allpools
    assert npools == len(pools)

    result = [None] * npools

    # Is it possible to match result[:n] not using pool
    # bady?  If not, return None.  Else return a matching,
    # a dict whose keys are pool indices and whose values
    # are a permutation of result[:n].
    def match(n, bady):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x and y != bady:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        for x in freexs:
            if not extend(x, {bady}):
                return None
        return y2x

    def inner(j, freepools): # fill result[j:]
        from bisect import bisect_left
        if j >= npools:
            yield tuple(result)
            return
        if j:
            new_freepools = set()
            allcands = set()
            exhausted = set()  # free pools with elts too small
            atleast = result[j-1]
            for pi in freepools:
                if pi not in new_freepools:
                    m = match(j, pi)
                    if not m:  # match must use pi
                        continue
                    # Since `m` is a match to result[:j],
                    # any pool in freepools it does _not_
                    # use must still be free.
                    new_freepools |= freepools - m.keys()
                    assert pi in new_freepools
                # pi is free with respect to result[:j].
                pool = pools[pi]
                if pool[-1] < atleast:
                    exhausted.add(pi)
                else:
                    i = bisect_left(pool, atleast)
                    allcands.update(pool[i:])
            if exhausted:
                freepools -= exhausted
                new_freepools -= exhausted
        else: # j == 0
            new_freepools = freepools
            allcands = elt2pools.keys()

        for result[j] in sorted(allcands):
            yield from inner(j + 1, new_freepools)

    return inner(0, set(range(npools)))

Note: this has its own classes of "bad cases". For example, passing 128 copies of [0, 1] consumes about 2 minutes(!) of time on my box to find the 129 results. The POC code takes under a second, while some of the non-matching approaches appear instantaneous.

I won't go into detail about why. Suffice it to say that because all the pools are the same, they all remain "free pools" no matter how deep the recursion goes. match() never has a hard time, always finding a complete match for the prefix in its first (greedy) pass, but even that takes time proportional to the square of the current prefix length (== the current recursion depth).

One more here. As noted before, the matching-based approaches suffer some from the expense of graph matching as a fundamental operation repeated so often, and have some unfortunate bad cases pretty easy to stumble into.

Highly similar pools cause the set of free pools to shrink slowly (or not at all). But in that case the pools are so similar that it rarely matters which pool an element is taken from. So the approach below doesn't try to keep exact track of free pools, picks arbitrary pools so long as such are obviously available, and resorts to graph-matching only when it gets stuck. That seems to work well. As an extreme example, the 129 results from 128 [0, 1] pools are delivered in less than a tenth of second instead of in two minutes. It turns out it never needs to do graph-matching in that case.

Another problem with the POC code (and less so for the other match-based approach) was the possibility of spinning wheels for a long time after the last result was delivered. A pragmatic hack solves that one completely ;-) The last tuple of the sequence is easily computed in advance, and code raises an internal exception to end everything immediately after the last tuple is delivered.

That's it for me! A generalization of the "two inputs" case would remain very interesting to me, but all the itches I got from the other approaches have been scratched now.

def combogen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(set)
    npools = 0
    cands = []
    MAXTUPLE = []
    for i, count in Counter(ixs).items():
        indices = set(range(npools, npools + count))
        huge = None
        for elt in pools[i]:
            elt2pools[elt] |= indices
            for i in range(count):
                cands.append(elt)
            if huge is None or elt > huge:
                huge = elt
        MAXTUPLE.extend([huge] * count)
        npools += count
    MAXTUPLE = tuple(sorted(MAXTUPLE))
    cands.sort()
    ncands = len(cands)
    ALLPOOLS = set(range(npools))
    availpools = ALLPOOLS.copy()
    result = [None] * npools

    class Finished(Exception):
        pass

    # Is it possible to match result[:n]? If not, return None.  Else
    # return a matching, a dict whose keys are pool indices and
    # whose values are a permutation of result[:n].
    def match(n):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return None
        return y2x

    def inner(i, j):  # fill result[j:] with cands[i:]
        nonlocal availpools
        if j >= npools:
            r = tuple(result)
            yield r
            if r == MAXTUPLE:
                raise Finished
            return
        restore_availpools = None
        last = None
        jp1 = j + 1
        for i in range(i, ncands):
            elt = cands[i]
            if elt == last:
                continue
            last = result[j] = elt
            pools = elt2pools[elt] & availpools
            if pools:
                pool = pools.pop() # pick one - arbitrary
                availpools.remove(pool)
            else:
                # Find _a_ matching, and if that's possible fiddle
                # availpools to pretend that's the one we used all
                # along.
                m = match(jp1)
                if not m: # the prefix can't be extended with elt
                    continue
                if restore_availpools is None:
                    restore_availpools = availpools.copy()
                availpools = ALLPOOLS - m.keys()
                # Find a pool from which elt was taken.
                for pool, v in m.items():
                    if v == elt:
                        break
                else:
                    assert False
            yield from inner(i+1, jp1)
            availpools.add(pool)

        if restore_availpools is not None:
            availpools = restore_availpools

    try:
        yield from inner(0, 0)
    except Finished:
        pass
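
One way (mine) to exercise combogen() is the executable example given earlier; per the timing note above, it should deliver 162295 unique results:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5, 7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
print(sum(1 for _ in combogen(pools, range(len(pools)))))   # expected: 162295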

That concludes this article on picking unordered combinations from pools with overlap; hopefully the recommended answer above is helpful.
