


I am working on two large data sets, and my question is as follows.


list1 = [A,B,C,D]

list2 = [B,D,A,G]

How can I efficiently find the matching indices, using Python, other than by O(n^2) searching? The result should look like:

matching_index(list1,list2) -> [(0,2),(1,0),(3,1)]




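def find_matching_index(list1, list2):

    # Map each element of list1 to its index (elements must be hashable and unique)
    inverse_index = { element: index for index, element in enumerate(list1) }

    # Emit (index in list1, index in list2), matching the order asked for in the question
    return [(inverse_index[element], index)
        for index, element in enumerate(list2) if element in inverse_index]

find_matching_index([1,2,3], [3,2,1]) # [(2, 0), (1, 1), (0, 2)]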



With duplicates

You can extend the previous solution to account for duplicates. You can keep track of multiple indices with a set.

def find_matching_index(list1, list2):

    # Create an inverse index whose values are now sets of indices
    inverse_index = {}

    for index, element in enumerate(list1):

        if element not in inverse_index:
            inverse_index[element] = {index}

        else:
            inverse_index[element].add(index)


    # Traverse the second list
    matching_index = []

    for index, element in enumerate(list2):

        # Create one pair for each index stored in the set of the inverse index
        if element in inverse_index:
            matching_index.extend([(x, index) for x in inverse_index[element]])

    return matching_index

find_matching_index([1, 1, 2], [2, 2, 1]) # [(2, 0), (2, 1), (0, 2), (1, 2)]

Unfortunately, this is no longer O(n). Consider the inputs [1, 1] and [1, 1]: the output is [(0, 0), (0, 1), (1, 0), (1, 1)]. Since the output itself can contain n^2 pairs, the worst case cannot be better than O(n^2).

This solution is still O(n) if there are no duplicates, though.
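
To make the output-size argument concrete, here is a minimal sketch that counts how many pairs a matching would contain without building them. The helper name count_matching_pairs is hypothetical and only serves this illustration; it assumes hashable elements.

```python
from collections import Counter

def count_matching_pairs(list1, list2):
    # Hypothetical helper: number of (i, j) pairs with list1[i] == list2[j].
    # Each value contributes (occurrences in list1) * (occurrences in list2).
    counts1 = Counter(list1)
    counts2 = Counter(list2)
    return sum(counts1[value] * counts2[value] for value in counts1)

# With n copies of the same value in both lists, the count is n * n.
for n in (1, 2, 10, 100):
    print(n, count_matching_pairs([1] * n, [1] * n))
```

When every element is distinct, the count is at most n, which is why the linear bound still holds in that case.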


Now comes the case where your objects are not hashable, but comparable. The idea here is to sort your lists in a way that preserves the original index of each element. Then we can group runs of equal elements to get the matching indices.
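
As a small illustration of the sorting and grouping step on a single list, consider the sketch below. The variable names items, pairs and grouped are only for this example; it assumes the elements can be compared with each other.

```python
from itertools import groupby

items = [[], [1, 2], []]

# Pair each element with its original index, then sort by element
# (ties between equal elements are broken by the index).
pairs = sorted((element, index) for index, element in enumerate(items))

# Group runs of equal elements and keep only their original indices.
grouped = [(element, [index for _, index in group])
           for element, group in groupby(pairs, key=lambda pair: pair[0])]

print(grouped)  # [([], [0, 2]), ([1, 2], [1])]
```

Each group now lists every position at which a given element occurred, even though lists cannot be used as dictionary keys.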


Since we make heavy use of groupby and product in the following code, I made find_matching_index return a generator for memory efficiency on long lists.

from itertools import groupby, product

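def find_matching_index(list1, list2):
    # Sort (element, index) pairs so equal elements become adjacent
    # while each one keeps its original index
    sorted_list1 = sorted((element, index) for index, element in enumerate(list1))
    sorted_list2 = sorted((element, index) for index, element in enumerate(list2))

    list1_groups = groupby(sorted_list1, key=lambda pair: pair[0])
    list2_groups = groupby(sorted_list2, key=lambda pair: pair[0])

    # Current group of list2, kept across iterations so that a group
    # which is ahead of element1 is not skipped; group2 is None once
    # list2 is exhausted
    element2, group2 = next(list2_groups, (None, None))

    for element1, group1 in list1_groups:
        # Advance list2 while its current element is behind element1
        while group2 is not None and element1 > element2:
            element2, group2 = next(list2_groups, (None, None))

        if group2 is None:
            break

        if element2 > element1:
            # element1 does not occur in list2
            continue

        indices_product = product((i for _, i in group1), (i for _, i in group2))

        yield from indices_product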

        # In version prior to 3.3, the above line must be
        # for x in indices_product:
        #     yield x

list1 = [[], [1, 2], []]
list2 = [[1, 2], []]

list(find_matching_index(list1, list2)) # [(0, 1), (2, 1), (1, 0)]

It turns out that time complexity does not suffer that much. Sorting of course takes O(n log(n)), but then groupby provides generators that can recover all elements by traversing our lists only twice. The conclusion is that our complexity is primarily bound by the size of the output of product. This gives a best case where the algorithm is O(n log(n)) and a worst case that is once again O(n^2).

