python - Indices that intersect and sort two numpy arrays

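I have two numpy arrays of integers, each several hundred million elements long. Within each array the values are unique, and each array is initially unsorted.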

I would like the indices into each array that yield their sorted intersection. For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

The sorted intersection of these is [4, 5, 11], so we want the indices that turn each of x and y into that array, i.e. the result should be:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

since then x[mx] == y[my] == np.intersect1d(x, y).

The only solution we have so far involves three different argsorts, so it seems unlikely to be optimal.
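To make the target concrete, here is a small check of the property stated above on the example arrays. Note that a literal chained `==` on arrays raises an "ambiguous truth value" error in Python, so this sketch compares pairwise with `np.array_equal` instead; the names `common`, `ok_x` and `ok_y` are just illustrative.

```python
import numpy as np

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

# The index arrays we would like to compute
mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

# np.intersect1d returns the sorted common values
common = np.intersect1d(x, y)

# Compare each indexed array against the sorted intersection
ok_x = np.array_equal(x[mx], common)
ok_y = np.array_equal(y[my], common)
print(common, ok_x, ok_y)
```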

Each value represents a galaxy, in case that makes the problem more fun.

Best answer

Here is an option based on the implementation of intersect1d, which is fairly straightforward. It requires a single call to argsort.

The admittedly simplistic tests below pass.

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)
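
The comment in my_intersect points out that with a stable sort the interlacing step would not be needed. Below is a minimal sketch of that variant, assuming `kind="stable"` is acceptable for argsort (it may be slower or use more memory than the default quicksort, which I have not measured). The name `stable_intersect` is hypothetical, not part of numpy.

```python
import numpy as np


def stable_intersect(x, y):
    """Hypothetical stable-sort variant of my_intersect.

    x, y: 1-d arrays of unique values
    returns xm, ym: indices into x and y giving the sorted intersection
    """
    aux = np.concatenate((x, y))
    sidx = aux.argsort(kind="stable")
    # Equal neighbours in sorted order are values present in both arrays
    dup = aux[sidx[1:]] == aux[sidx[:-1]]
    # Stability keeps the x copy (lower position in aux) first in each pair
    xm = sidx[:-1][dup]
    ym = sidx[1:][dup] - len(x)
    return xm, ym


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
xm, ym = stable_intersect(x, y)
print(xm, ym)
```

Because each value is unique within its array, an equal pair in the sorted order always consists of one element from x and one from y, which is what makes the direct split possible.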

Edit: the "Note" comment was confusing (I had used ... as a speech ellipsis, forgetting that it is also a Python operator).
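
For comparison, newer numpy releases can return these indices directly from np.intersect1d. The sketch below assumes numpy 1.15 or later, where the `return_indices` parameter is available, and passes `assume_unique=True` because the question states that values are unique within each array. I have not benchmarked it against my_intersect on arrays of the size mentioned in the question.

```python
import numpy as np

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

# common: sorted intersection; mx, my: positions of those values in x and y
common, mx, my = np.intersect1d(
    x, y, assume_unique=True, return_indices=True
)
print(common, mx, my)
```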