Problem Description
numpy.random.choice allows for weighted selection from a vector, i.e.
arr = numpy.array([1, 2, 3])
weights = numpy.array([0.2, 0.5, 0.3])
choice = numpy.random.choice(arr, p=weights)
selects 1 with probability 0.2, 2 with probability 0.5, and 3 with probability 0.3.
What if we wanted to do this quickly in a vectorized fashion for a 2D array (matrix) in which each column is a vector of probabilities? That is, we want a vector of choices from a stochastic matrix. This is the super slow way:
import numpy as np
m = 10
n = 100 # Or some very large number
items = np.arange(m)
prob_weights = np.random.rand(m, n)
prob_matrix = prob_weights / prob_weights.sum(axis=0, keepdims=True)
choices = np.zeros((n,))
# This is slow, because of the loop in Python
for i in range(n):
    choices[i] = np.random.choice(items, p=prob_matrix[:, i])
print(choices)
array([ 4., 7., 8., 1., 0., 4., 3., 7., 1., 5., 7., 5., 3.,
1., 9., 1., 1., 5., 9., 8., 2., 3., 2., 6., 4., 3.,
8., 4., 1., 1., 4., 0., 1., 8., 5., 3., 9., 9., 6.,
5., 4., 8., 4., 2., 4., 0., 3., 1., 2., 5., 9., 3.,
9., 9., 7., 9., 3., 9., 4., 8., 8., 7., 6., 4., 6.,
7., 9., 5., 0., 6., 1., 3., 3., 2., 4., 7., 0., 6.,
3., 5., 8., 0., 8., 3., 4., 5., 2., 2., 1., 1., 9.,
9., 4., 3., 3., 2., 8., 0., 6., 1.])
This post suggests that cumsum and bisect could be a potential approach, and is fast. But while numpy.cumsum(arr, axis=1) can do this along one axis of a numpy array, the bisect.bisect function only works on a single array at a time. Similarly, numpy.searchsorted only works on 1D arrays as well.
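For a single probability vector, that cumsum + bisect idea looks like this (a minimal sketch of the 1D approach, not code from the question):

import bisect
import numpy as np

weights = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(weights)    # array([0.2, 0.7, 1.0])
r = np.random.rand()        # uniform draw in [0, 1)
i = bisect.bisect(cdf, r)   # first index whose cdf entry exceeds r
# i is 0, 1, or 2 with probability 0.2, 0.5, or 0.3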
Is there a quick way to do this using only vectorized operations?
Recommended Answer
Here's a fully vectorized version that's pretty fast:
def vectorized(prob_matrix, items):
    # cdf of each column
    s = prob_matrix.cumsum(axis=0)
    # one uniform draw per column
    r = np.random.rand(prob_matrix.shape[1])
    # counting the cdf entries below each draw gives the sampled index
    k = (s < r).sum(axis=0)
    return items[k]
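To see why the comparison-and-sum works, consider one column (my own illustrative example, not from the answer): with weights [0.2, 0.5, 0.3] the cdf is [0.2, 0.7, 1.0], and a draw of r = 0.6 has exactly one cdf entry below it, so index 1 (the item with weight 0.5) is selected:

import numpy as np

s = np.cumsum([0.2, 0.5, 0.3])  # array([0.2, 0.7, 1.0])
r = 0.6
k = (s < r).sum()               # 1 -> the item with weight 0.5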
In theory, searchsorted is the right function to use for looking up the random value in the cumulatively summed probabilities, but with m being relatively small, k = (s < r).sum(axis=0) ends up being much faster. Its time complexity is O(m), while the searchsorted method is O(log(m)), but that will only matter for much larger m. Also, cumsum is O(m), so both vectorized and @perimosocordiae's improved are O(m). (If your m is, in fact, much larger, you'll have to run some tests to see how large m can be before this method is slower.)
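For much larger m, the lookup itself can also be vectorized in a single flat searchsorted call. The following is my own sketch, not part of the original answer: it offsets column j's cdf by j so the flattened array is globally sorted, letting one call serve all n columns.

import numpy as np

def vectorized_searchsorted(prob_matrix, items):
    # Hypothetical O(log m)-per-column variant; assumes each column of
    # prob_matrix sums to 1.
    m, n = prob_matrix.shape
    s = prob_matrix.cumsum(axis=0)
    s[-1, :] = 1.0  # pin each column's last cdf entry to exactly 1
    r = np.random.rand(n)
    # shift column j's cdf into (j, j+1] so the flattened array is
    # globally sorted, then look up all n draws at once
    flat = (s + np.arange(n)).T.ravel()
    k = np.searchsorted(flat, r + np.arange(n)) - np.arange(n) * m
    # clip guards measure-zero edge cases (e.g. r exactly 0)
    return items[np.clip(k, 0, m - 1)]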
Here's the timing I get with m = 10 and n = 10000 (using the functions original and improved from @perimosocordiae's answer):
In [115]: %timeit original(prob_matrix, items)
1 loops, best of 3: 270 ms per loop
In [116]: %timeit improved(prob_matrix, items)
10 loops, best of 3: 24.9 ms per loop
In [117]: %timeit vectorized(prob_matrix, items)
1000 loops, best of 3: 1 ms per loop
The full script where the functions are defined is:
import numpy as np

def improved(prob_matrix, items):
    # transpose here for better data locality later
    cdf = np.cumsum(prob_matrix.T, axis=1)
    # random numbers are expensive, so we'll get all of them at once
    ridx = np.random.random(size=n)
    # the one loop we can't avoid, made as simple as possible
    idx = np.zeros(n, dtype=int)
    for i, r in enumerate(ridx):
        idx[i] = np.searchsorted(cdf[i], r)
    # fancy indexing all at once is faster than indexing in a loop
    return items[idx]

def original(prob_matrix, items):
    choices = np.zeros((n,))
    # This is slow, because of the loop in Python
    for i in range(n):
        choices[i] = np.random.choice(items, p=prob_matrix[:, i])
    return choices

def vectorized(prob_matrix, items):
    s = prob_matrix.cumsum(axis=0)
    r = np.random.rand(prob_matrix.shape[1])
    k = (s < r).sum(axis=0)
    return items[k]

m = 10
n = 10000  # Or some very large number
items = np.arange(m)
prob_weights = np.random.rand(m, n)
prob_matrix = prob_weights / prob_weights.sum(axis=0, keepdims=True)
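As a quick sanity check (my own addition, not part of the original answer), you can tile one column of prob_matrix many times and compare the empirical frequencies produced by vectorized against that column's probabilities, using the definitions from the script above:

reps = 100000
pm0 = np.tile(prob_matrix[:, :1], (1, reps))  # column 0, repeated
samples = vectorized(pm0, items)
freq = np.bincount(samples, minlength=m) / reps
print(np.round(freq, 3))               # empirical frequencies...
print(np.round(prob_matrix[:, 0], 3))  # ...should be close to these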