如何在Python中向量化多元标准CDF(累积密度函数)?
当查看this帖子时,我发现有一个“移植”到Python的多变量CDF的Fortran实现。这意味着我可以轻松评估一种特定情况下的CDF。
但是,我很难有效地将此功能应用于多个条目。
具体来说,我需要“向量化”的函数有4个参数:

  • 积分(向量)的下限
  • 积分(向量)的上限
  • 正常随机变量(向量)的均值
  • 正常随机变量(矩阵)的协方差矩阵

  • 但是我正在尝试多次评估超过1000个元素的列表中的此功能。
    这是一些代码来说明我的问题。在下面的示例中,我仅使用随机数据来说明我的观点。
    import time
    import numpy as np
    from scipy.stats.mvn import mvnun # library that calculates MVN CDF
    
    np.random.seed(666)
    
    iters = 1000 # number of times the whole dataset will be evaluated
    obs = 1500 # number of elements in the dataset
    dim = 2 # dimension of multivariate normal distribution
    
    lower = np.random.rand(obs,dim)
    upper = lower + np.random.rand(obs,dim)
    means = np.random.rand(obs,dim)
    
    # Creates a symmetric matrix - used for the random covariance matrices
    def make_sym_matrix(dim,vals):
        m = np.zeros([dim,dim])
        xs,ys = np.triu_indices(dim,k=1)
        m[xs,ys] = vals[:-dim]
        m[ys,xs] = vals[:-dim]
        m[ np.diag_indices(dim) ] = vals[-dim:]
        return m
    
    # Generating the random covariance matrices
    covs = []
    for i in range(obs):
        cov_vals = np.random.rand(int((dim**2 + dim)/2))
        cov_mtx = make_sym_matrix(dim,cov_vals)
        covs.append(cov_mtx)
    covs = np.array(covs)
    
    # Here is where the trouble starts.
    time_start = time.time()
    for i in range(iters):
        results = []
        for j in range(obs):
            this_p, this_i = mvnun(lower[j],upper[j],means[j],covs[j])
            results.append(this_p)
    time_end = time.time()
    
    print(time_end-time_start)
    
    在这里,我有一个包含1500个观察值的数据集,我正在评估1000次。在我的机器上,这需要6.74399995804秒来计算。
    请注意,我并不是想摆脱外部for循环(在i上)。我只是创建它来模仿我的实际问题。我真的试图消除的for循环是内部循环(超过j)。
    如果我找到了一种有效评估整个数据集CDF的方法,则执行时间可能为,大大减少了
    我知道mvnun函数最初是用Fortran编写的(原始代码here),并使用f2pye“移植”到Python,如here所示。
    谁能帮我这个忙吗?我已经开始研究theano了,但是看来我唯一的选择就是使用扫描功能,这可能也没有太大的改进。
    谢谢!!!

    最佳答案

    这只是部分答案,但是如果多元正态分布的维数为小(2或3),并且协方差矩阵保持不变,则有一种提高速度的方法。

    import numpy as np
    import openturns as ot
    
    def computeRectangularDomainProbability(lower, upper, means, cov_matrix):
        """
        Compute the probability of a rectangular solid
        under a multinormal distribution.
    
        """
        # Center the bounds of the rectangular solid on the mean
        lower -= means
        upper -= means
    
        # The same covariance matrix for all rectangular solids.
        cov_matrix = ot.CovarianceMatrix(cov_matrix)
    
        # This way, we only need to define one multivariate normal distribution.
        # That is the trick that allows vectorization.
        dimension = lower.shape[1]
        multinormal = ot.Normal([0.0] * dimension, cov_matrix)
    
        # The probability of the rectangular solid is a weighted sum
        # of the CDF of the vertices (with weights equal to 1 or -1).
        # The following block computes the CDFs and applies the correct weight.
        full_reverse_binary = np.array(list(bin(2**dimension)[:1:-1]), dtype=int)
        prob = 0.0
        for i in range(2**dimension):
            reverse_binary = np.array(list(bin(i)[:1:-1]), dtype=int)
            reverse_binary = np.append(reverse_binary,
                                       np.zeros(len(full_reverse_binary) -
                                                len(reverse_binary) -
                                                1)).astype(int)
            point = np.zeros(lower.shape)
            for num, digit in enumerate(reverse_binary):
                if digit:
                    point[:, num] = upper[:, num]
                else:
                    point[:, num] = lower[:, num]
            cdf = np.array(multinormal.computeCDF(point))
            if (reverse_binary.sum() % 2) == (dimension % 2):
                prob += cdf
            else:
                prob -= cdf
    
        return prob.reshape(-1,)
    
    测试脚本:维度2
    iters = 10 # loop size
    obs = 1500 # number of rectangular solids
    dim = 2 # dimension of multivariate normal distribution
    
    import time
    import numpy as np
    from scipy.stats.mvn import mvnun # library that calculates MVN CDF
    from sklearn.datasets import make_spd_matrix
    import openturns as ot
    
    time_mvnun = 0.0
    time_openturns = 0.0
    discrepancy = 0.0
    np.random.seed(0)
    
    for iteration in range(iters):
    
        lower = np.random.rand(obs,dim)
        upper = lower + np.random.rand(obs,dim)
        means = np.random.rand(obs,dim)
    
        # Generating the random covariance matrices with sklearn
        # to make sure they are positive semi-definite
        cov_mtx = make_spd_matrix(dim)
    
    
        # mvnun is 100 times faster than openturns
    
        time_start = time.time()
        results = []
        for j in range(obs):
            this_p, this_i = mvnun(lower[j],upper[j],means[j],cov_mtx)
            results.append(this_p)
        results = np.array(results)
        time_end = time.time()
        time_mvnun += time_end - time_start
    
    
        time_start = time.time()
        otparallel = computeRectangularDomainProbability(lower, upper, means, cov_mtx)
        time_end = time.time()
        time_openturns += time_end - time_start
    
        mvnorm_vs_otparallel = np.abs(results - otparallel).sum()
        discrepancy += mvnorm_vs_otparallel
    
    print('Dimension {}'.format(dim))
    
    # Print computation time
    print('mvnun     time: {0:e}'.format(time_mvnun))
    print('openturns time: {0:e}'.format(time_openturns))
    print('ratio mvnun/ot: {0:f}'.format(time_mvnun / time_openturns))
    
    # Check that the results are the same for mvnum and openturns
    print('mvnun-openturns result discrepancy: {0:e}'.format(discrepancy))
    
    我的机器上的输出:
    Dimension 2
    mvnun     time: 4.040635e+00
    openturns time: 3.588211e+00
    ratio mvnun/ot: 1.126086
    mvnun-openturns result discrepancy: 8.057912e-11
    
    略有加快:超过10%。
    维度3
    让我们更改控制脚本的全局变量。
    iters = 100 # loop size
    obs = 1500 # number of rectangular solids
    dim = 3 # dimension of multivariate normal distribution
    
    我的机器上的输出:
    Dimension 3
    mvnun     time: 2.378337e+01
    openturns time: 1.596872e+00
    ratio mvnun/ot: 14.893725
    mvnun-openturns result discrepancy: 4.537064e-03
    
    yield 在维度3上要重要得多:拟议的代码快15倍。
    维度4
    不幸的是,openturns在维度4上放慢了很多速度。它包含针对维度1、2和3的CDF的智能实现,但是从大于3的维度上转而采用了更慢,更通用的实现。
    iters = 1 # loop size
    obs = 15 # number of rectangular solids
    dim = 4 # dimension of multivariate normal distribution
    
    Dimension 4
    mvnun     time: 7.289171e-03
    openturns time: 3.689714e+01
    ratio mvnun/ot: 0.000198
    mvnun-openturns result discrepancy: 6.297527e-07
    
    在第4维中,建议的代码要慢4个数量级!这可能是因为在尺寸4中,它需要为每个矩形实体计算16 = 2 ^ 4 CDF,并且这些计算中的每一个都比在较小的尺寸中要慢。

    关于python - 在Python中向量化多元标准CDF(累积密度函数),我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/46396662/

    10-12 18:10