python - 一次完成多次curve_fit的迭代以实现分段功能

我正在尝试一次执行Scipy的curve_fit的许多迭代，以便避免循环并因此提高速度。

这与this problem非常相似，已解决。但是，由于函数是分段的(不连续的)，因此该解决方案不适用于此处。

考虑以下示例:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0,4):
    b = i+1
    a = b
    print(a,b)
    Y[i] = (X/b)**(-a) #+ 0.01 * rng.randn(6)
    Y[i, X>b] = 1

这将产生以下数组:

python - 一次完成多次curve_fit的迭代以实现分段功能-LMLPHP

如您所见，这在X==b处是不连续的。我可以迭代地使用a来检索b和curve_fit的原始值:

def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])


coeffs=[]
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)

但是，根据X，Y和循环的大小，此过程可能会非常慢，因此我试图通过尝试获取coeffs而不需要循环来加快处理速度。到目前为止，我还没有运气。

可能重要的事情:

X和Y仅包含正值

a和b始终为正

尽管此示例中要拟合的数据是平滑的(为简单起见)，但实际数据具有噪声

编辑

据我所知:

y=np.ma.masked_where(Y<1.01, Y)

lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m,c=np.linalg.lstsq(A, lY.T)[0]

print('a=',-m)
print('b=',np.exp(-c/m))

但是，即使没有任何噪音，输出也仍然是:

a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]

这比我希望得到的要糟糕得多。

最佳答案

这里有三种加快速度的方法。您没有提供所需的速度或精度，甚至没有向量大小，因此买家要当心。

TL; DR

时间:

len       1      2      3      4
1000    0.045  0.033  0.025  0.022
10000   0.290  0.097  0.029  0.023
100000  3.429  0.767  0.083  0.030
1000000               0.546  0.046

1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)

用子集进行预先估计(方法2):

只需运行两次curve_fit即可获得不错的加速效果，其中第一次使用curve_fit的一小部分数据来快速估算。然后，该估计值将用于整个数据集的curve_fit播种。

x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
return curve_fit(power_law, x, y, p0=c0)[0]

M Newville linear log-log estimate(方法3):

使用M Newville提出的对数估计，速度也相当快。由于OP担心Newville提出的初始估计方法，因此该方法使用ojit_code及其子集来提供曲线断点的估计。

x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]

index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
return -result[0], np.exp(-result[1] / result[0])
return (m, c), result

使用较少的数据(方法4):

最后，前两种方法所使用的种子机制对样本数据提供了很好的估计。当然，这是示例数据，因此您的里程可能会有所不同。

stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]

测试代码:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress

fit_data = {}
current_data = None

def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]


def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    return curve_fit(power_law, x, y)[0]

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()

    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f  %s" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt),
        eval(stmt + '()')[0]
    ))

def runit_size(size):

    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')


for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)

关于python - 一次完成多次curve_fit的迭代以实现分段功能，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/44957705/