本文介绍了使(三重)高斯适合数据python的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题的简短版本如下:我有一些数据的直方图(行星的密度),似乎有3个偷看。现在,我想让3个高斯人适合此直方图。

The short version of my problem is the following: I have a histogram of some data (density of planets) which seems to have 3 peeks. Now I want to fit 3 gaussians to this histogram.

我期望这个结果。

我使用了不同的方法来拟合我的高斯:curve_fit,最小平方和sklearn.mixture中的GaussianMixture。通过Curve_fit,我可以很好地适应

I used different methods to fit my gauss: curve_fit, least square and GaussianMixture from sklearn.mixture. With Curve_fit I get a pretty good fit

,但如果将其与我的预期结果进行比较,那还不够好。用最小的平方,我得到一个合适的

but it isn't good enough if you compare it to my expected outcome. With least square I get a "good fit"

但是我的高斯人是胡说八道,而使用GaussianMixture却一事无成,因为我无法真正熟练地使用在示例中看到的代码来解决问题。

but my gaussians are nonsense, and with GaussianMixture I don't get anywhere, because I can't really adept the codes I've seen in Examples to my problem.

在这一点上,我有三个问题:


  1. 最重要的是:我怎样才能变得更好适合我的第三高斯?我已经尝试过调整p0的初始值,但是高斯甚至变得更糟,或者根本找不到参数。

  1. Most important: How can I get a better fit for my third gaussian? I already tried to adjust the initial values for p0, but then the gaussian becomes even worst or it doesn't find the parameters at all.

怎么了?用我的最小二乘代码?为什么给我这么奇怪的高斯?有没有办法解决这个问题? 我的猜测:是因为最小二乘会做所有事情来最小化拟合数据和实际数据之间的误差吗?

What is wrong with my least-square Code? Why does it give me such strange gaussians? And is there a way to fix this? My Guess: Is it because least-square does everything to minimize the error between fit and actual data?

我该怎么办?用GaussianMixture进行整件事?我发现了这个帖子

How would I do the whole thing with GaussianMixture? I found this post

但无法适应我的问题。

我真的很想了解如何正确拟合,因为将来我将不得不做很多工作。问题是我的统计数据不是很好,只是开始用python编程。

I really want to understand how to make fits properly, since I will have to do them a lot in the future. The problem is that I am not very good in statistics and just started programming in python.

这是我的三个不同代码:

Here are my three different codes:

曲线

import numpy as np
import math
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

hist, bin_edges = np.histogram(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32))

bin_centres = (bin_edges[:-1] + bin_edges[1:])/2

# Define model function to be used to fit to the data above:
def triple_gaussian(  x,*p ):
    (c1, mu1, sigma1, c2, mu2, sigma2, c3, mu3, sigma3) = p
    res =    np.divide(1,x)*c1 * np.exp( - (np.log(x) - mu1)**2.0 / (2.0 * sigma1**2.0) ) \
          +  np.divide(1,x)*c2 * np.exp( - (np.log(x) - mu2)**2.0 / (2.0 * sigma2**2.0) ) \
          +  np.divide(1,x)*c3 * np.exp( - (np.log(x) - mu3)**2.0 / (2.0 * sigma3**2.0) )
    return res

# p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
p0 = [60., 1, 1., 30., 1., 1.,10., 1., 1]

coeff, var_matrix = curve_fit(triple_gaussian, bin_centres, hist, p0=p0)

# Get the fitted curve
hist_fit = triple_gaussian(bin_centres, *coeff)

c1 =coeff[0]
mu1 =coeff[1]
sigma1 =coeff[2]
c2 =coeff[3]
mu2 =coeff[4]
sigma2 =coeff[5]
c3 =coeff[6]
mu3 =coeff[7]
sigma3 =coeff[8]
x= bin_centres

gauss1= np.divide(1,x)*c1 * np.exp( - (np.log(x) - mu1)**2.0 / (2.0 * sigma1**2.0) )
gauss2= np.divide(1,x)*c2 * np.exp( - (np.log(x) - mu2)**2.0 / (2.0 * sigma2**2.0) )
gauss3= np.divide(1,x)*c3 * np.exp( - (np.log(x) - mu3)**2.0 / (2.0 * sigma3**2.0) )

plt.plot(x,gauss1, 'g',label='gauss1')
plt.plot(x,gauss2, 'b', label='gauss2')
plt.plot(x,gauss3, 'y', label='gauss3')
plt.gca().set_xscale("log")
plt.legend(loc='upper right')
plt.ylim([0,70])
plt.suptitle('Triple log Gaussian fit over all Data', fontsize=20)
plt.xlabel('log(Density)')
plt.ylabel('Number')

plt.hist(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32), label='all Densities')
plt.plot(bin_centres, hist, label='Test data')
plt.plot(bin_centres, hist_fit, label='Fitted data')
plt.gca().set_xscale("log")
plt.ylim([0,70])
plt.suptitle('triple log Gaussian fit using curve_fit', fontsize=20)
plt.xlabel('log(Density)')
plt.ylabel('Number')
plt.legend(loc='upper right')
plt.annotate(Text1, xy=(0.01, 0.95), xycoords='axes fraction')
plt.annotate(Text2, xy=(0.01, 0.90), xycoords='axes fraction')
plt.savefig('all Densities_gauss')
plt.show()

最小二乘

适合自身看起来很糟糕,但是这三个高斯人都太恐怖了。在这里看到

The fit itselfe doesn't look to bad, but the 3 gaussians are horrible. See here

    # I only have x-data, so to get according y-data I make my histogram and
 #use the bins as x-data and the numbers (hist) as y-data.
#Density is a Dataset of 581 Values between 0 and 340.

hist, bin_edges = np.histogram(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32))
x = (bin_edges[:-1] + bin_edges[1:])/2
y = hist

#define tripple gaussian

def triple_gaussian(  p,x ):
    (c1, mu1, sigma1, c2, mu2, sigma2, c3, mu3, sigma3) = p
    res =    np.divide(1,x)*c1 * np.exp( - (np.log(x) - mu1)**2.0 / (2.0 * sigma1**2.0) ) \
          +  np.divide(1,x)*c2 * np.exp( - (np.log(x) - mu2)**2.0 / (2.0 * sigma2**2.0) ) \
          +  np.divide(1,x)*c3 * np.exp( - (np.log(x) - mu3)**2.0 / (2.0 * sigma3**2.0) )
    return res

def errfunc(p,x,y):
   return y-triple_gaussian(p,x)

p0=[]
p0 = [60., 0.1, 1., 30., 1., 1.,10., 10., 1.]
fit = optimize.leastsq(errfunc,p0,args=(x,y))

print('fit', fit)



plt.plot(x,y)
plt.plot(x,triple_gaussian(fit[0],x), 'r')
plt.gca().set_xscale("log")
plt.ylim([0,70])
plt.suptitle('Double log Gaussian fit over all Data', fontsize=20)
plt.xlabel('log(Density)')
plt.ylabel('Number')

c1, mu1, sigma1, c2, mu2, sigma2, c3, mu3, sigma3=fit[0]

print('c1', c1)

gauss1= np.divide(1,x)*c1 * np.exp( - (np.log(x) - mu1)**2.0 / (2.0 * sigma1**2.0) )
gauss2= np.divide(1,x)*c2 * np.exp( - (np.log(x) - mu2)**2.0 / (2.0 * sigma2**2.0) )
gauss3= np.divide(1,x)*c3 * np.exp( - (np.log(x) - mu3)**2.0 / (2.0 * sigma3**2.0) )

plt.plot(x,gauss1, 'g')
plt.plot(x,gauss2, 'b')
plt.plot(x,gauss3, 'y')
plt.gca().set_xscale("log")
plt.ylim([0,70])
plt.suptitle('Double log Gaussian fit over all Data', fontsize=20)
plt.xlabel('log(Density)')
plt.ylabel('Number')

GaussianMixture

正如我之前所说,我不太了解GaussianMixture。我不知道是否必须像以前一样定义三次高斯,或者是否足以定义高斯,而GaussianMixture会发现自己存在三次高斯。
我也不知道必须在哪里使用哪些数据,因为当我使用bin和hist值时,拟合曲线就是彼此相连的数据点。所以我发现我使用了错误的数据。

As I said befor, I don't really understand GaussianMixture. I don't know if I have to define a triplegauss like before, or if it's enough to define the gauss, and GaussianMixture will find out that there is a triple gauss on its own.I also don't understand where I have to use which data, because when I use the bins and hist values, then the "fitted curve" are just the datapoints connected with each other. So I figured that I used the wrong data.

我不理解的部分是 #Fit GMM #Construct函数

hist, bin_edges = np.histogram(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32))

bin_centres = (bin_edges[:-1] + bin_edges[1:])/2

plt.hist(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32), label='all Densities')
plt.gca().set_xscale("log")
plt.ylim([0,70])

# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return np.divide(1,x)*amp * np.exp(-(np.log(x) - x0) ** 2. / (2. * sigma ** 2.))

# My Data
samples = Density

# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.00001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))

gmm_x= bin_centres
gmm_y= hist
# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_.ravel()):
    gauss = gauss_function(x=gmm_x, amp=1, x0=m, sigma=np.sqrt(c))
    gmm_y_sum += gauss / np.trapz(gauss, gmm_x) *w

# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
ax.hist(samples, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32), label='all Densities')
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum", linestyle="dashed")
plt.gca().set_xscale("log")
plt.ylim([0,70])

# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")

# Make legend
plt.legend()

plt.show()

我希望有人至少可以帮助我解决我的一个问题。正如我之前所说,如果有任何遗漏或需要更多信息,请告诉我。

I hope that anyone can help me at least with one of my questions. As I said befor, if anything is missing or if you need more information please tell me.

谢谢!

-编辑-
是我的数据。

推荐答案

链接到实际数据会有所帮助,但是我可以在没有数据的情况下提出一些建议。

Having a link to actual data would be helpful, but I can make a few recommendations without the data.

首先,进行转换 x np.log(x)很容易,因此值得付出努力。

First, converting x to np.log(x) is so easy that it is probably worth the effort.

第二,高斯的定义通常不包含 1./x -可能效果不大,但您的价值观 x 的变化幅度很大,所以也许没有变化。

Second, the definition for Gaussian doesn't normally include 1./x -- it might be a small effect, but your values of x are changing by an order of magnitude, so maybe not.

第三,您提供的是相同的所有三个高斯人的 mu 的起始值。这使拟合更加困难。尝试给出更接近实际期望值的起点,并在可能的范围内给出起点。

Third, you're providing the same starting value for mu for all three Gaussians. This makes the fit much harder. Try to give starting points that are closer to the actual expected values, and if possible bounds on those values.

为解决这些问题,您可能会找到lmfit()很有帮助。它肯定会使您的脚本更短,例如

To help address these points, you might find lmfit (https://lmfit.github.io/lmfit-py/) helpful. It will definitely make your script shorter, perhaps something like

import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel

y, bin_edges = np.histogram(Density, bins=np.logspace(np.log10(MIN), np.log10(MAX), 32))

x = np.log((bin_edges[:-1] + bin_edges[1:])/2.0) #take log here

# build a model as a sum of 3 Gaussians
model = (GaussianModel(prefix='g1_') + GaussianModel(prefix='g2_') +
         GaussianModel(prefix='g3_'))

# build Parameters with initial values
params = model.make_params(g1_amplitude=60, g1_center=-1.0, g1_sigma=1,
                           g2_amplitude=30, g2_center= 0.0, g1_sigma=1,
                           g2_amplitude=10, g2_center= 1.0, g1_sigma=1)

# optionally, set bound / constraints on Parameters:
params['g1_center'].max = 0

params['g2_center'].min = -1.0
params['g2_center'].max = 1.0

params['g3_center'].min = 0

# perform the actual fit
result = model.fit(y, params, x=x)

# print fit statistics and values and uncertainties for variables
print(result.fit_report())

# evaluate the model components ('g1_', 'g2_', and 'g3_')
comps = result.eval_components(result.params, x=x)

# plot the results
plt.plot(x, y, label='data')
plt.plot(x, result.best_fit, label='best fit')

plt.plot(x, comps['g1_'], label='gaussian1')
plt.plot(x, comps['g2_'], label='gaussian2')
plt.plot(x, comps['g3_'], label='gaussian3')
# other plt methods for axes and labels
plt.show()

如果模型确实需要(1 / x)乘以高斯,或者您需要其他函数形式。您可以使用内置的LognormalModel或其他内置模型之一,也可以轻松编写自己的模型函数并将其包装。

If your model really needs (1/x) times a Gaussian, or you need a different functional form. You could use the built-in LognormalModel, one of the other built-in Models, or easily write your own model function and wrap that.

希望有帮助。

这篇关于使(三重)高斯适合数据python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-14 10:55