Question
I am not a statistician (more of a research-oriented web developer), but I've been hearing a lot about scipy and R lately. Out of curiosity I wanted to ask this question (though it might sound silly to the experts here), because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (Weibull, Erlang, Cauchy, Exponential, etc.). Are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world scenario: suppose I ran a small survey of 300 people and recorded how many people each person talks to every day, and I have the following information:
1 10
2 5
3 20
...
...
where each row "X Y" tells me that person X talked to Y people during the period of the survey. Using the information from the 300 people, I want to fit this to a model. The question boils down to this: are there any automated ways of finding the right distribution and distribution parameters for this data, or, if not, is there a good step-by-step procedure to achieve the same?
Recommended answer
This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you have a one-dimensional set of data, and a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data. There are two methods for setting the parameters of a probability distribution function given data, one of which is maximum likelihood.
In my experience, maximum likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with a mean of 0 and a standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated from a Gaussian distribution, but you've forgotten (or never knew!) its parameters. You'd like to use the data to get reasonable estimates of the mean and standard deviation. In R, the MASS package (which ships with standard R installations) makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
mean sd
-0.17922360 1.01636446
( 0.10163645) ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are the estimated standard errors of the parameters (not confidence intervals). Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
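For the normal distribution the maximum-likelihood estimates have a closed form, so output of this kind can be reproduced by hand. The following is a minimal sketch in base R; the seed and the variable names (mu_hat, sd_hat, se_mu, se_sd) are illustrative assumptions, not part of the original answer.

```r
set.seed(1)                               # arbitrary seed so the run is repeatable
x <- rnorm(n = 100, mean = 0, sd = 1)
n <- length(x)

mu_hat <- mean(x)                         # ML estimate of the mean
sd_hat <- sqrt(mean((x - mu_hat)^2))      # ML estimate of sd (divides by n, not n - 1)

# Asymptotic standard errors of the two estimates
se_mu <- sd_hat / sqrt(n)
se_sd <- sd_hat / sqrt(2 * n)

print(c(mean = mu_hat, sd = sd_hat))
print(c(se_mean = se_mu, se_sd = se_sd))
```

The same relationship is visible in the output shown above: 0.10163645 is 1.01636446 / sqrt(100), and 0.07186782 is 1.01636446 / sqrt(200).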
Mathematically, this is using maximum likelihood to estimate both the mean and the standard deviation of the Gaussian. Likelihood means (in this case) "probability of the data given values of the parameters." Maximum likelihood means "the values of the parameters that maximize the probability of generating my input data." Maximum likelihood estimation is the procedure for finding those parameter values, and for some distributions it can involve numerical optimization algorithms. In R, most of the work is done by fitdistr, which in certain cases will call optim.
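To make the idea of numerical maximum likelihood concrete, here is a minimal sketch that writes down the negative log-likelihood of a normal model and hands it to optim. It illustrates the general technique only and is not a description of what fitdistr does internally for any particular distribution; the function name neg_loglik, the log(sd) parameterization, and the starting values are assumptions made for this example.

```r
set.seed(1)
x <- rnorm(n = 100, mean = 0, sd = 1)

# Negative log-likelihood of a normal model.
# par[1] is the mean; par[2] is log(sd), so sd stays positive during the search.
neg_loglik <- function(par, data) {
  -sum(dnorm(data, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# optim() minimizes, so minimizing the negative log-likelihood
# maximizes the likelihood. The default method is Nelder-Mead.
fit <- optim(par = c(0, 0), fn = neg_loglik, data = x)

est <- c(mean = fit$par[1], sd = exp(fit$par[2]))
print(est)
print(-fit$value)   # maximized log-likelihood
```

The estimates should agree with the closed-form values mean(x) and sqrt(mean((x - mean(x))^2)) up to the optimizer's tolerance.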
You can extract the log-likelihood from the fitted object like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than the likelihood, to avoid numerical underflow. Computing the joint likelihood of your data involves multiplying many per-point probabilities (or densities), which are typically small. Even for a modest data set the product approaches 0 very quickly, whereas adding the log-probabilities is equivalent to multiplying the probabilities and stays representable. The likelihood is maximized when the log-likelihood is largest, so among fits to the same data, more negative log-likelihoods indicate worse fits.
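The underflow problem is easy to see directly. The sketch below (sample size and seed are arbitrary choices for illustration) compares the raw product of densities with the sum of log-densities for the same data.

```r
set.seed(1)
x <- rnorm(n = 2000, mean = 0, sd = 1)

# Product of 2000 densities: each factor is at most about 0.4,
# so the product is far below the smallest representable double.
lik <- prod(dnorm(x, mean = 0, sd = 1))

# Sum of log-densities: same information, but representable.
loglik <- sum(dnorm(x, mean = 0, sd = 1, log = TRUE))

print(lik)      # underflows to 0 in double precision
print(loglik)   # a finite negative number
```

Because the product collapses to 0 for every candidate parameter value, it cannot be used to compare fits, while the log-likelihood still can.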
With computational tools like this, it's easy to estimate parameters for many distributions. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
print( paste( "fitting parameters for ", dist ) )
params = fitdistr( x, dist )
print( params )
print( summary( params ) )
print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
mean sd
0.72021836 0.54079027
(0.07647929) (0.05407903)
Length Class Mode
estimate 2 -none- numeric
sd 2 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
rate
1.388468
(0.196359)
Length Class Mode
estimate 1 -none- numeric
sd 1 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -33.58996
The exponential fit has the higher log-likelihood (-33.59 versus -40.21), so the exponential distribution is actually more likely to have generated this data than the normal distribution, probably because it doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with fewer parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice of models to a subset of all possible models.
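The effect of extra flexibility can be seen by fitting both an exponential and a gamma model to data that really are exponential. This is a sketch under assumptions: the simulated data, seed, and variable names are illustrative, and the gamma fit is numerical, so it may emit optimizer warnings (suppressed here).

```r
library(MASS)

set.seed(1)
x <- rexp(n = 200, rate = 1.4)     # data that really are exponential

fit_exp   <- fitdistr(x, "exponential")               # 1 parameter
fit_gamma <- suppressWarnings(fitdistr(x, "gamma"))   # 2 parameters

print(fit_exp$loglik)
print(fit_gamma$loglik)
print(fit_gamma$estimate)
```

Since the exponential is the gamma with shape fixed at 1, the gamma fit should reach a log-likelihood at least as high (up to optimizer tolerance) even though the simpler model is the true one. A higher log-likelihood on the fitting data alone is therefore not evidence that the richer model is the right one.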
One trick to get around some problems in parameter estimation is to collect a lot of data, and leave some of it out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.
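A single train/test split along these lines might look like the sketch below. The simulated data stand in for real observations, and the split size, seed, and variable names are assumptions made for illustration; a fuller cross-validation would repeat this over several splits.

```r
library(MASS)

set.seed(1)
x <- rexp(n = 300, rate = 1.4)            # stand-in for real observations

# Hold out one third of the points; fit only on the rest.
test_idx <- sample(length(x), size = 100)
train <- x[-test_idx]
test  <- x[test_idx]

fit_norm <- fitdistr(train, "normal")
fit_exp  <- fitdistr(train, "exponential")

# Log-likelihood of the held-out points under each fitted model.
heldout_norm <- sum(dnorm(test, mean = fit_norm$estimate["mean"],
                          sd = fit_norm$estimate["sd"], log = TRUE))
heldout_exp  <- sum(dexp(test, rate = fit_exp$estimate["rate"], log = TRUE))

print(c(normal = heldout_norm, exponential = heldout_exp))
```

The model with the higher held-out log-likelihood is the one that generalizes better to data it was not fitted on.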