Simple binary logistic regression using MATLAB

I'm working on a logistic regression in MATLAB for a simple classification problem. My covariate is a single continuous variable ranging between 0 and 1, and my categorical response is a binary variable: 0 (incorrect) or 1 (correct).

I want to fit a logistic regression that outputs the probability that a given input observation (a value of the continuous variable described above) is correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running it in MATLAB.

My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally sized column vector Y that contains the known classification (0 or 1) of each value of X.
I'm using the following code:

    [b,dev,stats] = glmfit(X,Y,'binomial','link','logit');

However, this gives me nonsensical results: p = 1.000, extremely large coefficients b (-650.5, 1320.1), and associated standard errors on the order of 1e6.

I then tried using an additional parameter to specify the size of my binomial sample:

    glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));

This gave me results that were more in line with what I expected. I extracted the coefficients, created an array for the fitting (X_fit = linspace(0,1)), and used glmval to create estimates (Y_fit = glmval(b,X_fit,'logit');). When I overlaid the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the model curve essentially looked like the lower quarter of the S-shaped curve that is typical of logistic regression plots.

My questions are as follows:

1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to take the stats output from glmfit, but my use of glmfit is not giving correct results.

Any comments and input would be very useful, thanks!

UPDATE (3/18/14)

I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply turns my binary response into a nominal one.

I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some other appropriate input range and ii = 1:length(loopVal).

The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example?
I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.

Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only meaningful when compared to dev values from other models?

Solution

It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, this means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).

If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.

This is bad news for logistic regression (LR), as LR isn't really meant to deal with problems where the data are linearly separable.

Logistic regression is trying to fit a function of the following form (the standard logistic function, written here with weights w0 and w1 for a single input x):

    y = 1 / (1 + exp(-(w0 + w1*x)))

This returns exactly y = 0 or y = 1 only when the expression within the exponential in the denominator goes to infinity or negative infinity.

Now, because your data is linearly separable, and MATLAB's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values: the likelihood keeps improving as the weights grow, so the fit never settles on finite values.

This isn't necessarily a solution, but try flipping the label on just one of your data points (so for some index t where y(t) == 0, set y(t) = 1). This will cause your data to no longer be linearly separable, and the learned weight values will be dragged dramatically closer to zero.
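As a rough illustration of the answer above, the following MATLAB sketch checks whether a one-dimensional binary data set is linearly separable and shows what flipping a single label does to that check. The toy X and Y values, the helper isSeparable1D, and the index t are made up for illustration and are not taken from the question. The glmfit calls assume the Statistics and Machine Learning Toolbox is available and are skipped otherwise.

```matlab
% Hypothetical toy data: one covariate in [0,1] and a binary response.
X = [0.05; 0.15; 0.30; 0.45; 0.55; 0.70; 0.85; 0.95];
Y = [0; 0; 0; 0; 1; 1; 1; 1];

% In 1-D, the data are separable if one class lies entirely below the other.
% (isSeparable1D is an illustrative helper, not a MATLAB built-in.)
isSeparable1D = @(x, y) max(x(y == 0)) < min(x(y == 1)) || ...
                        max(x(y == 1)) < min(x(y == 0));

sepBefore = isSeparable1D(X, Y);
fprintf('Separable before flip: %d\n', sepBefore);

% Flip a single label, as suggested in the answer (t is chosen arbitrarily).
t = 2;
Yflip = Y;
Yflip(t) = 1 - Yflip(t);

sepAfter = isSeparable1D(X, Yflip);
fprintf('Separable after flip:  %d\n', sepAfter);

% Optional comparison of the fitted coefficients (toolbox required).
if exist('glmfit', 'file')
    % On separable data, glmfit typically warns and returns very large coefficients.
    bSep  = glmfit(X, Y,     'binomial', 'link', 'logit');
    % After the flip, the coefficients should be finite and much smaller in magnitude.
    bFlip = glmfit(X, Yflip, 'binomial', 'link', 'logit');
    disp([bSep, bFlip]);
end
```

If the check reports separable data on your own X and Y, that would be consistent with the huge coefficients and standard errors glmfit returned in the question.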