Problem Description
Running a simple regression model in Julia in the presence of perfect multicollinearity produces an error. In R, we can run the same model, which produces NAs in the estimates of the corresponding covariates; R reports these as "not defined because of singularities". We can identify those variables using the alias() function in R.
Is there any way I can check for perfect multicollinearity in Julia prior to modeling in order to drop the collinear variables?
Recommended Answer
Detecting perfect collinearity
Suppose that X is your design matrix. You can check for perfect multicollinearity by running:
using LinearAlgebra   ## provides rank()

rank(X) == size(X, 2)
This will yield false if you have perfect multicollinearity.
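For example, a minimal sketch with a made-up rank-deficient matrix (the third column is twice the first):

X = [1.0 2.0 2.0;
     3.0 1.0 6.0;
     5.0 4.0 10.0]

rank(X) == size(X, 2)   ## false: rank(X) is 2, but X has 3 columns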
Identifying near collinearity and finding which columns are collinear or nearly collinear
I don't know of any specific built-in functions for this. But an application of some basic principles of linear algebra can determine it pretty easily. Below is a function I wrote which does this, followed by a more detailed explanation for those interested. The gist of it is that we want to find the eigenvalues of X'X that are zero (for perfect collinearity) or close to zero (for near collinearity). Then we find the eigenvectors associated with those eigenvalues. The components of those eigenvectors that are non-zero (for perfect collinearity) or moderately large-ish (a term that is ambiguous, since "near collinearity" is itself ambiguous) mark the columns that have collinearity problems.
using LinearAlgebra

function LinDep(A::AbstractMatrix, threshold1::Float64 = 1e-6, threshold2::Float64 = 1e-1; eigvec_output::Bool = false)
    ## eigen-decompose the (symmetric) cross-product matrix A'A
    L, Q = eigen(Symmetric(A' * A))
    max_L = maximum(abs.(L))
    conditions = max_L ./ abs.(L)   ## ratio of the largest eigenvalue to each eigenvalue
    max_C = maximum(conditions)
    println("Max Condition = $max_C")
    Collinear_Groups = Vector{Int}[]
    Tricky_EigVecs = Vector{Float64}[]
    for (idx, lambda) in enumerate(L)
        if abs(lambda) < threshold1
            ## a (near-)zero eigenvalue signals (near-)collinearity; the columns
            ## with large eigenvector components are the ones involved
            push!(Collinear_Groups, findall(abs.(Q[:, idx]) .> threshold2))
            push!(Tricky_EigVecs, Q[:, idx])
        end
    end
    if eigvec_output
        return (Collinear_Groups, Tricky_EigVecs)
    else
        return Collinear_Groups
    end
end
A simple example to start with. It's easy to see that this matrix has collinearity problems:
A1 = [1 3 1 2 ; 0 0 0 0 ; 1 0 0 0 ; 1 3 1 2]
4×4 Matrix{Int64}:
 1  3  1  2
 0  0  0  0
 1  0  0  0
 1  3  1  2
Collinear_Groups1 = LinDep(A1)
Max Condition = 5.9245306995900904e16
2-element Vector{Vector{Int64}}:
 [2, 3]
 [2, 3, 4]
There are two eigenvalues equal to 0 here, so the function gives us two sets of "problem" columns. We want to remove one or more of these columns to address the collinearity. As is the nature of collinearity, there is no single "right" answer: e.g. Col3 is clearly just 1/2 of Col4, so we could remove either one to address that collinearity issue.
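One way to act on that output, sketched below (this greedy drop_collinear loop is not part of the original answer; it just drops one flagged column at a time until the rank check passes):

function drop_collinear(A::AbstractMatrix)
    keep = collect(1:size(A, 2))
    while rank(A[:, keep]) < length(keep)
        groups = LinDep(A[:, keep])
        isempty(groups) && break          ## nothing flagged; give up
        deleteat!(keep, groups[1][end])   ## drop one offending column
    end
    return keep
end

keep_cols = drop_collinear(A1)   ## for A1 this leaves two independent columns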
Note: the max condition here is the largest ratio of the maximum eigenvalue to each of the other eigenvalues. A general guideline is that a max condition > 100 means moderate collinearity, and > 1000 means severe collinearity (see e.g. Wikipedia). But a LOT depends on the specifics of your situation, and so relying on simplistic rules like this is not particularly advisable. It's much better to consider this as one factor amongst many, including things like analysis of the eigenvectors and your knowledge of the underlying data and where you suspect collinearity might or might not be present. In any case, we see that it is huge here, which is to be expected.
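As an aside, Julia's built-in cond function (from LinearAlgebra) gives the 2-norm condition number of the matrix itself; since the singular values of X are the square roots of the eigenvalues of X'X, cond(X)^2 is on the same scale as the eigenvalue ratio printed above. A minimal sketch, with a made-up near-duplicate column:

Y = rand(100, 5)
Y[:, 5] = Y[:, 1] + 1e-8 * randn(100)   ## hypothetical near-duplicate column
cond(Y)     ## very large for a near-collinear matrix
cond(Y)^2   ## comparable to the "Max Condition" printed by LinDep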
Now, let's consider a harder situation where there isn't perfect collinearity, but just near collinearity. We can use the function as is, but I think it is helpful to switch on the eigvec_output option so we can see the eigenvectors that correspond to the problematic eigenvalues. You may also want to tinker with the specified thresholds to adjust the sensitivity for picking up near collinearity, or just set them both pretty big (particularly the second one) and spend most of your time examining the eigenvector output.
using Random
Random.seed!(42);   ## set random seed for reproducibility
N = 10
A2 = rand(N, N);
A2[:, 2] = 2 * A2[:, 3] + 0.8 * A2[:, 4] + (rand(N) / 100);   ## near collinearity
(Collinear_Groups2, Tricky_EigVecs2) = LinDep(A2, eigvec_output = true)
Max Condition = 4.6675275950744677e8
Our max condition is notably smaller now, which is nice, but still clearly quite severe.
Collinear_Groups2
1-element Vector{Vector{Int64}}:
 [2, 3, 4]
Tricky_EigVecs2[1]
10-element Vector{Float64}:
0.00537466
0.414383
-0.844293
-0.339419
0.00320918
0.0107623
0.00599574
-0.00733916
-0.00128179
-0.00214224
Here we see that columns 2, 3, and 4 have relatively large components in the associated eigenvector. This shows us that these are the problematic columns for near collinearity, which of course is what we expected given how we created the matrix!
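If the big components were less obvious by eye, one could sort them by magnitude, e.g.:

v = Tricky_EigVecs2[1]
order = sortperm(abs.(v), rev = true)   ## column indices, largest component first
order[1:3]                              ## [3, 2, 4] for this eigenvector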
From basic linear algebra, any real symmetric matrix can be diagonalized as:
A = Q * L * Q'
where L is a diagonal matrix containing its eigenvalues and Q is a matrix of its corresponding eigenvectors.
Thus, suppose that we have a design matrix X in a regression analysis. The matrix X'X will always be symmetric and thus diagonalizable as described above.
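This is easy to verify numerically; a quick sketch (X here is just an arbitrary made-up design matrix):

using LinearAlgebra

X = rand(8, 3)
S = X' * X
issymmetric(S)             ## true
L, Q = eigen(Symmetric(S))
Q * Diagonal(L) * Q' ≈ S   ## true: S = Q * L * Q'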
Similarly, we will always have rank(X) = rank(X'X), meaning that if X contains linearly dependent columns and is less than full rank, then so is X'X.
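For example, with a made-up matrix whose third column duplicates the first:

X = [1.0 2.0 1.0;
     3.0 1.0 3.0;
     5.0 4.0 5.0;
     2.0 0.0 2.0]
rank(X) == rank(X' * X)   ## true: both ranks are 2, less than the 3 columns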
Now, recall that by the definition of an eigenvalue L[i] and its eigenvector Q[:,i], we have:
A * Q[:,i] = L[i] * Q[:,i]
In the case that L[i] = 0, this becomes:
A * Q[:,i] = 0
for some non-zero Q[:,i]. This is the definition of A having a linearly dependent column.
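We can check this with the A1 example from above:

L, Q = eigen(Symmetric(A1' * A1))
i = argmin(abs.(L))   ## index of the (numerically) smallest eigenvalue
A1 * Q[:, i]          ## ≈ the zero vector, confirming linear dependence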
Furthermore, A * Q[:,i] = 0 can be rewritten as a sum of the columns of A weighted by the components of Q[:,i]. Thus, if we let S1 and S2 be two mutually exclusive sets covering those columns, we have:
sum (j in S1) A[:,j]*Q[j,i] = - sum (j in S2) A[:,j]*Q[j,i]
I.e. some combination of columns of A can be written as a weighted combination of the other columns.
Thus, if we knew for instance that L[i] = 0 for some i, and we then looked at the corresponding Q[:,i] and saw Q[:,i] = [0 0 1 0 2 0], then we would know that column 3 = -2 times column 5, and thus we would want to remove one or the other.
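To make that concrete, here is a sketch with a made-up matrix B built so that column 3 = -2 times column 5; the eigenvector for the (near-)zero eigenvalue comes out proportional to [0 0 1 0 2 0]:

B = rand(10, 6)
B[:, 3] = -2 * B[:, 5]      ## force column 3 = -2 * column 5
L, Q = eigen(Symmetric(B' * B))
q = Q[:, argmin(abs.(L))]   ## eigenvector for the zero eigenvalue
q / q[3]                    ## ≈ [0, 0, 1, 0, 2, 0]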