Problem description
I am trying to pass my dataset to the MATLAB function [ranked,weights] = relieff(X,Ylogical,10,'categoricalx','on') to rank the importance of my predictor features. The dataset <double n*m> has n observations and m discrete (i.e. categorical) features. It happens that each observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved (i.e. missing or null) predictor values in the dataset. (There is no corruption in the dataset; it is just incomplete.)
relieff() uses this function below to remove any rows that contain a NaN:
function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(@or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];
This is not ideal, especially for my case, since it leaves me with X=[] and Y=[] (i.e. no observations!)
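To make the effect concrete, here is a minimal sketch on made-up toy data (the values of X and Y are hypothetical, not from the real dataset). It applies the same row-deletion rule as removeNaNs to a matrix in which every row has at least one NaN.

```matlab
% Hypothetical toy data: 3 observations, 3 categorical features coded as doubles.
% Every row contains at least one NaN, as in the dataset described above.
X = [1 NaN 2; NaN 3 1; 2 2 NaN];
Y = [1; 0; 1];

% Same rule as removeNaNs: drop a row if Y or any entry of X in that row is NaN.
NaNidx = isnan(Y) | any(isnan(X), 2);
X(NaNidx, :) = [];
Y(NaNidx, :) = [];

% No observations survive, so there is nothing left to rank.
nRemaining = size(X, 1);
```

With this input all three rows are flagged, so nRemaining is 0.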
In this case:
1) Would replacing all NaNs with an arbitrary placeholder value, e.g. 99999, help? By doing this, I am introducing a new feature state for all the predictor features, so I guess it is not ideal.
2) Or is replacing NaNs with the mode of the corresponding feature column vector (as below) statistically more sound? (I am not vectorising, for clarity's sake.)
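function matrixdata = replaceNaNswithModes(matrixdata)
% Replace the NaNs in each column with that column's mode.
% mode() ignores NaN values when it computes the most frequent value.
for i = 1:size(matrixdata,2)
    cv = matrixdata(:,i);
    modevalue = mode(cv);
    cv(isnan(cv)) = modevalue;   % logical indexing; find() is not needed
    matrixdata(:,i) = cv;
end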
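For comparison, a vectorised version of the same mode imputation might look like the sketch below. The toy matrix and the names colModes, missingMask and Xfilled are illustrative assumptions. A column that is entirely NaN would still come back as NaN, because its mode is NaN.

```matlab
% Hypothetical toy data: 4 observations, 3 categorical features coded as doubles.
X = [1 NaN 2; NaN 3 1; 1 3 NaN; 2 3 1];

colModes = mode(X, 1);           % 1-by-m row of per-column modes (NaNs are ignored)
missingMask = isnan(X);          % n-by-m logical mask of missing entries
[~, col] = find(missingMask);    % column index of each missing entry, in linear-index order

Xfilled = X;
Xfilled(missingMask) = colModes(col);   % fill each NaN with the mode of its own column
```

Here the column modes are 1, 3 and 1, so each NaN is replaced by the most frequent observed value in its column.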
3) Or is there any other sensible approach that would make sense for "categorical" data?
P.S: This link gives possible ways to handle missing data.
Recommended answer
I suggest using a table instead of a matrix. Then you have functions such as ismissing (for the entire table) and isundefined to deal with missing values in categorical variables.
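T = array2table(matrix);         % variables are named matrix1, matrix2, ...
% NaN is already the standard missing indicator for double. For other codes,
% standardizeMissing needs the code as a second argument, e.g.
% T = standardizeMissing(T, -99);   % -99 is a hypothetical placeholder code
var1 = categorical(T.matrix1);   % NaN becomes <undefined>
isMissing = isundefined(var1);
T = T(~isMissing,:);             % removes rows where the first variable is NaN
matrix = table2array(T);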
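As a rough sketch of the whole-table use of ismissing mentioned above, the snippet below measures how much is missing per variable and per row before anything is removed. The toy matrix and the names missingMask, fracMissingPerVar and Tcomplete are illustrative assumptions.

```matlab
% Hypothetical toy data: 4 observations, 3 categorical features coded as doubles.
matrix = [1 NaN 2; NaN 3 1; 2 2 NaN; 1 3 2];
T = array2table(matrix);

missingMask = ismissing(T);                 % n-by-m logical, true where a value is missing
fracMissingPerVar = mean(missingMask, 1);   % fraction of missing values in each variable
rowsComplete = ~any(missingMask, 2);        % rows with no missing value at all

Tcomplete = T(rowsComplete, :);             % keep only the fully observed rows
```

If every row has at least one missing value, as in the question, Tcomplete ends up empty, so the per-variable fractions are the more useful output there, for instance to decide which variables to impute or drop.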