Problem description
I have a set of categorical columns (strings) that I'm parsing and converting into feature Vectors to pass to an mllib classifier (random forest).
In my input data, some columns have null values. Say that in one of those columns I have p values plus a null value: how should I build my feature Vectors and the classifier's categoricalFeaturesInfo map?
- option 1: I declare p values in categoricalFeaturesInfo, and I use Double.NaN in my input Vectors?
- side question: how are NaNs handled by classifiers?
Thanks for your help.
(PS: I know about the new dataframe + pipeline + vectorindexer API, but for various reasons it doesn't fit my needs well, so I need to do this myself.)
Recommended answer
Looks like option 2 is the one. If null is, for you, actually another level of your categorical feature, just map it to some value. Note that the categorical feature levels should be mapped to 0, 1, 2, ... before you can use them properly, see here:
So nulls will be mapped to one of these numbers.
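To make that concrete, here is a minimal sketch in plain Python (no Spark import) of treating null as one more level. The helper names `build_level_index` and `encode_rows` and the sample rows are hypothetical, made up for illustration. The sketch assumes each column's levels are indexed in sorted order, with null taking the last index, so a column with p values plus null gets arity p + 1. The resulting dict has the shape categoricalFeaturesInfo expects (feature index mapped to number of categories).

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_level_index(values: Sequence[Optional[str]]) -> Dict[Optional[str], int]:
    """Map each distinct level to a consecutive index 0, 1, 2, ...

    None (null) is treated as one extra level and gets the last index.
    """
    non_null = sorted({v for v in values if v is not None})
    index: Dict[Optional[str], int] = {level: i for i, level in enumerate(non_null)}
    if any(v is None for v in values):
        index[None] = len(non_null)  # null becomes the (p + 1)-th level
    return index


def encode_rows(
    rows: Sequence[Sequence[Optional[str]]],
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Encode string rows as numeric feature lists and build the arity map."""
    n_cols = len(rows[0])
    # One level index per column, built from the values seen in that column.
    indexes = [build_level_index([row[c] for row in rows]) for c in range(n_cols)]
    features = [[float(indexes[c][row[c]]) for c in range(n_cols)] for row in rows]
    # feature index -> number of categories (null included where present)
    categorical_features_info = {c: len(indexes[c]) for c in range(n_cols)}
    return features, categorical_features_info


# Hypothetical sample data: two categorical columns, each with one null.
rows = [
    ("red", "small"),
    (None, "large"),
    ("blue", "small"),
    ("red", None),
]
features, categorical_features_info = encode_rows(rows)
print(features)
print(categorical_features_info)
```

Each inner list in `features` could then be wrapped in a dense Vector, and `categorical_features_info` passed as the categoricalFeaturesInfo argument. Note that this sketch only knows the levels seen while building the index, so a level that first appears at prediction time would need its own handling.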