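I want to use StandardScaler to scale my data. I have loaded the data into a PythonRDD, and the data appears to be sparse. To apply StandardScaler, it seems we first need to convert the data to a dense type.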
from pyspark.mllib.util import MLUtils
from pyspark.mllib.feature import StandardScaler
trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)
# apply the scaler to the data. trainFeatures is a sparse PythonRDD, so it should first be converted to a dense type
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)
# merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...
# using the scaled data, i.e., trainData1 and valData1 to train a model
...
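The code above has errors. I have two questions:
How do I convert the sparse PythonRDD trainFeatures into a dense type that can be used as input to StandardScaler?
How do I merge trainLabel and trainFeatures_scaled into new LabeledPoint objects that can be used to train a classifier (for example, a random forest)?
I still have not found any documentation or references on this.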
Best answer
To convert to dense, map over the RDD and build a DenseVector from each vector's toArray():
dense = valFeatures.map(lambda v: DenseVector(v.toArray()))
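For context on why the dense conversion matters with withMean=True: centering subtracts each column's mean from every entry, including the implicit zeros of a sparse vector, so the result is generally no longer sparse. The following is a minimal pure-Python sketch of that effect, not Spark code. The rows, the to_dense helper, and all variable names are hypothetical and only mimic what toArray() and mean-centering do.

```python
from statistics import mean

# Hypothetical sparse rows as (size, {index: value}),
# mirroring the shape of SparseVector(size, {...}).
sparse_rows = [
    (3, {0: 1.0}),
    (3, {1: 2.0, 2: 4.0}),
    (3, {0: 3.0, 2: 5.0}),
]


def to_dense(size, entries):
    """Expand a sparse row into a full list, similar in spirit to v.toArray()."""
    return [entries.get(i, 0.0) for i in range(size)]


dense_rows = [to_dense(size, entries) for size, entries in sparse_rows]

# Per-column means, the quantity that mean-centering subtracts.
col_means = [mean(col) for col in zip(*dense_rows)]

# Centering touches every entry, so the implicit zeros become nonzero.
centered = [[x - m for x, m in zip(row, col_means)] for row in dense_rows]

zeros_before = sum(x == 0.0 for row in dense_rows for x in row)
zeros_after = sum(x == 0.0 for row in centered for x in row)
print(zeros_before, zeros_after)
```

In this toy data the zero entries disappear after centering, which is why a sparse representation does not survive this step.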
To merge the labels with the features, use zip:
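# index into the (label, features) pair; tuple-unpacking lambdas such as `lambda (l, f): ...` are a syntax error in Python 3
valLabel.zip(dense).map(lambda lf: LabeledPoint(lf[0], lf[1]))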
Regarding "python - How to convert a PythonRDD of sparse data into a dense PythonRDD", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37358865/