How to train a SparkML gradient boosting classifier given an RDD
Question
Given the following rdd:
training_rdd = rdd.select(
    # Categorical features
    col('device_os'),  # 'ios', 'android'
    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
    # label
    col('did_click').alias('label')
)
I am confused about the syntax to train a gradient boosting classifier.
However, I am unsure how to get my 4 feature columns into a vector, because VectorIndexer assumes that all the features are already in one column.
Recommended answer
You can use VectorAssembler to generate the feature vector. Please note that you will have to convert your rdd to a DataFrame first.
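from pyspark.ml.feature import VectorAssembler

vectorizer = VectorAssembler()
# Note: VectorAssembler accepts numeric, boolean, and vector columns only.
# A string column such as device_os has to be converted to a numeric
# index first (for example with StringIndexer) before it is assembled.
vectorizer.setInputCols(["device_os",
                         "30day_click_count",
                         "30day_impression_count",
                         "30day_click_through_rate"])
vectorizer.setOutputCol("features")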
Consequently, you will need to put vectorizer as the first stage in the Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorizer, ...])  # Pipeline takes its stages as a keyword argument