Problem description
I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I had textual categorical variables as well as numeric ones, I had to use a pipeline, which is something like this:
- use a StringIndexer to index the string columns
- use a one-hot encoder for all columns
- use a VectorAssembler to create the feature column containing the feature vector
Some sample code from the docs for steps 1, 2, 3:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
    #                                  outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
               "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
# - fit() computes feature statistics as needed.
# - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)
Finally, train the model.
After training and evaluation, I can use "model.featureImportances" to get the feature rankings. However, I don't get the feature/column names, only the feature indices, something like this:
print(dtModel_1.featureImportances)
(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
How do I map it back to the initial column names and the values, so that I can plot them?
Recommended answer
Extract the metadata as shown here by user6910411:
from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*dataset
        .schema["features"]
        .metadata["ml_attr"]["attrs"].values()))
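For orientation, the "ml_attr" metadata on the assembled column is, as far as I understand it, grouped by attribute type (for example "binary" for one-hot slots and "numeric" for plain columns), with each entry carrying an "idx" and a "name". The sketch below runs the same expression on a hand-made dictionary so it can be tried without Spark. The dictionary layout is an assumption about what VectorAssembler writes, and every name and index in it is invented for illustration, not taken from the dataset above.

```python
from itertools import chain

# Hypothetical metadata, shaped like dataset.schema["features"].metadata.
# All names and indices are made up for illustration.
metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 0, "name": "sexclassVec_Male"},
                {"idx": 1, "name": "sexclassVec_Female"},
            ],
            "numeric": [
                {"idx": 2, "name": "age"},
                {"idx": 3, "name": "hours_per_week"},
            ],
        },
        "num_attrs": 4,
    }
}

# Flatten the per-type lists into one list of (index, name), ordered by index.
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*metadata["ml_attr"]["attrs"].values())
)
print(attrs)
```

Each index in the resulting list lines up with a position in the feature vector, which is what lets the importances be matched to names.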
Then combine it with the feature importances:
[(name, dtModel_1.featureImportances[idx])
for idx, name in attrs
if dtModel_1.featureImportances[idx]]
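To get from these (name, importance) pairs to a plot, one option is to sort them by importance and draw a horizontal bar chart. This is a minimal sketch, assuming matplotlib is installed. The helper names `top_features` and `plot_importances`, the `top_n` parameter, and the sample `pairs` values are all hypothetical and are not results from the model above.

```python
def top_features(pairs, top_n=10):
    """Return the top_n (name, importance) pairs, largest importance first."""
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_n]


def plot_importances(pairs, top_n=10):
    """Draw a horizontal bar chart of the top_n features and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ranked = top_features(pairs, top_n)
    names = [name for name, _ in ranked]
    scores = [float(score) for _, score in ranked]

    fig, ax = plt.subplots()
    # Reverse so that the most important feature ends up at the top of the chart.
    ax.barh(names[::-1], scores[::-1])
    ax.set_xlabel("importance")
    fig.tight_layout()
    return fig


# Placeholder input; in practice this would be the list built above.
pairs = [("age", 0.2), ("sexclassVec_Male", 0.5), ("hours_per_week", 0.3)]
ranked = top_features(pairs, top_n=2)
print(ranked)
```

With a one-hot encoded model, each category value gets its own bar. To report one number per original column, the importances of the slots sharing a column prefix would have to be summed first.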