This article covers the Python 3 / shap TreeExplainer error "Exception ignored in 'array_dealloc'" and how to resolve it.

Problem description


I'm running xgboost for machine learning, and after successful completion of my machine learning using XGBClassifier, I want to make plots of the results.


A minimal working example of my input data in JSON format:
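(The sample itself isn't reproduced here, so the following is a purely hypothetical sketch of the shape the script expects, inferred from the code below: Deceased and sex are stored as the strings "True"/"False", and the remaining numeric columns, here age and bmi, are invented for illustration.)

[
  {"Deceased": "True",  "sex": "True",  "age": 73, "bmi": 31.2},
  {"Deceased": "False", "sex": "False", "age": 46, "bmi": 24.7},
  {"Deceased": "False", "sex": "True",  "age": 58, "bmi": 27.9}
]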

Following https://evgenypogorelov.com/multiclass-xgb-shap.html

My script:

import mlflow
import sys, os  # os is needed below for os.path.isfile
import json
import mlflow.sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
import xgboost
import shap
from sklearn.metrics import accuracy_score, precision_score, plot_roc_curve

def ref_to_json_file(data, filename):
    json1 = json.dumps(data)
    with open(filename, "w") as f:  # close the file handle when done
        print(json1, file=f)

def xgbclassifier_wrapper( json_file, dependent_var, output_stem):
  #https://xgboost.readthedocs.io/en/latest/parameter.html
  pandasDF = pd.read_json(json_file)
  bool_cols = ["Deceased", "sex"]#, 'Hospitalized', 'Respiratory_Support', 'sex']
  for col in bool_cols:
    pandasDF[col] = pandasDF[col]=='True'
  Y = pandasDF[dependent_var]
  X = pandasDF.drop([dependent_var], axis=1)
  
  X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
  mlflow.sklearn.autolog()

  # With autolog() enabled, all model parameters, a model score, and the fitted model are automatically logged.  
  with mlflow.start_run():
    # Set the model parameters. 
    n_estimators = 200
    colsample_bytree = 0.3
    learning_rate = 0.05
    max_depth = 6# default 6; max. depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist or gpu_hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
    #min_child_rate = 0
    gamma = 0 # default = 0; Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

    # Create and train model.
    xg_clf = xgboost.XGBClassifier( n_estimators=n_estimators, colsample_bytree=colsample_bytree, learning_rate=learning_rate, max_depth=max_depth)
    xg_clf.fit(X_train, y_train)
    # Use the model to make predictions on the test dataset.
    predictions = xg_clf.predict(X_test)
  accuracy = accuracy_score(y_test, predictions)
  pre_score  = precision_score(y_test, predictions)
  feature_importances = pd.DataFrame(xg_clf.feature_importances_, index=X.columns, columns=['importance'])
  feature_importances.to_json("data/" + output_stem + '.feature_importances.json')
  kfold = KFold(n_splits=10)
  results = cross_val_score(xg_clf, X, Y, cv=kfold)
  accuracy = results.mean() * 100
  roc = plot_roc_curve(xg_clf, X_test, y_test, name = dependent_var)
  return accuracy

json_file = 'debug.json'#"/home/con/covid_study2065/data/pat.data.array.json"
if not os.path.isfile(json_file):
    sys.exit("json file doesn't exist.")
deceased = xgbclassifier_wrapper(json_file, "Deceased", 'debug')
explainer = shap.TreeExplainer(deceased.xg_clf, model_output = "raw", feature_perturbation="interventional", data = deceased.X)

explainer = shap.TreeExplainer(deceased.xg_clf, model_output = "probability", feature_perturbation="interventional", data = deceased.X)

The error:

Exception ignored in: 'array_dealloc'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/shap/explainers/_tree.py", line 1353, in __init__
    _cext.dense_tree_update_weights(
SystemError: <class 'DeprecationWarning'> returned a result with an error set
Found a NULL input array in _cext_dense_tree_update_weights!
Traceback (most recent call last):
  File "debug.py", line 97, in <module>
    explainer = shap.TreeExplainer(deceased.xg_clf, model_output = "probability", feature_perturbation="interventional", data = deceased.X)
  File "/usr/local/lib/python3.8/dist-packages/shap/explainers/_tree.py", line 147, in __init__
    self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
  File "/usr/local/lib/python3.8/dist-packages/shap/explainers/_tree.py", line 827, in __init__
    self.trees = xgb_loader.get_trees(data=data, data_missing=data_missing)
  File "/usr/local/lib/python3.8/dist-packages/shap/explainers/_tree.py", line 1522, in get_trees
    trees.append(SingleTree({
  File "/usr/local/lib/python3.8/dist-packages/shap/explainers/_tree.py", line 1353, in __init__
    _cext.dense_tree_update_weights(
SystemError: <built-in function dense_tree_update_weights> returned NULL without setting an error

When I inspect deceased.xg_clf, the input to shap.TreeExplainer:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=200, n_jobs=1, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


Adjusting the input to XGBClassifier to the same parameters that the tutorial used, viz.

xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1.0,
         gamma=0.0, max_delta_step=0.0, min_child_weight=1.0,
         missing=None, n_jobs=-1, objective='binary:logistic', random_state=42, reg_alpha=0.0,
         reg_lambda=1.0, scale_pos_weight=1.0, tree_method='auto')

也给出了与我的参数相同的错误.

also gives the same error as my parameters.


I have literally no idea what's causing this error, and the message isn't helpful: I never did anything like array_dealloc, which I thought was a C-level thing to do.


This error also occurs when doing a parameter grid search.


I'm running Python 3.8.0 on Ubuntu 18.04 in a VM, using shap 0.38.1. The error also occurs on Python 3.8.5, and on Ubuntu 20.04.2 LTS (Focal Fossa) 64-bit with Linux kernel 5.8.0-44-generic x86_64.


Updating to shap version 0.39.0 did not help.


I tried updating to Python 3.8.8, but that made the situation even worse, because one of the dependencies of shap isn't compatible with that version:

Collecting slicer==0.0.7 (from shap)
  Could not find a version that satisfies the requirement slicer==0.0.7 (from shap) (from versions: )
No matching distribution found for slicer==0.0.7 (from shap)


I've opened an issue on their GitHub page: https://github.com/slundberg/shap/issues/1844


Also, my versions of xgboost, numpy, and scipy are all up-to-date:

Requirement already up-to-date: xgboost in /usr/local/lib/python3.8/dist-packages (1.3.3)
Requirement already satisfied, skipping upgrade: numpy in /usr/local/lib/python3.8/dist-packages (from xgboost) (1.19.5)
Requirement already satisfied, skipping upgrade: scipy in /usr/local/lib/python3.8/dist-packages (from xgboost) (1.6.1)

How can I get the shap library to run?


Or... is there some competitor to shap that I could use?

Recommended answer


The solution was that there was an error in the commands passed to TreeExplainer; the problem is that the error message was less than awesome. The root cause: the wrapper returned only the accuracy (a plain float), so deceased.xg_clf never referred to the fitted model. The fix is to return an object that carries the fitted classifier and the test split, and pass those to TreeExplainer:

import mlflow
import sys, os
import json
import mlflow.sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
import xgboost
import shap
from sklearn.metrics import accuracy_score, precision_score, plot_roc_curve

def ref_to_json_file(data, filename):
    json1 = json.dumps(data)
    with open(filename, "w") as f:  # close the file handle when done
        print(json1, file=f)

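# simple container so the fitted classifier and the test split survive past the wrapper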
class xgb_result:
  def __init__(self, xgb_result, X_test):
    self.xgb_result = xgb_result
    self.X_test     = X_test

def xgbclassifier_wrapper( json_file, dependent_var, output_stem):
  #https://xgboost.readthedocs.io/en/latest/parameter.html
  pandasDF = pd.read_json(json_file)
  bool_cols = ["Deceased", "sex"]#, 'Hospitalized', 'Respiratory_Support', 'sex']
  for col in bool_cols:
    pandasDF[col] = pandasDF[col]=='True'
  Y = pandasDF[dependent_var]
  X = pandasDF.drop([dependent_var], axis=1)
  
  X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
  mlflow.sklearn.autolog()

  # With autolog() enabled, all model parameters, a model score, and the fitted model are automatically logged.  
  with mlflow.start_run():
    # Set the model parameters. 
    n_estimators = 200
    colsample_bytree = 0.3
    learning_rate = 0.05
    max_depth = 6# default 6; max. depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist or gpu_hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
    #min_child_rate = 0
    gamma = 0 # default = 0; Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

    # Create and train model.
    xg_clf = xgboost.XGBClassifier( n_estimators=n_estimators, colsample_bytree=colsample_bytree, learning_rate=learning_rate, max_depth=max_depth)
    xg_clf.fit(X_train, y_train)
    # Use the model to make predictions on the test dataset.
    predictions = xg_clf.predict(X_test)
  accuracy = accuracy_score(y_test, predictions)
  pre_score  = precision_score(y_test, predictions)
  feature_importances = pd.DataFrame(xg_clf.feature_importances_, index=X.columns, columns=['importance'])
  feature_importances.to_json("data/" + output_stem + '.feature_importances.json')
  kfold = KFold(n_splits=10)
  results = cross_val_score(xg_clf, X, Y, cv=kfold)
  accuracy = results.mean() * 100
  roc = plot_roc_curve(xg_clf, X_test, y_test, name = dependent_var)
  return_object = xgb_result(xg_clf, X_test)
  return return_object

json_file = 'debug.json'#"/home/con/covid_study2065/data/pat.data.array.json"
if not os.path.isfile(json_file):
    sys.exit("json file doesn't exist.")
deceased = xgbclassifier_wrapper(json_file, "Deceased", 'debug')

shap_values = shap.TreeExplainer(deceased.xgb_result).shap_values(deceased.X_test)
shap_interaction_values = shap.TreeExplainer(deceased.xgb_result).shap_interaction_values(deceased.X_test)

#explainer = shap.TreeExplainer(deceased, model_output = "raw", feature_perturbation="interventional", data = deceased.X)

#explainer = shap.TreeExplainer(deceased.xg_clf, model_output = "probability", feature_perturbation="interventional", data = deceased.X)
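Since the original goal was to plot the results, a natural follow-up (a minimal sketch using shap's standard plotting API; not part of the original answer) is a summary plot of the computed SHAP values:

shap.summary_plot(shap_values, deceased.X_test, show=False)  # beeswarm summary of feature impact
plt.savefig('debug.shap_summary.png')  # hypothetical output filename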

