This is why we can only estimate:
$\widehat{CATE}(uplift) = E[Y_i|X_i = x, W_i = 1] - E[Y_i|X_i = x, W_i = 0]$, where $Y_i = Y^1_i$ if $W_i = 1$ and $Y_i = Y^0_i$ if $W_i = 0$.
Note! $W_i$ must be independent of $Y^1_i$ and $Y^0_i$ conditional on $X_i$.
There are two types of uplift models:
- Meta-learners: transform the problem and use classic machine learning models.
- Direct uplift models: algorithms that predict uplift directly.
1. Preliminary steps
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot
!pip install scikit-uplift -q
from sklift.metrics import uplift_at_k, uplift_auc_score, qini_auc_score, weighted_average_uplift
from sklift.viz import plot_uplift_preds
from sklift.models import SoloModel, TwoModels
import xgboost as xgb
# Read the csv file and store it in the train variable
train = pd.read_csv('../input/megafon-uplift-competition/train (1).csv')
We see many anonymized features X_1-X_50, a binary treatment (in object format), and a binary conversion.
# Look at the first few rows of the training data
train.head()
5 rows × 53 columns
# Print summary information about the training dataset
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 600000 non-null int64
1 treatment_group 600000 non-null object
2 X_1 600000 non-null float64
3 X_2 600000 non-null float64
4 X_3 600000 non-null float64
5 X_4 600000 non-null float64
6 X_5 600000 non-null float64
7 X_6 600000 non-null float64
8 X_7 600000 non-null float64
9 X_8 600000 non-null float64
10 X_9 600000 non-null float64
11 X_10 600000 non-null float64
12 X_11 600000 non-null float64
13 X_12 600000 non-null float64
14 X_13 600000 non-null float64
15 X_14 600000 non-null float64
16 X_15 600000 non-null float64
17 X_16 600000 non-null float64
18 X_17 600000 non-null float64
19 X_18 600000 non-null float64
20 X_19 600000 non-null float64
21 X_20 600000 non-null float64
22 X_21 600000 non-null float64
23 X_22 600000 non-null float64
24 X_23 600000 non-null float64
25 X_24 600000 non-null float64
26 X_25 600000 non-null float64
27 X_26 600000 non-null float64
28 X_27 600000 non-null float64
29 X_28 600000 non-null float64
30 X_29 600000 non-null float64
31 X_30 600000 non-null float64
32 X_31 600000 non-null float64
33 X_32 600000 non-null float64
34 X_33 600000 non-null float64
35 X_34 600000 non-null float64
36 X_35 600000 non-null float64
37 X_36 600000 non-null float64
38 X_37 600000 non-null float64
39 X_38 600000 non-null float64
40 X_39 600000 non-null float64
41 X_40 600000 non-null float64
42 X_41 600000 non-null float64
43 X_42 600000 non-null float64
44 X_43 600000 non-null float64
45 X_44 600000 non-null float64
46 X_45 600000 non-null float64
47 X_46 600000 non-null float64
48 X_47 600000 non-null float64
49 X_48 600000 non-null float64
50 X_49 600000 non-null float64
51 X_50 600000 non-null float64
52 conversion 600000 non-null int64
dtypes: float64(50), int64(2), object(1)
memory usage: 242.6+ MB
The features are not standardized or normalized.
# Descriptive statistics for the training dataset via describe()
train.describe()
8 rows × 52 columns
However, the features have most likely already been cleaned; each feature distribution looks approximately normal.
# Number of subplot rows and columns
rows, cols = 10, 5
# Create a figure with a grid of subplots and set its size
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 25))
# Set the figure background to white
f.set_facecolor("#fff")
# Start with feature number 1
n_feat = 1
# Iterate over the rows of the grid
for row in tqdm(range(rows)):
    # Iterate over the columns of the grid
    for col in range(cols):
        try:
            # Draw a kernel density estimate with fill, transparency, line width, and edge color
            sns.kdeplot(x=f'X_{n_feat}', fill=True, alpha=1, linewidth=3,
                        edgecolor="#264653", data=train, ax=axs[row, col], color='w')
            # Style the subplot: dark-green background with some transparency
            axs[row, col].patch.set_facecolor("#619b8a")
            axs[row, col].patch.set_alpha(0.8)
            # Set the subplot grid color and transparency
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError:  # hide any trailing empty subplot
            axs[row, col].set_visible(False)
        # Move on to the next feature
        n_feat += 1
# Display the figure
f.show()
Just to be sure, let's look at the QQ plots.
# Number of subplot rows and columns
rows, cols = 10, 5
# Create a figure with a grid of subplots
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 25))
# Set the figure background to white
f.set_facecolor("#fff")
# Start with feature number 1
n_feat = 1
# Iterate over the rows of the grid
for row in tqdm(range(rows)):
    # Iterate over the columns of the grid
    for col in range(cols):
        try:
            # Draw a QQ plot of the feature against the normal distribution
            qqplot(train[f'X_{n_feat}'], ax=axs[row, col], line='q')
            # Set the grid color
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError:  # hide any trailing empty subplot
            axs[row, col].set_visible(False)
        # Move on to the next feature
        n_feat += 1
# Display the figure
f.show()
Next, let's focus on modeling.
2. Metrics
Since we do not have the true uplift before the study, we cannot evaluate meta-learners with classic machine learning metrics. However, we still need to compare models and understand their accuracy.
1. Uplift@k
All we need to do is sort the objects by predicted uplift (in descending order) and compute the difference between the mean target (Y) in the treatment and control groups:
$Uplift@k = mean(Y^{treatment}@k) - mean(Y^{control}@k)$
$Y@k$ is the target variable among the top k% of objects.
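To make the definition concrete, here is a minimal numpy sketch of Uplift@k under an overall ranking strategy. The helper name uplift_at_k_manual and the arrays uplift, y, w (predicted uplift, observed target, treatment flag) are my own; later we use sklift's uplift_at_k instead.
def uplift_at_k_manual(uplift, y, w, k=0.3):
    # Sort objects by predicted uplift, descending, and keep the top k%
    top = np.argsort(-uplift)[:int(len(uplift) * k)]
    y_top, w_top = y[top], w[top]
    # Mean response of treated minus mean response of control in the top k%
    return y_top[w_top == 1].mean() - y_top[w_top == 0].mean()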
2. Uplift by percentile (decile)
The same approach, but here we compute the difference separately for each decile.
Using uplift by percentile, we can compute the weighted average uplift:
$\text{weighted average uplift} = \frac{\sum_i N^T_i \cdot uplift_i}{\sum_i N^T_i}$
$N^T_i$ is the size of the treatment group in the i-th percentile.
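A sketch of uplift by decile and the weighted average uplift, again with hypothetical arrays uplift, y, w (below we rely on sklift's weighted_average_uplift instead):
def weighted_average_uplift_manual(uplift, y, w, n_bins=10):
    df = pd.DataFrame({'uplift': uplift, 'y': y, 'w': w})
    # Rank objects by predicted uplift and cut the ranking into deciles
    df = df.sort_values('uplift', ascending=False).reset_index(drop=True)
    df['bin'] = pd.qcut(df.index, n_bins, labels=False)
    per_bin = df.groupby('bin').apply(lambda g: pd.Series({
        'uplift_i': g.loc[g.w == 1, 'y'].mean() - g.loc[g.w == 0, 'y'].mean(),
        'n_treat': float((g.w == 1).sum())}))
    # Weight each decile's uplift by its treatment-group size
    return (per_bin.uplift_i * per_bin.n_treat).sum() / per_bin.n_treat.sum()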
3. Uplift curve and AUUC
The uplift curve is a cumulative uplift function of the number of targeted objects:
$\text{uplift curve}_t = \left(\frac{Y^T_t}{N^T_t} - \frac{Y^C_t}{N^C_t}\right)(N^T_t + N^C_t)$
where $t$ is the cumulative number of objects and $N^T_t$, $N^C_t$ are the sizes of the treatment (T) and control (C) groups among them.
AUUC, the area under the uplift curve, is the area between the model's curve and the random uplift curve, normalized by the area under the ideal uplift curve.
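A direct translation of the formula into numpy might look like this (a sketch only; sklift's uplift_auc_score handles the integration and normalization):
def uplift_curve_manual(uplift, y, w):
    order = np.argsort(-uplift)          # rank objects by predicted uplift
    y_s, w_s = y[order], w[order]
    n_t = np.cumsum(w_s)                 # N^T_t: treated among the first t objects
    n_c = np.cumsum(1 - w_s)             # N^C_t: controls among the first t objects
    y_t = np.cumsum(y_s * w_s)           # Y^T_t: cumulative treated responses
    y_c = np.cumsum(y_s * (1 - w_s))     # Y^C_t: cumulative control responses
    with np.errstate(divide='ignore', invalid='ignore'):
        return (y_t / n_t - y_c / n_c) * (n_t + n_c)  # NaN until both groups appear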
4. Qini curve and AUQC
The Qini curve is another cumulative approach:
$\text{qini curve}_t = Y^T_t - \frac{Y^C_t N^T_t}{N^C_t}$
AUQC, or the Qini coefficient, is the area between the model's Qini curve and the random Qini curve, normalized by the area under the ideal Qini curve.
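The same cumulative quantities give the Qini curve directly; a sketch under the same assumptions (sklift's qini_auc_score computes the normalized coefficient):
def qini_curve_manual(uplift, y, w):
    order = np.argsort(-uplift)
    y_s, w_s = y[order], w[order]
    y_t, y_c = np.cumsum(y_s * w_s), np.cumsum(y_s * (1 - w_s))
    n_t, n_c = np.cumsum(w_s), np.cumsum(1 - w_s)
    with np.errstate(divide='ignore', invalid='ignore'):
        return y_t - y_c * n_t / n_c     # NaN until the first control object appears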
train.columns
Index(['id', 'treatment_group', 'X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6',
'X_7', 'X_8', 'X_9', 'X_10', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15',
'X_16', 'X_17', 'X_18', 'X_19', 'X_20', 'X_21', 'X_22', 'X_23', 'X_24',
'X_25', 'X_26', 'X_27', 'X_28', 'X_29', 'X_30', 'X_31', 'X_32', 'X_33',
'X_34', 'X_35', 'X_36', 'X_37', 'X_38', 'X_39', 'X_40', 'X_41', 'X_42',
'X_43', 'X_44', 'X_45', 'X_46', 'X_47', 'X_48', 'X_49', 'X_50',
'conversion'],
dtype='object')
# Get the unique values of the 'treatment_group' column
train['treatment_group'].unique()
array(['control', 'treatment'], dtype=object)
# Encode 'treatment_group' as 1 for 'treatment' and 0 otherwise
train['treatment_group'] = train['treatment_group'].apply(lambda x: 1 if x=='treatment' else 0)
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split
# Keep only the first 100000 rows to speed things up
train = train[:100000]
# Select the feature columns X_1 ... X_50
X = train[[f'X_{i}' for i in range(1, 51)]]
# Select the treatment column
treatment = train['treatment_group']
# Select the conversion target
y = train['conversion']
# Split X, y, and treatment into training and validation sets
X_train, X_val, y_train, y_val, treatment_train, treatment_val = train_test_split(X, y, treatment, test_size=0.2)
3. Meta-learners
3.1 S-Learner
The main idea of the S-learner is to train a single model on the features, the binary treatment (W), and the binary target action (Y), and then predict on the test data twice: once with W fixed to 1 and once with W fixed to 0. The difference between the two predictions is the uplift.
The good news is that we can use classic machine learning classifiers! Let's do it with xgboost. You can also try other classifiers and compare the results.
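For intuition, here is a manual sketch of what the S-learner does under the hood (the column name 'W' and the variable names are mine; below we use sklift's SoloModel instead):
# Train one model on the features plus the treatment flag
xgb_manual = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_manual.fit(X_train.assign(W=treatment_train), y_train)
# Predict twice, with the treatment flag forced to 1 and to 0
p1 = xgb_manual.predict_proba(X_val.assign(W=1))[:, 1]
p0 = xgb_manual.predict_proba(X_val.assign(W=0))[:, 1]
uplift_manual = p1 - p0  # the difference is the uplift estimate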
# Define a function get_metrics that takes y_val, uplift, and treatment_val
def get_metrics(y_val, uplift, treatment_val):
    # Uplift at the top 30%: rank control and treatment separately ('by_group') or together ('overall')
    upliftk = uplift_at_k(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='by_group', k=0.3)
    upliftk_all = uplift_at_k(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='overall', k=0.3)
    # Qini coefficient (the default strategy is overall ranking)
    qini_coef = qini_auc_score(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Area under the uplift curve
    uplift_auc = uplift_auc_score(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Weighted average uplift, by group and overall
    wau = weighted_average_uplift(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='by_group')
    wau_all = weighted_average_uplift(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Print the results
    print(f'uplift at top 30% by group: {upliftk:.2f} by overall: {upliftk_all:.2f}\n',
          f'Weighted average uplift by group: {wau:.2f} by overall: {wau_all:.2f}\n',
          f'AUUC by group: {uplift_auc:.2f}\n',
          f'AUQC by group: {qini_coef:.2f}\n')
    # Return a dictionary with the metric values
    return {'uplift@30': upliftk, 'uplift@30_all': upliftk_all, 'AUQC': qini_coef, 'AUUC': uplift_auc,
            'WAU': wau, 'WAU_all': wau_all}
# Create an XGBoost classifier with a fixed random seed and logistic objective
xgb_sm = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a SoloModel (S-learner) with the XGBoost classifier as estimator
sm = SoloModel(estimator=xgb_sm)
# Fit the SoloModel on the training features, target, and treatment
sm = sm.fit(X_train, y_train, treatment_train, estimator_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics on the validation set
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.18 by overall: 0.18
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.15
AUQC by group: 0.21
3.2 T-Learner
The main idea of the T-learner is to train two independent models: one on the treated observations (T) and one on the control observations (C). The uplift is the difference between the predictions of model T and model C.
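Sketched manually (variable names are mine; the sklift TwoModels call below is equivalent):
# Fit one model on treated rows only and one on control rows only
m_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_T.fit(X_train[treatment_train == 1], y_train[treatment_train == 1])
m_C.fit(X_train[treatment_train == 0], y_train[treatment_train == 0])
# Uplift is the difference between the two predicted probabilities
uplift_manual = m_T.predict_proba(X_val)[:, 1] - m_C.predict_proba(X_val)[:, 1]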
# Initialize two xgboost classifiers, one for the treatment group and one for the control group
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Initialize the TwoModels class with the treatment and control classifiers
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C)
# Fit the models on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.13
AUQC by group: 0.18
The results are slightly worse than the S-learner's.
3.3 T-Learner with dependent models
The main idea of the T-learner with dependent models is to use the predictions (probabilities) of the opposite model as an extra feature when training the T or C model.
This approach is adopted from the classifier chain method: https://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html
There are two possible implementations: the T model conditioned on C-probs, and the C model conditioned on T-probs (a manual sketch of the first variant follows the formulas):
- $uplift_i = P^T(x_i, P^C(x_i)) - P^C(x_i)$
- $uplift_i = P^T(x_i) - P^C(x_i, P^T(x_i))$
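As a sketch of the first variant (what sklift calls method='ddr_control', used below), the control model is fitted first and its probabilities are fed to the treatment model as an extra feature (the column name 'pc' and variable names are mine):
# Fit the control model and score every object with P^C(x)
m_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_C.fit(X_train[treatment_train == 0], y_train[treatment_train == 0])
pc_train = m_C.predict_proba(X_train)[:, 1]
# Fit the treatment model on treated rows, with P^C(x) as an extra feature
m_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_T.fit(X_train.assign(pc=pc_train)[treatment_train == 1], y_train[treatment_train == 1])
# Uplift: P^T(x, P^C(x)) - P^C(x)
pc_val = m_C.predict_proba(X_val)[:, 1]
uplift_manual = m_T.predict_proba(X_val.assign(pc=pc_val))[:, 1] - pc_val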
The first approach:
# Create two XGBClassifier objects: one treatment model, one control model
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a TwoModels object with method='ddr_control'
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C, method='ddr_control')
# Fit the TwoModels object on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.12
AUQC by group: 0.18
The second approach:
# Create two XGBoost classifiers for the treatment and control groups
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a TwoModels object with method='ddr_treatment'
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C, method='ddr_treatment')
# Fit the TwoModels object on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.13
AUQC by group: 0.19