This is why we can only estimate:
$\widehat{CATE}(uplift) = E[Y_i|X_i = x, W_i = 1] - E[Y_i|X_i = x, W_i = 0]$, where $Y_i = Y^1_i$ if $W_i = 1$ and $Y_i = Y^0_i$ if $W_i = 0$.
Note! $W_i$ must be independent of $Y^1_i$ and $Y^0_i$ conditional on $X_i$.
There are two types of uplift models:
- Meta-learners: transform the problem and use classic machine learning models.
- Direct uplift models: algorithms that predict uplift directly.
1. Preliminary steps
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot
!pip install scikit-uplift -q
from sklift.metrics import uplift_at_k, uplift_auc_score, qini_auc_score, weighted_average_uplift
from sklift.viz import plot_uplift_preds
from sklift.models import SoloModel, TwoModels
import xgboost as xgb
# Read the csv file and store it in the train variable
train = pd.read_csv('../input/megafon-uplift-competition/train (1).csv')
We see many anonymized features X_1-X_50, a binary treatment (in object format), and a binary conversion.
# Look at the first few rows of the training data
train.head()
5 rows × 53 columns
# Print summary information about the training dataset
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 600000 non-null int64
1 treatment_group 600000 non-null object
2 X_1 600000 non-null float64
3 X_2 600000 non-null float64
4 X_3 600000 non-null float64
5 X_4 600000 non-null float64
6 X_5 600000 non-null float64
7 X_6 600000 non-null float64
8 X_7 600000 non-null float64
9 X_8 600000 non-null float64
10 X_9 600000 non-null float64
11 X_10 600000 non-null float64
12 X_11 600000 non-null float64
13 X_12 600000 non-null float64
14 X_13 600000 non-null float64
15 X_14 600000 non-null float64
16 X_15 600000 non-null float64
17 X_16 600000 non-null float64
18 X_17 600000 non-null float64
19 X_18 600000 non-null float64
20 X_19 600000 non-null float64
21 X_20 600000 non-null float64
22 X_21 600000 non-null float64
23 X_22 600000 non-null float64
24 X_23 600000 non-null float64
25 X_24 600000 non-null float64
26 X_25 600000 non-null float64
27 X_26 600000 non-null float64
28 X_27 600000 non-null float64
29 X_28 600000 non-null float64
30 X_29 600000 non-null float64
31 X_30 600000 non-null float64
32 X_31 600000 non-null float64
33 X_32 600000 non-null float64
34 X_33 600000 non-null float64
35 X_34 600000 non-null float64
36 X_35 600000 non-null float64
37 X_36 600000 non-null float64
38 X_37 600000 non-null float64
39 X_38 600000 non-null float64
40 X_39 600000 non-null float64
41 X_40 600000 non-null float64
42 X_41 600000 non-null float64
43 X_42 600000 non-null float64
44 X_43 600000 non-null float64
45 X_44 600000 non-null float64
46 X_45 600000 non-null float64
47 X_46 600000 non-null float64
48 X_47 600000 non-null float64
49 X_48 600000 non-null float64
50 X_49 600000 non-null float64
51 X_50 600000 non-null float64
52 conversion 600000 non-null int64
dtypes: float64(50), int64(2), object(1)
memory usage: 242.6+ MB
The features are not standardized or normalized.
# Descriptive statistics for the training dataset via describe()
train.describe()
8 rows × 52 columns
However, the features have most likely already been cleaned; each feature distribution looks approximately normal.
# Number of subplot rows and columns
rows, cols = 10, 5
# Create a figure with a grid of subplots and set its size
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 25))
# Set the figure background to white
f.set_facecolor("#fff")
# Start with feature number 1
n_feat = 1
# Iterate over the rows of the grid
for row in tqdm(range(rows)):
    # Iterate over the columns of the grid
    for col in range(cols):
        try:
            # Draw a kernel density estimate with fill, transparency, line width, and edge color
            sns.kdeplot(x=f'X_{n_feat}', fill=True, alpha=1, linewidth=3,
                        edgecolor="#264653", data=train, ax=axs[row, col], color='w')
            # Style the subplot: dark-green background with some transparency
            axs[row, col].patch.set_facecolor("#619b8a")
            axs[row, col].patch.set_alpha(0.8)
            # Set the subplot grid color and transparency
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError:  # hide any trailing empty subplot
            axs[row, col].set_visible(False)
        # Move on to the next feature
        n_feat += 1
# Display the figure
f.show()
Just to be sure, let's look at the QQ plots.
# Number of subplot rows and columns
rows, cols = 10, 5
# Create a figure with a grid of subplots
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 25))
# Set the figure background to white
f.set_facecolor("#fff")
# Start with feature number 1
n_feat = 1
# Iterate over the rows of the grid
for row in tqdm(range(rows)):
    # Iterate over the columns of the grid
    for col in range(cols):
        try:
            # Draw a QQ plot of the feature against the normal distribution
            qqplot(train[f'X_{n_feat}'], ax=axs[row, col], line='q')
            # Set the grid color
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError:  # hide any trailing empty subplot
            axs[row, col].set_visible(False)
        # Move on to the next feature
        n_feat += 1
# Display the figure
f.show()
Next, let's focus on modeling.
2. Metrics
Since we do not have the true uplift before the study, we cannot evaluate meta-learners with classic machine learning metrics. However, we still need to compare models and understand their accuracy.
1. Uplift@k
All we need to do is sort the objects by predicted uplift (in descending order) and compute the difference between the mean target (Y) in the treatment and control groups:
$Uplift@k = mean(Y^{treatment}@k) - mean(Y^{control}@k)$
$Y@k$ is the target variable among the top k% of objects.
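To make the definition concrete, here is a minimal numpy sketch of Uplift@k under an overall ranking strategy. The helper name uplift_at_k_manual and the arrays uplift, y, w (predicted uplift, observed target, treatment flag) are my own; later we use sklift's uplift_at_k instead.
def uplift_at_k_manual(uplift, y, w, k=0.3):
    # Sort objects by predicted uplift, descending, and keep the top k%
    top = np.argsort(-uplift)[:int(len(uplift) * k)]
    y_top, w_top = y[top], w[top]
    # Mean response of treated minus mean response of control in the top k%
    return y_top[w_top == 1].mean() - y_top[w_top == 0].mean()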
2. Uplift by percentile (decile)
The same approach, but here we compute the difference separately for each decile.
Using uplift by percentile, we can compute the weighted average uplift:
$\text{weighted average uplift} = \frac{\sum_i N^T_i \cdot uplift_i}{\sum_i N^T_i}$
$N^T_i$ is the size of the treatment group in the i-th percentile.
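A sketch of uplift by decile and the weighted average uplift, again with hypothetical arrays uplift, y, w (below we rely on sklift's weighted_average_uplift instead):
def weighted_average_uplift_manual(uplift, y, w, n_bins=10):
    df = pd.DataFrame({'uplift': uplift, 'y': y, 'w': w})
    # Rank objects by predicted uplift and cut the ranking into deciles
    df = df.sort_values('uplift', ascending=False).reset_index(drop=True)
    df['bin'] = pd.qcut(df.index, n_bins, labels=False)
    per_bin = df.groupby('bin').apply(lambda g: pd.Series({
        'uplift_i': g.loc[g.w == 1, 'y'].mean() - g.loc[g.w == 0, 'y'].mean(),
        'n_treat': float((g.w == 1).sum())}))
    # Weight each decile's uplift by its treatment-group size
    return (per_bin.uplift_i * per_bin.n_treat).sum() / per_bin.n_treat.sum()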
3. Uplift curve and AUUC
The uplift curve is a cumulative uplift function of the number of targeted objects:
$\text{uplift curve}_t = \left(\frac{Y^T_t}{N^T_t} - \frac{Y^C_t}{N^C_t}\right)(N^T_t + N^C_t)$
where $t$ is the cumulative number of objects and $N^T_t$, $N^C_t$ are the sizes of the treatment (T) and control (C) groups among them.
AUUC, the area under the uplift curve, is the area between the model's curve and the random uplift curve, normalized by the area under the ideal uplift curve.
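A direct translation of the formula into numpy might look like this (a sketch only; sklift's uplift_auc_score handles the integration and normalization):
def uplift_curve_manual(uplift, y, w):
    order = np.argsort(-uplift)          # rank objects by predicted uplift
    y_s, w_s = y[order], w[order]
    n_t = np.cumsum(w_s)                 # N^T_t: treated among the first t objects
    n_c = np.cumsum(1 - w_s)             # N^C_t: controls among the first t objects
    y_t = np.cumsum(y_s * w_s)           # Y^T_t: cumulative treated responses
    y_c = np.cumsum(y_s * (1 - w_s))     # Y^C_t: cumulative control responses
    with np.errstate(divide='ignore', invalid='ignore'):
        return (y_t / n_t - y_c / n_c) * (n_t + n_c)  # NaN until both groups appear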
4. Qini curve and AUQC
The Qini curve is another cumulative approach:
$\text{qini curve}_t = Y^T_t - \frac{Y^C_t N^T_t}{N^C_t}$
AUQC, or the Qini coefficient, is the area between the model's Qini curve and the random Qini curve, normalized by the area under the ideal Qini curve.
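The same cumulative quantities give the Qini curve directly; a sketch under the same assumptions (sklift's qini_auc_score computes the normalized coefficient):
def qini_curve_manual(uplift, y, w):
    order = np.argsort(-uplift)
    y_s, w_s = y[order], w[order]
    y_t, y_c = np.cumsum(y_s * w_s), np.cumsum(y_s * (1 - w_s))
    n_t, n_c = np.cumsum(w_s), np.cumsum(1 - w_s)
    with np.errstate(divide='ignore', invalid='ignore'):
        return y_t - y_c * n_t / n_c     # NaN until the first control object appears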
train.columns
Index(['id', 'treatment_group', 'X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6',
'X_7', 'X_8', 'X_9', 'X_10', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15',
'X_16', 'X_17', 'X_18', 'X_19', 'X_20', 'X_21', 'X_22', 'X_23', 'X_24',
'X_25', 'X_26', 'X_27', 'X_28', 'X_29', 'X_30', 'X_31', 'X_32', 'X_33',
'X_34', 'X_35', 'X_36', 'X_37', 'X_38', 'X_39', 'X_40', 'X_41', 'X_42',
'X_43', 'X_44', 'X_45', 'X_46', 'X_47', 'X_48', 'X_49', 'X_50',
'conversion'],
dtype='object')
# Get the unique values of the 'treatment_group' column
train['treatment_group'].unique()
array(['control', 'treatment'], dtype=object)
# Encode 'treatment_group' as 1 for 'treatment' and 0 otherwise
train['treatment_group'] = train['treatment_group'].apply(lambda x: 1 if x=='treatment' else 0)
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split
# Keep only the first 100000 rows to speed things up
train = train[:100000]
# Select the feature columns X_1 ... X_50
X = train[[f'X_{i}' for i in range(1, 51)]]
# Select the treatment column
treatment = train['treatment_group']
# Select the conversion target
y = train['conversion']
# Split X, y, and treatment into training and validation sets
X_train, X_val, y_train, y_val, treatment_train, treatment_val = train_test_split(X, y, treatment, test_size=0.2)
3. Meta-learners
3.1 S-Learner
The main idea of the S-learner is to train a single model on the features, the binary treatment (W), and the binary target action (Y), and then predict on the test data twice: once with W fixed to 1 and once with W fixed to 0. The difference between the two predictions is the uplift.
The good news is that we can use classic machine learning classifiers! Let's do it with xgboost. You can also try other classifiers and compare the results.
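For intuition, here is a manual sketch of what the S-learner does under the hood (the column name 'W' and the variable names are mine; below we use sklift's SoloModel instead):
# Train one model on the features plus the treatment flag
xgb_manual = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_manual.fit(X_train.assign(W=treatment_train), y_train)
# Predict twice, with the treatment flag forced to 1 and to 0
p1 = xgb_manual.predict_proba(X_val.assign(W=1))[:, 1]
p0 = xgb_manual.predict_proba(X_val.assign(W=0))[:, 1]
uplift_manual = p1 - p0  # the difference is the uplift estimate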
# Define a function get_metrics that takes y_val, uplift, and treatment_val
def get_metrics(y_val, uplift, treatment_val):
    # Uplift at the top 30%: rank control and treatment separately ('by_group') or together ('overall')
    upliftk = uplift_at_k(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='by_group', k=0.3)
    upliftk_all = uplift_at_k(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='overall', k=0.3)
    # Qini coefficient (the default strategy is overall ranking)
    qini_coef = qini_auc_score(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Area under the uplift curve
    uplift_auc = uplift_auc_score(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Weighted average uplift, by group and overall
    wau = weighted_average_uplift(y_true=y_val, uplift=uplift, treatment=treatment_val, strategy='by_group')
    wau_all = weighted_average_uplift(y_true=y_val, uplift=uplift, treatment=treatment_val)
    # Print the results
    print(f'uplift at top 30% by group: {upliftk:.2f} by overall: {upliftk_all:.2f}\n',
          f'Weighted average uplift by group: {wau:.2f} by overall: {wau_all:.2f}\n',
          f'AUUC by group: {uplift_auc:.2f}\n',
          f'AUQC by group: {qini_coef:.2f}\n')
    # Return a dictionary with the metric values
    return {'uplift@30': upliftk, 'uplift@30_all': upliftk_all, 'AUQC': qini_coef, 'AUUC': uplift_auc,
            'WAU': wau, 'WAU_all': wau_all}
# Create an XGBoost classifier with a fixed random seed and logistic objective
xgb_sm = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a SoloModel (S-learner) with the XGBoost classifier as estimator
sm = SoloModel(estimator=xgb_sm)
# Fit the SoloModel on the training features, target, and treatment
sm = sm.fit(X_train, y_train, treatment_train, estimator_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics on the validation set
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.18 by overall: 0.18
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.15
AUQC by group: 0.21
3.2 T-Learner
The main idea of the T-learner is to train two independent models: one on the treated observations (T) and one on the control observations (C). The uplift is the difference between the predictions of model T and model C.
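Sketched manually (variable names are mine; the sklift TwoModels call below is equivalent):
# Fit one model on treated rows only and one on control rows only
m_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_T.fit(X_train[treatment_train == 1], y_train[treatment_train == 1])
m_C.fit(X_train[treatment_train == 0], y_train[treatment_train == 0])
# Uplift is the difference between the two predicted probabilities
uplift_manual = m_T.predict_proba(X_val)[:, 1] - m_C.predict_proba(X_val)[:, 1]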
# Initialize two xgboost classifiers, one for the treatment group and one for the control group
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Initialize the TwoModels class with the treatment and control classifiers
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C)
# Fit the models on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.13
AUQC by group: 0.18
The results are slightly worse than the S-learner's.
3.3 T-Learner with dependent models
The main idea of the T-learner with dependent models is to use the predictions (probabilities) of the opposite model as an extra feature when training the T or C model.
This approach is adopted from the classifier chain method: https://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html
There are two possible implementations: the T model conditioned on C-probs, and the C model conditioned on T-probs (a manual sketch of the first variant follows the formulas):
- $uplift_i = P^T(x_i, P^C(x_i)) - P^C(x_i)$
- $uplift_i = P^T(x_i) - P^C(x_i, P^T(x_i))$
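As a sketch of the first variant (what sklift calls method='ddr_control', used below), the control model is fitted first and its probabilities are fed to the treatment model as an extra feature (the column name 'pc' and variable names are mine):
# Fit the control model and score every object with P^C(x)
m_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_C.fit(X_train[treatment_train == 0], y_train[treatment_train == 0])
pc_train = m_C.predict_proba(X_train)[:, 1]
# Fit the treatment model on treated rows, with P^C(x) as an extra feature
m_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
m_T.fit(X_train.assign(pc=pc_train)[treatment_train == 1], y_train[treatment_train == 1])
# Uplift: P^T(x, P^C(x)) - P^C(x)
pc_val = m_C.predict_proba(X_val)[:, 1]
uplift_manual = m_T.predict_proba(X_val.assign(pc=pc_val))[:, 1] - pc_val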
The first approach:
# Create two XGBClassifier objects: one treatment model, one control model
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a TwoModels object with method='ddr_control'
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C, method='ddr_control')
# Fit the TwoModels object on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.12
AUQC by group: 0.18
The second approach:
# Create two XGBoost classifiers for the treatment and control groups
xgb_T = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
xgb_C = xgb.XGBClassifier(random_state=42, objective='binary:logistic', use_label_encoder=False)
# Create a TwoModels object with method='ddr_treatment'
sm = TwoModels(estimator_trmnt=xgb_T, estimator_ctrl=xgb_C, method='ddr_treatment')
# Fit the TwoModels object on the training data
sm = sm.fit(X_train, y_train, treatment_train, estimator_trmnt_fit_params={}, estimator_ctrl_fit_params={})
# Predict uplift on the validation set
uplift_sm = sm.predict(X_val)
# Compute the evaluation metrics
res = get_metrics(y_val, uplift_sm, treatment_val)
uplift at top 30% by group: 0.17 by overall: 0.17
Weighted average uplift by group: 0.04 by overall: 0.04
AUUC by group: 0.13
AUQC by group: 0.19