I have a Pandas DataFrame that looks roughly like this:
cli_id | X1 | X2 | X3 | ... | Xn | Y |
----------------------------------------
123 | 1 | A | XX | ... | 4 | 0.1 |
456 | 2 | B | XY | ... | 5 | 0.2 |
789 | 1 | B | XY | ... | 5 | 0.3 |
101 | 2 | A | XX | ... | 4 | 0.1 |
...
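I have a client ID, a few categorical attributes, and Y, the probability of an event, which takes values from 0 to 1 in steps of 0.1.
I need to take a stratified sample of size 200 within every group of Y (so 10 folds).
I often use this to take a stratified sample when splitting into train/test: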
from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # Current scikit-learn API: y is passed to split(), not to the constructor
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test
But in this case I don't know how to modify it.
Best answer
I'm not sure whether this is what you mean:
import numpy as np

strats = []
for k in range(11):
    y_val = k * 0.1
    # np.isclose avoids float mismatches such as 3 * 0.1 != 0.3
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append(dummy_df.sample(200))
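As an alternative to the explicit loop, the same per-Y draw can be sketched with a groupby. This is only an illustration on made-up data: `demo_df` and its columns are hypothetical stand-ins for the asker's frame, and `DataFrameGroupBy.sample` requires pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5,000 clients with Y in {0.0, 0.1, ..., 1.0}.
rng = np.random.default_rng(0)
n_rows = 5_000
demo_df = pd.DataFrame({
    "cli_id": np.arange(n_rows),
    "Y": rng.integers(0, 11, size=n_rows) / 10,
})

# Draw 200 rows from every distinct Y value in a single call.
# Without replace=True this raises if a group has fewer than 200 rows.
per_y_sample = demo_df.groupby("Y").sample(n=200, random_state=0)
print(per_y_sample["Y"].value_counts().sort_index())
```

Grouping on the stored Y values sidesteps the float comparison entirely, because no Y value has to be recomputed as `k * 0.1`.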
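In the loop, dummy_df holds only the rows with the desired Y value, and 200 of them are then sampled.
OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here is how I would go about it: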
First, I would get a histogram of X1:
hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))
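We now have a histogram with nbins bins. The strategy is to draw a certain number of rows depending on their value of X1: more from bins with more observations and fewer from bins with fewer observations, so that the structure of X is preserved. In particular, the relative contribution of each bin should be: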
rel = [float(i) / sum(hist) for i in hist]
This will be something like
[0.1, 0.2, 0.1, 0.3, 0.3]
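To make the histogram step concrete, here is a small sketch with made-up numbers. The `x1` array and `nbins = 4` are assumptions for illustration only; in the answer the values would come from `your_df['X1']`.

```python
import numpy as np

# Hypothetical X1 values standing in for your_df['X1'].
x1 = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5])
nbins = 4

# nbins + 1 edges delimit nbins equal-width bins between min and max.
edges = np.linspace(x1.min(), x1.max(), nbins + 1)
hist, edges = np.histogram(x1, bins=edges)

# Share of all rows that falls into each bin.
rel = hist / hist.sum()
print(hist, rel)
```

Note that `np.histogram` treats every bin as half-open, `[left, right)`, except the last one, which also includes its right edge. That is why the 4s and 5s end up in the same bin here.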
If we want 200 samples, we need to draw:
draws_in_bin = [int(i*200) for i in rel]
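One caveat with the `int()` truncation: the per-bin counts can add up to fewer than 200. If the exact total matters, a largest-remainder rounding is one possible fix. The helper below, `allocate_draws`, is a hypothetical name and not part of the original answer.

```python
import numpy as np

def allocate_draws(hist, n_total):
    """Split n_total draws across bins in proportion to hist."""
    hist = np.asarray(hist, dtype=float)
    exact = hist / hist.sum() * n_total      # ideal, possibly fractional
    draws = np.floor(exact).astype(int)      # same effect as int() here
    shortfall = n_total - draws.sum()
    # Give the leftover draws to the bins with the largest fractional part.
    order = np.argsort(exact - draws)[::-1]
    draws[order[:shortfall]] += 1
    return draws

# Three equally full bins: plain truncation loses two draws.
truncated = [int(r * 200) for r in [1 / 3] * 3]
balanced = allocate_draws([1, 1, 1], 200)
print(sum(truncated), balanced.sum())
```

With three equal bins, truncation yields 66 draws each (198 in total), while the helper hands the two missing draws back to two of the bins.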
Now we know how many observations to draw from each bin:
strats = []
for k in range(11):
    y_val = k * 0.1
    # get a dataframe for every value of Y (np.isclose tolerates float error)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    bin_strat = []
    for i, (left_edge, right_edge, n_draws) in enumerate(
            zip(edges[:-1], edges[1:], draws_in_bin)):
        # bins are [left, right), except the last one, which also includes
        # its right edge; this matches how np.histogram counted the rows
        last = i == len(draws_in_bin) - 1
        upper = dummy_df['X1'] <= right_edge if last else dummy_df['X1'] < right_edge
        bin_df = dummy_df[(dummy_df['X1'] >= left_edge) & upper]
        # this takes the right number of draws out of the X1 bin where we
        # currently are; sample() raises if the bin holds fewer rows
        bin_strat.append(bin_df.sample(n_draws))
    # every element of bin_strat is a dataframe whose number of entries
    # corresponds to the structure of draws_in_bin;
    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))
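For reference, the whole procedure can be sketched as one self-contained function on synthetic data. This is an approximation of the approach above, not the answer's exact code: `sample_by_y_and_x1`, `demo_df`, and the default `nbins=5` are hypothetical. It assumes X1 is numeric, not constant, and has no missing values. Unlike the loop above, it takes whatever rows are available when a cell is too small instead of raising, and it bins with `pd.cut`, whose intervals are closed on the right rather than on the left.

```python
import numpy as np
import pandas as pd


def sample_by_y_and_x1(df, n_per_y=200, nbins=5, seed=0):
    """Per-Y sampling that roughly preserves the overall X1 distribution."""
    # Equal-width X1 bins over the whole frame; nbins + 1 edges give nbins bins.
    edges = np.linspace(df["X1"].min(), df["X1"].max(), nbins + 1)
    # Bin index of every row (0 .. nbins - 1); include_lowest keeps the minimum.
    bin_idx = pd.cut(df["X1"], bins=edges, labels=False, include_lowest=True)
    hist = np.bincount(bin_idx.to_numpy().astype(int), minlength=nbins)
    # Target draws per bin; truncation means the total can fall short of n_per_y.
    draws_in_bin = (hist / hist.sum() * n_per_y).astype(int)

    parts = []
    for (_, b), cell in df.assign(_bin=bin_idx).groupby(["Y", "_bin"]):
        # Assumption: a sparse cell contributes what it has instead of raising.
        n = min(int(draws_in_bin[int(b)]), len(cell))
        if n > 0:
            parts.append(cell.sample(n, random_state=seed))
    return pd.concat(parts).drop(columns="_bin")


# Hypothetical data shaped like the table in the question.
rng = np.random.default_rng(0)
n_rows = 20_000
demo_df = pd.DataFrame({
    "cli_id": np.arange(n_rows),
    "X1": rng.normal(size=n_rows),
    "Y": rng.integers(0, 11, size=n_rows) / 10,
})
sample = sample_by_y_and_x1(demo_df)
print(sample.groupby("Y").size())
```

Because of the truncation and the sparse-cell rule, each Y group ends up with at most 200 rows, not necessarily exactly 200.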
On the topic of python - stratified samples from Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41035187/