Consider a population with a skewed class distribution, such as:
ErrorType Samples
1 XXXXXXXXXXXXXXX
2 XXXXXXXX
3 XX
4 XXX
5 XXXXXXXXXXXX
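I want to randomly draw 20 of the 40 samples without undersampling the classes that have few members. For example, in the case above, I would like to sample as follows: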
ErrorType Samples
1 XXXXX|XXXXXXXXXX
2 XXXXX|XXX
3 XX***|
4 XXX**|
5 XXXXX|XXXXXXX
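That is, 5 each from types 1, 2, and 5, 2 from type 3, and 3 from type 4.
This guarantees that my sample size is close to my target, i.e. 20 samples.
No class is under-represented, especially types 3 and 4.
I ended up writing some roundabout code, but I believe there is a simpler way to do this with pandas methods or some sklearn function.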
sample_size = 20  # Just for the example
# Determine the average participation per error type
avg_items = sample_size / len(df.ErrorType.unique())
value_counts = df.ErrorType.value_counts()
less_than_avg = value_counts[value_counts < avg_items]
# Shortfall left by the small classes, to be spread over the larger ones
offset = avg_items * len(less_than_avg) - sum(less_than_avg)
offset_per_item = offset / (len(value_counts) - len(less_than_avg))
# Assumption: non_act_count in the posted snippet is the target sample size
adj_avg = int(sample_size / len(value_counts) + offset_per_item)
df = df.groupby(['ErrorType'],
                group_keys=False).apply(lambda g: g.sample(min(adj_avg, len(g))))
Best answer
You can use a helper column to find the rows whose length exceeds the sample size, and then use pd.Series.sample.
For example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ErrorType':[1,2,3,4,5],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})
# Cap each row's sample count at 5; rows with fewer than 5 items keep their own length
df['new'] = df['Samples'].str.len().where(df['Samples'].str.len()<5,5)
# This tells us how many samples can be drawn from each row
#0 5
#1 5
#2 3
#3 2
#4 5
#Name: new, dtype: int64
# Sample from each row using the newly obtained column (axis=1 applies the function row by row)
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(), axis=1)
0 [52, 81, 43, 60, 46]
1 [8, 7, 0, 9, 1]
2 [2, 1, 0]
3 [1, 0]
4 [29, 24, 16, 15, 69]
Name: sample2, dtype: object
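The capping step can also be written with `Series.clip`, and the draw can be made reproducible with a seeded generator. The following is a minimal sketch, not part of the original answer; the `picked` column name, the `cap` variable and the seed value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ErrorType': [1, 2, 3, 4, 5],
    'Samples': [np.arange(100), np.arange(10), np.arange(3),
                np.arange(2), np.arange(100)],
})

cap = 5  # assumed per-class cap, as in the example above

# Number of items to draw per row: the row's own length, limited to `cap`
df['new'] = df['Samples'].map(len).clip(upper=cap)

# A seeded generator makes the random draw repeatable (seed chosen arbitrarily)
rng = np.random.default_rng(0)
df['picked'] = [
    rng.choice(samples, size=n, replace=False).tolist()
    for samples, n in zip(df['Samples'], df['new'])
]
```

Drawing with `replace=False` mirrors the default behaviour of `pd.Series.sample`, so no item is picked twice within a row.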
I wrote a function that returns the per-class sample sizes based on a threshold, i.e.
def get_thres_arr(sample_size, sample_length):
    # Start from the smallest class size as the per-class cap
    thresh = sample_length.min()
    size = np.array([thresh]*len(sample_length))
    sum_of_size = sum(size)
    while sum_of_size < sample_size:
        # If the length is more than the threshold, raise that class's size to thresh+1
        size = np.where(sample_length>thresh, thresh+1, sample_length)
        sum_of_size = sum(size)
        # increment threshold
        thresh += 1
    return size
df = pd.DataFrame({'ErrorType':[1,2,3,4,5,1,7,9,4,5],
'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100),np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})
ndf = pd.DataFrame({'ErrorType':[1,2,3,4,5,6],
'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(1),np.arange(2),np.arange(100)]})
get_thres_arr(20,ndf['Samples'].str.len())
#array([5, 5, 3, 1, 2, 5])
get_thres_arr(20,df['Samples'].str.len())
#array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
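Note that `get_thres_arr` raises the cap for every large class at once, so the total can overshoot the target (the `ndf` result above sums to 21), and the loop never terminates if `sample_size` is larger than the total number of available items. If an exact total matters, a round-robin allocation is one option. This is only a sketch under that assumption; `allocate_quota` is a hypothetical helper, not a pandas or NumPy function.

```python
import numpy as np

def allocate_quota(counts, total):
    """Hand out `total` slots one at a time, cycling over classes with room left."""
    counts = np.asarray(counts)
    # Never ask for more items than exist, which also guarantees termination
    remaining = min(int(total), int(counts.sum()))
    quota = np.zeros_like(counts)
    while remaining > 0:
        # Classes that can still contribute at least one more item
        open_idx = np.flatnonzero(quota < counts)
        # In the final round only the first `remaining` open classes get a slot
        take = open_idx[:remaining]
        quota[take] += 1
        remaining -= len(take)
    return quota

# Class sizes from the question: 15, 8, 2, 3 and 12 items, target of 20
print(allocate_quota([15, 8, 2, 3, 12], 20))
```

When the last round cannot serve every open class, this version favours the classes that come first in the input; shuffling `open_idx` would be one way to remove that bias.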
Now you can use these sizes as follows:
df['new'] = get_thres_arr(20,df['Samples'].str.len())
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(),1)
0 [64, 89]
1 [4, 0]
2 [0, 1]
3 [1, 0]
4 [41, 80]
5 [25, 84]
6 [4, 0]
7 [2, 0]
8 [1, 0]
9 [34, 1]
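If the data is in long format, with one row per sample and an `ErrorType` column as in the question, the same capping idea can be applied with `groupby`. Below is a minimal sketch, assuming a made-up frame `long_df` that reproduces the class counts from the question; `item`, `cap` and `target` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: one row per sample, class sizes taken from the question
counts = {1: 15, 2: 8, 3: 2, 4: 3, 5: 12}
long_df = pd.DataFrame({
    'ErrorType': np.repeat(list(counts), list(counts.values())),
})
long_df['item'] = np.arange(len(long_df))

target = 20
sizes = long_df['ErrorType'].value_counts()

# Raise the per-class cap until the capped class sizes reach the target
cap = 1
while sizes.clip(upper=cap).sum() < target and cap < sizes.max():
    cap += 1

# Draw at most `cap` rows from each class; small classes are taken whole
parts = [
    group.sample(n=min(cap, len(group)), random_state=0)
    for _, group in long_df.groupby('ErrorType')
]
sampled = pd.concat(parts)
print(sampled['ErrorType'].value_counts().sort_index())
```

Building the result with a list comprehension and `pd.concat` avoids relying on `groupby(...).apply`, whose handling of the grouping column has changed between pandas versions.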
Hope this helps.