Question
This question follows up on this question (other contributors asked me to post it as a new question).
We have this mock df:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'country': ['USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'USA', 'Canada']
})
Let's say I want to sample 4 random rows from USA and 2 random rows from Canada. I've tried:
df.groupby("country").sample(n=[4, 2])
This raises an error. The mistake is probably the use of square brackets. How can I specify a different n for each group, then?
Ideally, I need a solution that uses df.groupby.sample. I also need to specify n, not a proportion or weights as in the documentation (see here). Finally, I need to set a seed. Thank you.
Recommended answer
You can group the dataframe on country, then .sample each group separately, looking up the number of rows to take for that group in a dictionary, and finally .concat all the sampled groups:
d = {'USA': 4, 'Canada': 2} # mapping dict
pd.concat([g.sample(d[k]) for k, g in df.groupby('country', sort=False)])
   id country
0   1     USA
4   5     USA
1   2     USA
2   3     USA
6   7  Canada
9  10  Canada
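The question also asks for a seed, which the snippet above does not set. A minimal sketch of one way to add it, assuming an integer passed as random_state to each per-group .sample call is acceptable (the helper name sample_per_group, the dictionary name n_per_country, and the seed value 42 are all made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "country": ["USA", "USA", "USA", "USA", "USA",
                "Canada", "Canada", "Canada", "USA", "Canada"],
})

# Hypothetical mapping from group label to the number of rows to draw.
n_per_country = {"USA": 4, "Canada": 2}
SEED = 42  # arbitrary value chosen for this sketch


def sample_per_group(frame, key, sizes, seed):
    """Draw sizes[label] rows from each group of `frame`, reproducibly."""
    parts = [
        # The same integer seed is reused for every group; each group is
        # a different frame, so the draws are still made independently
        # per group and are identical across runs.
        group.sample(n=sizes[label], random_state=seed)
        for label, group in frame.groupby(key, sort=False)
    ]
    return pd.concat(parts)


sampled = sample_per_group(df, "country", n_per_country, SEED)
print(sampled)
```

Calling the helper twice with the same seed returns the same rows. A group label missing from the dictionary raises a KeyError, which may or may not be the behaviour you want; sizes.get(label, 0) would silently skip such groups instead.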