Sampling rows with a sample size greater than the length of the DataFrame


Problem description

I'm being asked to generate a new variable based on the data from an old one. Specifically, I need to draw values at random from the original variable (using the random function) until I have at least 10 times as many observations as the original, and then save the result as a new variable.

This is my dataset:

The variable I want to use is area.

This is my attempt, but it gives me a "module object is not callable" error:

import pandas as pd
import random as rand

dataFrame = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")

area = dataFrame['area']

random_area = rand(area)

print(random_area)

Solution

You can use the sample function with replace=True, which allows the same row to be drawn more than once:

df = df.sample(n=len(df) * 10, replace=True)

Or, to sample only the area column, use

area = df.area.sample(n=len(df) * 10, replace=True)
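As a rough illustration of why replace=True matters here, the sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from the forest fires data). Asking for more rows than exist without replacement raises a ValueError, while sampling with replacement succeeds. The random_state argument is only there to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up.
df = pd.DataFrame({"area": [0.0, 1.5, 3.2, 10.7]})

# Without replace=True, a sample larger than the data raises ValueError.
try:
    df.area.sample(n=len(df) * 10)
    raised = False
except ValueError:
    raised = True

# With replace=True every draw is independent, so values may repeat.
area = df.area.sample(n=len(df) * 10, replace=True, random_state=0)

print(raised)     # True
print(len(area))  # 40
```

The sampled Series keeps the original index labels, so labels repeat. If a clean 0..n-1 index is wanted, area.reset_index(drop=True) is one way to get it.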

Another option uses np.random.choice (with numpy imported as np), and would look something like:

df = df.iloc[np.random.choice(len(df), len(df) * 10)]

The idea is to generate random indices in the range 0 to len(df) - 1. The first argument specifies the exclusive upper bound, and the second (len(df) * 10) specifies the number of indices to generate. Because np.random.choice samples with replacement by default, the same index can appear more than once. We then use the generated indices to index into df.
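The index generation described above can be checked on a small example. This is a minimal sketch on a made-up DataFrame (hypothetical values), showing that every generated index is a valid row position and that the result has ten times as many rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up.
df = pd.DataFrame({"area": [0.0, 1.5, 3.2, 10.7]})

# An integer first argument means "draw from np.arange(len(df))",
# so every index lies in 0 .. len(df) - 1. replace=True is the default.
idx = np.random.choice(len(df), len(df) * 10)

# Positional indexing with repeated positions repeats the rows.
sampled = df.iloc[idx]

print(idx.min() >= 0, idx.max() < len(df))  # True True
print(sampled.shape)                        # (40, 1)
```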

If you only want the area column, this is sufficient:

area = df.iloc[np.random.choice(len(df), len(df) * 10), df.columns.get_loc('area')]

Index.get_loc converts the "area" label to its integer position, which is what iloc requires.
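To make the label-to-position step concrete, here is a small sketch with a made-up two-column DataFrame (the column X and all values are hypothetical). It shows the integer that get_loc returns and how that integer is passed to iloc together with the random row positions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column "X" and all values are made up.
df = pd.DataFrame({"X": [1, 2, 3], "area": [0.0, 1.5, 3.2]})

# iloc accepts only positions, so look up where the "area" label sits.
col = df.columns.get_loc("area")
print(col)  # 1

# Random row positions plus one column position yields a Series.
rows = np.random.choice(len(df), len(df) * 10)
area = df.iloc[rows, col]
print(type(area).__name__, len(area))  # Series 30
```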

df = pd.DataFrame({'A': list('aab'), 'B': list('123')})
df
   A  B
0  a  1
1  a  2
2  b  3

# Sample 3 times the original size
df.sample(n=len(df) * 3, replace=True)

   A  B
2  b  3
1  a  2
1  a  2
2  b  3
1  a  2
0  a  1
0  a  1
2  b  3
2  b  3

df.iloc[np.random.choice(len(df), len(df) * 3)]

   A  B
0  a  1
1  a  2
1  a  2
0  a  1
2  b  3
0  a  1
0  a  1
0  a  1
2  b  3
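As a side note on the error in the question: random (imported there as rand) is a module, so calling rand(area) fails because a module is not callable; a function inside the module has to be called instead. A minimal sketch using the standard library's random.choices, which samples with replacement, might look like this. The list of values is made up and stands in for list(dataFrame['area']).

```python
import random

# Made-up values standing in for list(dataFrame['area']).
values = [0.0, 1.5, 3.2, 10.7]

# Call a function from the module rather than the module itself.
# random.choices draws k items with replacement.
random_area = random.choices(values, k=len(values) * 10)

print(len(random_area))  # 40
```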
