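I am trying to subset a pandas DataFrame based on a binned category. (I know you can subset on the values themselves; this is just a stand-in for a different problem in which I really do need to bin the data!) I think I am missing something about subsetting, but I can't find it in the documentation. Here is an example: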
import numpy as np
import pandas as pd
np.random.seed(9876)
# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)
# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(),
                       stop = random_data.max() + random_data.max()*0.1,
                       step = bin_step)
# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(),
              bin_ranges,
              right = True,
              include_lowest = True)
# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')
df = pd.concat([bins_transformed, random_data_pd], axis = 1)
When I try to subset on a bin, for example (5.086, 5.586], the comparison returns False for every row. Why doesn't this select the matching rows?
df.bins == '(5.086, 5.586]'  # returns all False
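Best answer
If I understand correctly, the reason is that you are using == against a value of a different type (a string) instead of a pd.Interval. Take a look at my example.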
print(type(df.bins[0]))
<class 'pandas._libs.interval.Interval'>
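To make the type mismatch concrete, here is a minimal sketch that is independent of the data above. The toy values and bin edges are made up for illustration only; the point is that each bin is an Interval object, so it never compares equal to its printed label.

```python
import pandas as pd

# Hypothetical toy data and bin edges, chosen only for illustration.
values = [0.5, 1.5, 2.5]
bins = pd.Series(pd.cut(values, [0, 1, 2, 3]), name="bins")

# Each element is an Interval object, not a string.
first = bins[0]
print(type(first).__name__)

# An Interval does not equal its string representation...
print(first == "(0, 1]")
# ...but it does equal an Interval with the same endpoints and closure
# (pd.Interval is closed on the right by default, like pd.cut's bins).
print(first == pd.Interval(0, 1))

# The same holds element-wise for the whole column.
string_mask = bins == "(1, 2]"
interval_mask = bins == pd.Interval(1, 2)
print(string_mask.tolist())
print(interval_mask.tolist())
```

The string comparison yields no matches, while the Interval comparison marks only the row that falls into the bin (1, 2].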
print(df.bins)
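# Use the bin's exact endpoints (closed on the right by default). In recent pandas
# versions an Interval that merely lies inside the bin, such as pd.Interval(5.1, 5.2),
# does not compare equal to it.
print(df.bins == pd.Interval(5.086, 5.586))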
0 (1.586, 2.086]
1 (6.086, 6.586]
2 (8.586, 9.086]
3 (7.586, 8.086]
4 (5.086, 5.586]
5 (0.585, 1.086]
6 (4.586, 5.086]
7 (1.086, 1.586]
8 (9.086, 9.586]
9 (4.586, 5.086]
10 (1.586, 2.086]
11 (1.086, 1.586]
12 (2.586, 3.086]
13 (2.586, 3.086]
14 (1.086, 1.586]
15 (8.086, 8.586]
16 (7.086, 7.586]
17 (6.586, 7.086]
18 (8.586, 9.086]
19 (7.586, 8.086]
20 (7.586, 8.086]
21 (0.585, 1.086]
22 (4.586, 5.086]
23 (9.086, 9.586]
24 (8.086, 8.586]
25 (6.586, 7.086]
26 (5.086, 5.586]
27 (6.586, 7.086]
28 (5.086, 5.586]
29 (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
(2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
(9.086, 9.586] < (9.586, 10.086]]
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 False
Name: bins, dtype: bool
Subsetting:
print(df[df.bins == pd.Interval(5.086, 5.586)])
bins values
4 (5.086, 5.586] 5.132422
26 (5.086, 5.586] 5.309666
28 (5.086, 5.586] 5.574920
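If typing the endpoints by hand feels fragile, the following sketch shows two alternatives. It is not part of the original answer, and the toy DataFrame and bin edges are hypothetical; it assumes every row has a bin (no NaN in the bins column).

```python
import pandas as pd

# Hypothetical toy data and bin edges, chosen only for illustration.
df = pd.DataFrame({"values": [0.5, 1.5, 1.7, 2.5]})
df["bins"] = pd.cut(df["values"], [0, 1, 2, 3])

# Option 1: take the Interval straight from the categories instead of typing it.
wanted = df["bins"].cat.categories[1]  # the second bin, (1, 2]
by_category = df[df["bins"] == wanted]

# Option 2: keep every row whose bin contains a given number.
# Assumption: no missing bins, because `in` is applied to each Interval.
point = 1.6
contains_point = [point in interval for interval in df["bins"]]
by_point = df[contains_point]

print(by_category)
print(by_point)
```

Both selections return the rows with values 1.5 and 1.7, since both fall into the bin (1, 2].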
Source: the Stack Overflow question "python - Subsetting a pandas DataFrame based on bins": https://stackoverflow.com/questions/45726167/