Problem description
I have a DataFrame df with a single column, category, created with the code below:
import pandas as pd
import random as rand
from string import ascii_uppercase
rand.seed(1010)
df = pd.DataFrame()
values = list()
for i in range(0,1000):
    category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
    values.append(category)
df['category'] = values
The frequency count of each value is:
df['category'].value_counts()
Out[95]:
P 54
B 50
T 48
V 46
I 46
R 45
F 43
K 43
U 41
C 40
W 39
E 39
J 39
X 37
M 37
Q 35
Y 35
Z 34
O 33
D 33
H 32
G 32
L 31
N 31
S 29
I would like to create a new value called "other" in the df['category'] column and assign it to every value of df['category'] whose value_counts() count is less than 35.
Can someone help me out with this? Let me know if you need anything more from me.
Edit, after trying the solution proposed by @EdChum:
import pandas as pd
import random as rand
from string import ascii_uppercase
rand.seed(1010)
df = pd.DataFrame()
values = list()
for i in range(0,1000):
    category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
    values.append(category)
df['category'] = values
df['category'].value_counts()
df.loc[df['category'].isin((df['category'].value_counts([df['category'].value_counts() < 35]).index), 'category'] = 'other'
File "<stdin>", line 1
df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
^
SyntaxError: invalid syntax
Note that I am using Python 2.7 in the Spyder IDE (I tried the proposed solution in both the IPython and Python console windows).
Recommended answer
You can use value_counts to build a boolean mask that selects the infrequent values, and then set those rows to 'other' using loc:
In [71]:
df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
df
Out[71]:
category
0 other
1 other
2 A
3 V
4 U
5 D
6 T
7 G
8 S
9 H
10 other
11 other
12 other
13 other
14 S
15 D
16 B
17 P
18 B
19 other
20 other
21 F
22 H
23 G
24 P
25 other
26 M
27 V
28 T
29 A
.. ...
970 E
971 D
972 other
973 P
974 V
975 S
976 E
977 other
978 H
979 V
980 O
981 other
982 O
983 Z
984 other
985 P
986 P
987 other
988 O
989 other
990 P
991 X
992 E
993 V
994 B
995 P
996 B
997 P
998 Q
999 X
[1000 rows x 1 columns]
Breaking the above down:
In [74]:
df['category'].value_counts() < 35
Out[74]:
W False
B False
C False
V False
H False
P False
T False
R False
U False
K False
E False
Y False
M False
F False
O False
A False
D False
Q False
N True
J True
S True
G True
Z True
I True
X True
L True
Name: category, dtype: bool
In [76]:
df['category'].value_counts()[df['category'].value_counts() < 35]
Out[76]:
N 34
J 33
S 33
G 33
Z 32
I 31
X 31
L 30
Name: category, dtype: int64
We can then use isin against the .index values of that filtered result and set the matching rows to 'other'.
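The steps described in this answer can also be written out one at a time, which keeps the brackets easier to balance than the single-line form. The following is only a minimal sketch: the small DataFrame and the threshold of 3 are made up for illustration (the question uses random data and a cutoff of 35), and the names counts, rare_labels and is_rare are not from the original post.

```python
import pandas as pd

# Small hand-made example (NOT the data from the question).
df = pd.DataFrame({"category": ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]})

threshold = 3  # hypothetical cutoff; the question uses 35

# Step 1: count each value once and reuse the result.
counts = df["category"].value_counts()

# Step 2: keep the labels whose count is strictly below the threshold.
rare_labels = counts[counts < threshold].index

# Step 3: boolean mask over the rows that hold one of those labels.
is_rare = df["category"].isin(rare_labels)

# Step 4: overwrite only the masked rows.
df.loc[is_rare, "category"] = "other"

print(df["category"].value_counts())
```

Computing value_counts() once and reusing it avoids counting the column twice, which the one-line version does.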