Conditionally creating an "other" category in a categorical column

Question

I have a DataFrame df with one column, category, created with the code below:

import pandas as pd
import random as rand
from string import ascii_uppercase

rand.seed(1010)

# Draw 1000 random uppercase letters and store them in a single column
values = [rand.choice(ascii_uppercase) for _ in range(1000)]
df = pd.DataFrame({'category': values})

The frequency count of each value is:

df['category'].value_counts()
Out[95]:
P    54
B    50
T    48
V    46
I    46
R    45
F    43
K    43
U    41
C    40
W    39
E    39
J    39
X    37
M    37
Q    35
Y    35
Z    34
O    33
D    33
H    32
G    32
L    31
N    31
S    29

I would like to create a new value called "other" in the df['category'] column and assign it to every value of df['category'] whose value_counts() count is less than 35.

Can someone help me out with this?

Let me know if you need anything more from me.

Edit: trying the solution proposed by @EdChum

import pandas as pd
import random as rand
from string import ascii_uppercase

rand.seed(1010)

df = pd.DataFrame()
values = list()
for i in range(0,1000):
    category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
    values.append(category)

df['category'] = values
df['category'].value_counts()

df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_‌​counts() < 35]).index), 'category'] = 'other'

  File "<stdin>", line 1
    df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_‌​counts() < 35]).index), 'category'] = 'other'
                                                                                   ^
SyntaxError: invalid syntax

Note that I am using Python 2.7 in the Spyder IDE (I tried the proposed solution in both the IPython and Python console windows).
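The line echoed in the traceback contains two invisible zero-width characters (U+200C and U+200B) between value_ and counts, which can sneak in when code is copied from a web page. The following is a minimal sketch for spotting and stripping such characters, assuming the pasted text is available as a string. The helper name find_non_ascii and the shortened sample fragment are hypothetical, introduced only for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (position, code point, Unicode name) for each non-ASCII character."""
    return [
        (pos, 'U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN'))
        for pos, ch in enumerate(text)
        if ord(ch) > 127
    ]


# A shortened fragment of the pasted line, with the hidden characters written as escapes
pasted = u"df['category'].value_\u200c\u200bcounts() < 35"

# Report every hidden character so it can be located in the editor
for hit in find_non_ascii(pasted):
    print(hit)

# Dropping the non-ASCII characters leaves plain ASCII source text
cleaned = ''.join(ch for ch in pasted if ord(ch) < 128)
print(cleaned)
```

Retyping the expression by hand, rather than pasting it, avoids the problem in the same way.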

Recommended answer

You can use value_counts to build a boolean mask that selects the infrequent values, and then set those rows to 'other' using loc:

In [71]:
df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
df

Out[71]:
    category
0      other
1      other
2          A
3          V
4          U
5          D
6          T
7          G
8          S
9          H
10     other
11     other
12     other
13     other
14         S
15         D
16         B
17         P
18         B
19     other
20     other
21         F
22         H
23         G
24         P
25     other
26         M
27         V
28         T
29         A
..       ...
970        E
971        D
972    other
973        P
974        V
975        S
976        E
977    other
978        H
979        V
980        O
981    other
982        O
983        Z
984    other
985        P
986        P
987    other
988        O
989    other
990        P
991        X
992        E
993        V
994        B
995        P
996        B
997        P
998        Q
999        X

[1000 rows x 1 columns]

Breaking the above down:

In [74]:
df['category'].value_counts() < 35

Out[74]:
W    False
B    False
C    False
V    False
H    False
P    False
T    False
R    False
U    False
K    False
E    False
Y    False
M    False
F    False
O    False
A    False
D    False
Q    False
N     True
J     True
S     True
G     True
Z     True
I     True
X     True
L     True
Name: category, dtype: bool

In [76]:
df['category'].value_counts()[df['category'].value_counts() < 35]

Out[76]:
N    34
J    33
S    33
G    33
Z    32
I    31
X    31
L    30
Name: category, dtype: int64

We can then use isin against the .index values and set the matching rows to 'other'.
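The steps above can also be written so that value_counts is computed only once. This is a minimal sketch on a small hand-made DataFrame; the sample data, the threshold of 2, and the variable names counts and rare are assumptions chosen for illustration, not values from the question.

```python
import pandas as pd

# Hypothetical sample data (not from the question): 'a' x3, 'b' x2, 'c' x1, 'd' x1
df = pd.DataFrame({'category': ['a', 'a', 'a', 'b', 'b', 'c', 'd']})
threshold = 2  # illustrative cutoff; the question uses 35

# Count each value once, then keep the labels whose count is below the threshold
counts = df['category'].value_counts()
rare = counts[counts < threshold].index

# Overwrite every row whose category is one of the rare labels
df.loc[df['category'].isin(rare), 'category'] = 'other'

print(df['category'].tolist())
# ['a', 'a', 'a', 'b', 'b', 'other', 'other']
```

Storing the counts in a variable also keeps the mask and the lookup based on the same snapshot of the column, which the one-liner achieves only by calling value_counts twice.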

