Problem description
I have a dataframe column that specifies how many times a user has performed an activity, e.g.:
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
Based on their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E. The activity count range varies from time to time. In the above example it is approximately 9-290 (the lowest and highest values of the series), but it could be 5-500 or 5-30.
In the above example, I can take the maximum number of activities, divide it by 5, and categorize each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.
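A minimal sketch of that manual scheme in plain Python, assuming the maximum of 290 from the series above. The helper name `manual_category` and the sample list are illustrative only:

```python
import math

# Sample counts taken from the series shown above
counts = [220, 190, 105, 109, 271, 265, 64, 15, 168, 251, 278]

# Width of each range: maximum activity count divided by 5 (290 / 5 = 58)
width = 290 // 5

def manual_category(count, width=width, labels="ABCDE"):
    """Map a count to a letter using ranges 0-58, 59-116, 117-174, ..."""
    # ceil(count / width) is the 1-based range number; clamp to valid indices
    idx = max(math.ceil(count / width) - 1, 0)
    return labels[min(idx, len(labels) - 1)]

categories = [manual_category(c) for c in counts]
print(categories)
```

This works, but the width has to be recomputed by hand whenever the range of the data changes.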
Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column into the given categories?
Expected output:
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E
The natural way to do this is to split the range of the data into 5 equal-width intervals and then assign each value to the interval (bin) it falls in. Luckily, pandas lets you do that easily with pd.cut:
df["Category"] = pd.cut(df["Activity"], 5, labels=["a", "b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
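One detail worth checking: with an integer bin count, pd.cut spreads the bins between the minimum and maximum of the data, so the edges differ slightly from the 0-58, 59-116, ... ranges described in the question. The sketch below compares the two, using sample values from the question. The column names `auto` and `from_zero` are illustrative, and the explicit edges assume that ranges should start at 0:

```python
import numpy as np
import pandas as pd

# Sample counts from the question, plus the series minimum (9) and maximum (290)
counts = [220, 190, 105, 109, 271, 265, 64, 15, 168, 251, 278, 9, 290]
df = pd.DataFrame({"ActivityCount": counts})
labels = ["A", "B", "C", "D", "E"]

# Option 1: 5 equal-width bins between the data's min and max
df["auto"] = pd.cut(df["ActivityCount"], 5, labels=labels)

# Option 2: explicit edges 0, 58, 116, 174, 232, 290 (assumption: start at 0)
edges = np.linspace(0, df["ActivityCount"].max(), 6)
df["from_zero"] = pd.cut(
    df["ActivityCount"], bins=edges, labels=labels, include_lowest=True
)

print(df)
```

Here the count 64 falls in the first bin under option 1 but in range B under option 2, which is what the expected output in the question shows. If bins holding roughly equal numbers of users are preferred over equal-width bins, pd.qcut accepts the same labels argument.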
An alternative view - clustering
In the above method, we split the data into 5 bins of equal width. An alternative, more sophisticated approach is to split the data into 5 clusters, aiming to make the data points within each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.). With a single dimension, this is therefore a very simple case of clustering.
In this case, k-means clustering can be done in the following way:
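import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

# l is assumed to be the list of activity counts from the question
df = pd.DataFrame({"Activity": l})
# k-means expects a 2-D float array of shape (n_samples, n_features)
features = df[["Activity"]].to_numpy(dtype=float)
# Rescale each feature to unit variance
whitened = whiten(features)
# Find 5 centroids, then assign every point to its nearest centroid
codebook, distortion = kmeans(whitened, 5)
code, dist = vq(whitened, codebook)
df["Category"] = code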
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
- The category labels are arbitrary and are not ordered by activity. In the output above, for example, label '2' (activity 88) corresponds to lower activity than label '1' (activities 138-145).
- I didn't map the labels from 0-4 to A-E. This can easily be done using pandas' map.
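As a rough sketch of that last step, the k-means labels can be ranked by centroid so that A is the lowest-activity cluster and E the highest. The sample data and the names `order` and `label_map` are illustrative. Note also that scipy's kmeans may return fewer than 5 centroids, in which case fewer letters are used:

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# Sample activity counts taken from the question's data
activity = [9, 15, 25, 38, 43, 64, 88, 105, 109, 115, 138, 139, 145, 168,
            186, 190, 192, 197, 220, 231, 251, 255, 265, 268, 271, 272, 278, 290]
df = pd.DataFrame({"Activity": activity})

# Same clustering steps as above
features = df[["Activity"]].to_numpy(dtype=float)
whitened = whiten(features)
codebook, _ = kmeans(whitened, 5)
code, _ = vq(whitened, codebook)

# Rank the centroids from lowest to highest activity, then map each
# arbitrary k-means label to a letter in that order
order = np.argsort(codebook[:, 0])
label_map = {int(old): letter for old, letter in zip(order, "ABCDE")}
df["Category"] = pd.Series(code).map(label_map)

print(df)
```

Because k-means starts from random centroids, the cluster boundaries can differ between runs, but after this mapping the letters always increase with activity.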