Python: generate unique ranges of a specific length and categorize them

Problem description

I have a DataFrame column that records how many times each user has performed an activity, e.g.:

>>> df['ActivityCount']
Users     ActivityCount
User0     220
User1     190
User2     105
User3     109
User4     271
User5     265
     ...
User95     64
User96     15
User97    168
User98    251
User99    278
Name: ActivityCount, Length: 100, dtype: int32


>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]

Based on their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E. The ActivityCount range varies from time to time. In the above example it lies roughly between 9 and 290 (the lowest and highest values of the series), but it could just as well be 5-500 or 5-30. In the above example, I can take the maximum number of activities, divide it by 5, and categorize each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.

Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column into the given categories? Expected output:

>>> df
Users     ActivityCount  Category/Range
User0     220             D
User1     190             D
User2     105             B
User3     109             B
User4     271             E
User5     265             E
     ...
User95     64             B
User96     15             A
User97    168             C
User98    251             E
User99    278             E

Solution

The natural way to do that is to split the range of the data into 5 equal-width intervals and then assign each value to the interval (bin) it falls into. Luckily, pandas lets you do that easily with pd.cut:

# Split the range of Activity into 5 equal-width bins labelled a-e
df["Category"] = pd.cut(df["Activity"], 5, labels=["a", "b", "c", "d", "e"])

The output is something like:

    Activity Category
34       115        b
15        43        a
57       192        d
78       271        e
26        88        b
6         25        a
55       186        d
63       220        d
1         15        a
76       268        e
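Note that pd.cut with an integer bin count spans the bins from the minimum to the maximum of the data, whereas the question describes ranges that start at 0 (A: 0-58, B: 59-116, and so on). A minimal sketch of that variant follows, passing explicit bin edges instead of a count. The helper name categorize_from_zero and its upper parameter are made up for illustration, and the sample frame contains only the eleven users visible in the question, with 290 taken from the question as the maximum of the full series.

```python
import numpy as np
import pandas as pd


def categorize_from_zero(counts, upper=None, labels=("A", "B", "C", "D", "E")):
    """Bin counts into len(labels) equal-width ranges spanning 0..upper.

    Hypothetical helper: `upper` defaults to the largest count in the data.
    """
    if upper is None:
        upper = counts.max()
    # n bins need n + 1 edges, e.g. 0, 58, 116, 174, 232, 290 for upper=290
    edges = np.linspace(0, upper, len(labels) + 1)
    # include_lowest=True keeps a count of exactly 0 inside the first bin
    return pd.cut(counts, bins=edges, labels=list(labels), include_lowest=True)


# Only the users shown in the question; the full series is not reproduced here.
df = pd.DataFrame(
    {"ActivityCount": [220, 190, 105, 109, 271, 265, 64, 15, 168, 251, 278]},
    index=["User0", "User1", "User2", "User3", "User4", "User5",
           "User95", "User96", "User97", "User98", "User99"],
)
df["Category"] = categorize_from_zero(df["ActivityCount"], upper=290)
print(df)
```

For these users the resulting letters match the expected output in the question. If the range changes over time, leaving upper unset makes the edges follow the current maximum.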

An alternative view - clustering

In the above method, we've split the data into 5 bins of equal width. An alternative, more sophisticated approach is to split the data into 5 clusters, aiming to make the data points within each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.

One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.). With a single dimension, as here, this is a very simple case of clustering.

In this case, k-means clustering can be done in the following way:

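import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# `l` is assumed to hold the list of activity counts (not defined in this snippet)
df = pd.DataFrame({"Activity": l})

# k-means expects a 2-D array: one row per observation, one column per feature
features = np.array([[x] for x in df.Activity])
whitened = whiten(features)  # rescale each feature to unit variance
codebook, distortion = kmeans(whitened, 5)  # compute 5 centroids
code, dist = vq(whitened, codebook)  # assign each row to its nearest centroid

df["Category"] = code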

And the output looks like:

    Activity  Category
40       138         1
79       272         0
72       255         0
13        38         3
41       139         1
65       231         0
26        88         2
59       197         4
76       268         0
45       145         1

A couple of notes:

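The labels of the categories are arbitrary and do not follow the order of activity. In the output above, for example, label '2' (88) refers to lower activity than label '1' (138-145), while label '0' holds the highest counts.
I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.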
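The relabelling mentioned in the notes can be sketched as follows. This is one possible approach rather than part of the original answer: it ranks the clusters by their mean activity and maps the least active cluster to A and the most active to E. The frame is rebuilt from the ten rows printed above so the example runs on its own; the names cluster_order, code_to_letter and the Label column are illustrative.

```python
import pandas as pd

# The ten rows shown in the k-means output above
df = pd.DataFrame(
    {
        "Activity": [138, 272, 255, 38, 139, 231, 88, 197, 268, 145],
        "Category": [1, 0, 0, 3, 1, 0, 2, 4, 0, 1],
    },
    index=[40, 79, 72, 13, 41, 65, 26, 59, 76, 45],
)

# Cluster codes ordered from the lowest to the highest mean activity
cluster_order = df.groupby("Category")["Activity"].mean().sort_values().index

# Pair each code with a letter: least active cluster -> "A", most active -> "E"
code_to_letter = dict(zip(cluster_order, "ABCDE"))

df["Label"] = df["Category"].map(code_to_letter)
print(df)
```

Because k-means numbers its clusters arbitrarily, the mapping is derived from the data on each run instead of being hard-coded.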
