Problem description
Suppose I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this only shows up at the far left of the graph, followed by a flat line for all the other categories. In the real data the x axis will be categorical with about 18000 categories; 4% of the counts will be around 10000, and the rest will drop off to around 50.
I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see @unutbu's answer. I updated the code and am getting an error from qcut when trying to use tuples:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
Recommended answer
You could keep the normalized value counts above a certain threshold, then sum the values below the threshold and clump them together in one category which could be called, say, "other".
By choosing threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
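With thousands of categories it may be easier to fix the number of bars than to tune threshold by eye. The following is a minimal sketch of that variation, not part of the original answer: it keeps the top_n most frequent categories and lumps everything else into "other". The helper name lump_tail and its parameters are hypothetical.

```python
import pandas as pd

def lump_tail(counts, top_n, other_label='other'):
    """Keep the top_n largest shares and sum the rest into one entry.

    `counts` is a Series such as the result of value_counts().
    (Hypothetical helper, named here for illustration only.)
    """
    prob = counts / counts.sum()           # normalize to shares
    head = prob.nlargest(top_n)            # the biggest contributors
    tail = prob.drop(head.index).sum()     # combined share of the rest
    return pd.concat([head, pd.Series({other_label: tail})])

# Small example: the value 2 occurs 3 times, 0 twice, 1 and 3 once each.
counts = pd.Series([2, 2, 2, 0, 0, 1, 3]).value_counts()
result = lump_tail(counts, top_n=2)
# result.plot(kind='bar') would then draw top_n + 1 bars.
```

Fixing top_n keeps the chart the same width however skewed the data is, at the cost of not guaranteeing that every displayed bar exceeds a given share.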
There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning out of reading 3000 labels.
The graph should summarize the data, and the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to sort the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                                 np.random.randint(0, 100, size=N-M)]),
                 index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
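For the tuple-valued rows in the question's update, here is a hedged sketch of how the same grouping might be applied. It assumes the TypeError comes from qcut being handed the tuples themselves, so it bins the numeric probabilities and leaves the tuples as index labels. With only four rows the probabilities contain ties, which can make the quantile edges collide, so the sketch bins on ranks broken with rank(method='first'). That tie-breaking step is an addition for this tiny example, not something the original answer uses.

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 0, 1, 0], 's2': [1, 0, 1, 1],
                   's3': [1, 0, 1, 1], 's4': [0, 0, 0, 1]})

# One tuple per row; value_counts gives each distinct tuple's share.
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True)

# Bin on ranks rather than raw probabilities so tied values still
# produce distinct bin edges (assumption made for this small example).
ranks = prob.rank(method='first')
category_classes = pd.qcut(ranks, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])

# Total probability mass in each class.
prob_groups = prob.groupby(category_classes, observed=False).sum()
# prob_groups.plot(kind='bar') would draw the three bars.
```

On realistic data with thousands of distinct probabilities the ranking step should rarely matter, and qcut can be applied to prob directly as in the answer's code.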