Problem description
I have a piece of software that has to process lots of different data sets, and the time it takes to process each one varies. As the software gets revised, the processing time changes, so I want to create a graph that shows the variance in time as well as outliers, because ideally this program should take about the same amount of time for each piece of data (it sounds strange and unrealistic, I know, but just roll with me here).
At first I thought about using box plots, but I found them inadequate: it is entirely possible for half of a data set to cluster around one value and the other half around another, and I didn't feel a box plot would illustrate that well. So I decided to try a histogram, but I can't figure out how to get matplotlib to draw it the way I want. I want a single figure, with the X-axis labeled with software versions and the Y-axis showing the time taken to process a data set, containing multiple histograms, like this mockup I made:
This graph would show that in version 0.1, most data sets were processed in 2-4 seconds, while a bunch of sets for some reason took 12 seconds. v0.1a got rid of those long outliers, but everything took longer. v0.1b is just slightly faster than v0.1a. Finally, v0.2 shows a big speed improvement, but introduces outliers again.
How can I get matplotlib to create a plot like that?
Recommended answer
Here is a (very) basic mockup of how this can be achieved:
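import matplotlib.pyplot as plt
import numpy as np

number_of_bins = 20
number_of_data_points = 1000

ax = plt.subplot(111)

# Three synthetic data sets standing in for three software versions
data_set = [np.random.normal(0, 1, number_of_data_points),
            np.random.normal(6, 1, number_of_data_points),
            np.random.normal(-3, 1, number_of_data_points)]
# x positions on which each histogram is centered (chosen by hand)
MID_VALUES = [0, 200, 400]
labels = ["v1", "v2", "v3"]

for MID_VAL, y in zip(MID_VALUES, data_set):
    hist, bin_edges = np.histogram(y, bins=number_of_bins)
    bottom = bin_edges[:-1]        # lower edge of each bin
    heights = np.diff(bin_edges)   # bin widths become bar heights
    lefts = MID_VAL - .5 * hist    # center each bar on MID_VAL
    # align="edge" makes each bar start at its bin's lower edge
    # (recent matplotlib versions center bars by default)
    ax.barh(bottom, hist, height=heights, left=lefts, align="edge")

ax.set_xticks(MID_VALUES)
ax.set_xticklabels(labels)
plt.show()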
This lacks a lot of refinement, I admit. For example, the MID_VALUES are chosen by hand; the right values depend on the data set, and picking them could be automated. Nevertheless, you may be able to get it into a more usable format.
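One possible way to automate the choice of center positions is sketched below. This is only a sketch under a few assumptions, not part of the original mockup: the helper names `side_by_side_layout` and `draw_side_by_side` and the `gap_fraction` parameter are made up for illustration, all data sets are binned on shared edges so their vertical scales match, and the centers are spaced by the widest bar plus a small gap so that neighboring histograms cannot overlap.

```python
import numpy as np


def side_by_side_layout(data_sets, number_of_bins=20, gap_fraction=0.1):
    """Histogram every data set on shared bins and pick x-centers automatically.

    Hypothetical helper: the name and gap_fraction are illustrative choices.
    """
    # Shared bin edges, so every histogram uses the same vertical scale.
    all_values = np.concatenate([np.asarray(d, dtype=float) for d in data_sets])
    bin_edges = np.histogram_bin_edges(all_values, bins=number_of_bins)
    counts = [np.histogram(d, bins=bin_edges)[0] for d in data_sets]

    # Space the centers by the widest bar plus a small gap, so two
    # neighboring histograms can never overlap.
    widest = max(c.max() for c in counts)
    spacing = widest * (1.0 + gap_fraction)
    centers = np.arange(len(data_sets)) * spacing
    return bin_edges, counts, centers


def draw_side_by_side(ax, data_sets, labels, number_of_bins=20):
    """Draw one centered horizontal histogram per data set on the given Axes."""
    bin_edges, counts, centers = side_by_side_layout(data_sets, number_of_bins)
    heights = np.diff(bin_edges)
    for center, hist in zip(centers, counts):
        # Each bar starts at its bin's lower edge and is centered on `center`.
        ax.barh(bin_edges[:-1], hist, height=heights,
                left=center - 0.5 * hist, align="edge")
    ax.set_xticks(centers)
    ax.set_xticklabels(labels)
    return centers
```

With an Axes created by `fig, ax = plt.subplots()`, calling `draw_side_by_side(ax, data_set, labels)` and then `plt.show()` should give roughly the same picture as the mockup above, without hand-picked MID_VALUES. Spacing by the single widest bar is the simplest choice; it wastes horizontal room when one version has a much taller peak than the others.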