Seaborn data visualization: misunderstanding of density?

Problem description

I was playing around with the seaborn library for data visualization and trying to display a standard normal distribution. The basics in this case look something like:

import numpy as np
import seaborn as sns

n = 1000
N = np.random.randn(n)
fig = sns.displot(N, kind="kde")

Which behaves as expected. My problem starts when I try to plot multiple distributions at the same time. I tried the brute-force approach of N2 = np.random.randn(n//2) and fig = sns.displot((N, N2), kind="kde"), which returns two distributions (as wanted), but the one with the smaller sample size looks significantly different (and flatter). Regardless of the sample size, a proper density plot (or histogram) should have an area under the curve equal to one, but this is clearly not the case here.

Knowing that seaborn works with pandas DataFrames, I've tried the more elaborate (and generally bad and inefficient, but I hope clear) code below to plot multiple distributions on the same graph again:

import numpy as np
import seaborn as sns
import pandas as pd
n=10000

N_1= np.reshape(np.random.randn(n),(n,1))
N_2= np.reshape(np.random.randn(int(n/2)),(int(n/2),1))
N_3= np.reshape(np.random.randn(int(n/4)),(int(n/4),1))

A_1 = np.reshape(np.array(['n1' for _ in range(n)]),(n,1))
A_2 = np.reshape(np.array(['n2' for _ in range(int(n/2))]),(int(n/2),1))
A_3 = np.reshape(np.array(['n3' for _ in range(int(n/4))]),(int(n/4),1))

F_1=np.concatenate((N_1,A_1),1)
F_2=np.concatenate((N_2,A_2),1)
F_3=np.concatenate((N_3,A_3),1)

F= pd.DataFrame(data=np.concatenate((F_1,F_2,F_3),0),columns=["datar","cat"])
F["datar"]=F.datar.astype('float')
fig=sns.displot(F,x="datar",hue="cat",kind="kde")

Which again shows very different (almost scaled) distributions, confirming that the result in this case is not consistent with what I was expecting (namely, roughly overlapping distributions). Am I not understanding how this graph works? Is there a completely different approach to drawing multiple distributions on the same graph that I am missing?

Solution

Seaborn works happily with and without dataframes. Columns of dataframes get converted to numpy arrays in order to draw the plots.

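sns.displot(..., kind="kde") delegates to sns.kdeplot(), which has a parameter common_norm defaulting to True. With that default, each density is scaled by its share of the observations, so the areas under all curves together sum to one and smaller samples produce lower curves. Setting common_norm=False normalizes each curve independently, so every curve has unit area.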

import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

n = 10000

N_1 = np.random.randn(n)
N_2 = np.random.randn(n // 2) + 2
N_3 = np.random.randn(n // 4) + 4

sns.displot((N_1, N_2, N_3), kind="kde", common_norm=False)
plt.show()

Note that for kdeplot, having common_norm default to True makes sense: you can also build the plot with three separate kdeplot calls, and those are automatically normalized independently. There is also a useful option multiple (defaulting to 'layer'), which can be set to 'stack' or 'fill'.


08-29 05:16