Question
I was told to use geom_jitter instead of geom_point, and the reason given in the help page is that it handles overplotting better in smaller datasets. I am confused: what does overplotting mean, and why does it occur in smaller datasets?
Answer
Overplotting is when two or more points are in the same place (or close enough to the same place) that you can't look at the plot and tell how many points are there.
Two (not mutually exclusive) cases that often lead to overplotting:
- Noncontinuous data - e.g., if x or y are integers, then it will be difficult to tell how many points there are.
- Lots of data - if your data is dense (or has regions of high density), then points will often overlap even if x and y are continuous.
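One quick way to check for the first kind of overplotting is to count rows that share exact coordinates. This is only a minimal sketch, assuming the mpg dataset that ships with ggplot2; the variable names (xy, n_hidden, n_total) are my own and not part of any API:

```r
library(ggplot2)  # provides the mpg dataset

# Keep only the two columns that would be mapped to x and y
xy <- as.data.frame(mpg[, c("cyl", "hwy")])

# duplicated() flags rows whose (cyl, hwy) pair already appeared earlier,
# i.e. points that would be drawn exactly on top of another point
n_hidden <- sum(duplicated(xy))
n_total <- nrow(xy)

cat(n_hidden, "of", n_total, "points sit exactly on top of another point\n")
```

A large share of hidden points suggests that a plain scatter plot will understate how much data is there.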
Jittering is adding a small amount of random noise to data. It is often used to spread out points that would otherwise be overplotted. It is only effective in the non-continuous data case where overplotted points typically are surrounded by whitespace - jittering the data into the whitespace allows the individual points to be seen. It effectively un-discretizes the discrete data.
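To make "a small amount of random noise" concrete, here is a rough sketch of jittering by hand in base R. The data values and the width of 0.25 are made up for illustration, and I am assuming uniform noise, which as far as I know is roughly what geom_jitter does with its width argument:

```r
# Illustrative values only, not taken from a real dataset
set.seed(1)                  # fixed seed so the sketch is repeatable
x <- c(4, 4, 4, 6, 6, 8)     # discrete x values that would overplot
width <- 0.25                # assumed maximum shift in either direction

# Shift every value by a random amount drawn uniformly from [-width, width]
x_jittered <- x + runif(length(x), min = -width, max = width)

# Each point stays within `width` of where it started
all(abs(x_jittered - x) <= width)
```

Because the shifts are small relative to the gap between the discrete values (here 2), the points spread out without drifting into a neighbouring group.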
With high-density data, jittering doesn't help because there is no reliable area of whitespace around the overlapping points. Other common techniques for mitigating overplotting include:
- using smaller points
- using transparency
- binning data (as in a heat map)
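As a sketch of the binning option, something like the following should work with the diamonds dataset from ggplot2. The choice of geom_bin2d and bins = 50 is mine, picked only for illustration; other binning geoms would do the same job:

```r
library(ggplot2)

# Binning: count the points that fall in each rectangular cell and
# map that count to the fill colour, like a heat map.
# bins = 50 is an arbitrary choice for illustration.
p_binned <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 50)

p_binned
```

Instead of drawing every point, this draws one tile per cell, so dense regions show up as differently coloured tiles rather than a solid blob.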
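Example of jitter working on small data (adapted from ?geom_jitter):

    library(ggplot2)

    # cyl is discrete, so many points share the same coordinates
    p = ggplot(mpg, aes(cyl, hwy))
    gridExtra::grid.arrange(
      p + geom_point(),                             # top: overplotted
      p + geom_jitter(width = 0.25, height = 0.5)   # bottom: jittered
    )
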
Above, moving the points just a little bit spreads them out. Now we can see how many points are "really there", without changing the data so much that we no longer understand it.
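And an example of jitter not working on bigger data:

    # carat and price are continuous, but there are too many points
    p2 = ggplot(diamonds, aes(carat, price))
    gridExtra::grid.arrange(
      p2 + geom_point(),                          # top: regular
      p2 + geom_jitter(),                         # middle: jittered
      p2 + geom_point(alpha = 0.1, shape = 16)    # bottom: small, transparent points
    )
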
Below, the jittered plot (middle) is just as overplotted as the regular plot (top). There isn't open space around the points to spread them into. However, with a smaller point mark and transparency (bottom plot) we can get a feel for the density of the data.