Problem description
I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heatmap using Illustrator.
Example code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
Although I save the plot as a vector image (.pdf; the file itself is not that large), the PDF opens very slowly. I suspect this is because every individual point in the data frame is rendered when the file is opened.
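For reference, here is a minimal sketch of how such an export might be done with ggsave (the file name and dimensions are placeholders, not taken from the original post):

p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
# Save as a vector PDF: every raster cell becomes its own object in the PDF,
# which is what makes the file so slow to open for very large grids.
ggsave("heatmap.pdf", plot = p, width = 7, height = 6)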
I have also read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices; however, I would like to use ggplot so that I can modify the image.
Question: How do I speed up the rendering of this plot? Is there a way (besides lowering the resolution of the plot) to speed this process up while keeping the image vectorized? Is it possible to downsample a vectorized ggplot?
Recommended answer
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100, 100)) + # 10x downsample per axis, in this case
  scale_x_continuous(breaks = 100 * 0:10) +
  labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine, and it had artifacts on the top and right edges. I presume that is because the bins along those edges are smaller, leaving more variation.
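If you want to reproduce the timing, one way (a sketch, not part of the original answer) is to force the plot to draw inside system.time(), since a ggplot object is only rendered when it is printed:

p <- ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100, 100))
# Printing the ggplot object triggers the actual rendering work
system.time(print(p))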
It got slower from there when I tried a larger grid like the one you are asking about.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. In this use case, however, the output is effectively a 10,000-megapixel image (100,000 × 100,000 cells), far beyond the limits of human perception, being exported into a vector format in which each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful in certain unusual cases, for example if you ever need to blow up your heatmap onto a gigantic surface like a football field without loss of resolution. But it sounds like in this case it might be the wrong tool for the job, since you would be putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x (per axis) before sending it to ggplot. This necessarily reduces the resolution, but in this use case I don't see the value of preserving resolution beyond what humans can perceive.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Re-create the data at 10k x 10k (about 1527.1 Mb when initialized)
df <- expand.grid(x = seq(1, 10000), y = seq(1, 10000))
set.seed(123)
df$z <- rnorm(nrow(df))
downsample <- 100
# Bucket x and y to the nearest multiple of `downsample`, then average z per bucket
df2 <- df %>%
  group_by(x = downsample * round(x / downsample),
           y = downsample * round(y / downsample)) %>%
  summarise(z = mean(z))
ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  scale_x_continuous(breaks = 1000 * 0:10) +
  labs(title = "10,000x10,000 downsampled to 100x100")
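If the end goal is still to polish the figure in Illustrator, the downsampled plot can then be exported as a much smaller vector PDF. A sketch, with placeholder file name and dimensions:

p2 <- ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  labs(title = "10,000x10,000 downsampled to 100x100")
# Only 100 x 100 = 10,000 rectangles end up in the PDF, so it opens quickly
# and stays editable as vector objects in Illustrator.
ggsave("heatmap_downsampled.pdf", plot = p2, width = 7, height = 6)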