在文献中,我们经常遇到随机森林和相关性热图的组合图片(下图),它由一幅叠加变量重要性圆圈的相关性热图和一幅说明因变量被解释程度的条形图组成。今天,我们将试着用自己的数据在R里面去复现这类图。

原图

   先看看所需复现的随机森林变量重要性+相关性热图:

R语言ggplot2 | 绘制随机森林重要性+相关性热图-LMLPHP
这类图片由两部分组成。其中容易理解的就是下面的相关性热图,它是核心微生物(y轴)与土壤养分变量(x轴)的相关性分析。剩下的圆圈和上面的条形图都可以归类到随机森林分析部分。它将核心微生物作为自变量,分别将各个土壤养分变量作为因变量,进行随机森林分析。每一次随机森林分析得到的变量重要性指数则映射为热图里的圆圈大小(只收集p<0.05的变量的重要性),而模型对目标土壤养分变量的解释程度则作为上面条形图柱子的高度。接下来,我们来模仿复现这张图。

复现

准备数据集及数据处理

rm(list = ls())
# 在R中绘制随机森林模型中变量重要性和相关性的热图
library(vegan)
library(randomForest)
library(rfUtilities)
library(rfPermute)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(export)
library(reshape2)

Micro <- openxlsx::read.xlsx("OTU1.xlsx", 1)
ENV <- openxlsx::read.xlsx("ENV1.xlsx", 1);head(ENV)
# Sample Ecosystem       SOC       pH       CEC        TP       TK       AK        TN        NO3        NH4      DOC
# 1     A1 Grassland  27.05671 7.904906 15.323426 0.9808826 14.75433 180.5585 2.1198671 125.335557  0.3171696 118.5123
# 2     A2 Grassland  25.76547 8.573666 18.104304 0.8527650 14.69093 166.8692 2.4470941 107.296478  2.4062394 124.6120
# 3     A3    Forest  12.33270 8.277955  8.297831 0.4298910 12.94322 121.5713 0.7090047   6.961546  2.1797286 129.7396
# 4     A4    Forest  16.63684 7.150481 15.632651 0.3711634 13.53909 194.0584 1.0868895  23.967231 29.2321159 129.3390
# 5     A5  Cropland 131.72950 7.548969 30.291618 0.6691908 14.19519 144.4154 5.4975148  38.032274 17.6463368 106.5602
# 6     A6  Cropland 123.75436 7.840254 34.357811 0.5828902 14.51699 387.8284 7.9689639  59.413622 12.7478292 105.8368
# MAT       SD       Pl
# 1 2.0162632 1.855115 274.0981
# 2 2.1286678 2.370189 369.1070
# 3 3.5848455 2.037255 263.5001
# 4 2.6846039 2.026378 346.7367
# 5 0.5787488 2.380261 333.7921
# 6 0.6756787 1.590068 240.1557
# 计算香浓指数,并把它合并在环境变量数据中
ENV$shannon <- diversity(t(Micro), index = "shannon")

# 生成一个总的与先前的不同生态系统合并成一个新的数据集
ENV_total <- ENV
ENV_total$Ecosystem <- "Whole"

ENV_new <- rbind(ENV_total, ENV)

# 确定不同生态系统类型的数量
unique(ENV_new$Ecosystem)
# [1] "Whole"     "Grassland" "Forest"    "Cropland"  "Wetland"   "Desert" 

自变量数据是一组虚拟的土壤理化指标数据,它来自5个生态系统:草地、森林、农田、湿地、荒漠,和一个总体的。随后将采样点微生物的香浓指数合并到环境变量中。我们计划将数据按生态系统包括总体共6类,计算每一类生态系统中土壤环境变量对微生物香浓指数的潜在贡献。这里的思路是通过并行运算每类的随机森林模型,提高运行速度。

构建不同分类随机森林模型的并行计算

# 构建自定义函数进行并行运算
library(foreach)
library(doParallel)

# 设置核心数
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
registerDoParallel(cl)

# 获取生态系统列表
ecosystems <- unique(ENV_new$Ecosystem)

# 设置输出结果名称
options <- list()
options$results <- ecosystems

# 定义一个函数,对每个生态系统运行随机森林分析
run_single_rf <- function(data, eco) {
  if (!"Ecosystem" %in% names(data)) {
    stop("Error: 'Ecosystem' variable not found in data.")
  }
  if (!(eco %in% unique(data$Ecosystem))) {
    stop("Error: Invalid Ecosystem value.")
  }
  # 对当前生态系统的数据进行子集
  eco_data <- subset(data, Ecosystem == eco)[,-1:-2]
  # 设置种子
  set.seed(1)
  # 进行随机森林分析和置换检验
  RF <- randomForest(shannon~., eco_data, importance = T)
  RF_r2 <- rf.significance(RF, eco_data, nperm=99, ntree=501)$Rsquare
  RF_permuted <- rfPermute(shannon~., eco_data, nperm=99, ntree=501)
  # 返回结果
  list(ecosystem = eco, RF = RF, RF_r2 = RF_r2, RF_permuted = RF_permuted)
}

results <- foreach(i = ecosystems, .combine = rbind) %dopar% {
  library(randomForest)
  library(rfUtilities)
  library(rfPermute)
  # 运行单个随机森林分析
  run_single_rf(data = ENV_new, eco = i)
}

# 停止并行处理
stopCluster(cl)
registerDoSEQ()

# 确认结果
head(results)
# ecosystem   RF                      RF_r2     RF_permuted
# result.1 "Whole"     randomForest.formula,18 0.5722844 rfPermute,6
# result.2 "Grassland" randomForest.formula,18 0.6213995 rfPermute,6
# result.3 "Forest"    randomForest.formula,18 0.4347413 rfPermute,6
# result.4 "Cropland"  randomForest.formula,18 0.4695682 rfPermute,6
# result.5 "Wetland"   randomForest.formula,18 0.1822965 rfPermute,6
# result.6 "Desert"    randomForest.formula,18 0.5186782 rfPermute,6

results[, "RF_permuted"][[1]]
# An rfPermute model
# 
# Type of random forest: regression 
# Number of trees: 501 
# No. of variables tried at each split: 4 
# No. of permutation replicates: 100 
# Start time: 2023-04-18 22:56:38 
# End time: 2023-04-18 22:57:05 
# Run time: 26.8 secs 

这个results结果看起来想data.frame,实际上是一个Large list。
使用随机森林计算整体数据中环境生物因子对微生物香浓指数的重要性。

  1. 这里的%Var explained就是我们画柱状图时柱子的高度,大家可以抄录下来。
  2. importance函数调取了每个变量对shannon指数的重要性,用它们定义热图中空心圆的大小,但我们只需要调取p值小于0.05的变量。(将在后文统一调用)

绘制随机森林变量重要性柱状图

# 绘制柱状图
names <- c("ecosystem", "RF_r2")
bar <- as.data.frame(results[,names])
bar$ecosystem <- unlist(bar$ecosystem)
bar$RF_r2 <- unlist(bar$RF_r2)
str(bar)
# 'data.frame':	6 obs. of  2 variables:
# $ ecosystem: chr  "Whole" "Grassland" "Forest" "Cropland" ...
# $ RF_r2    : num  0.572 0.621 0.435 0.47 0.182 ...
colnames(bar) <- c("variable", "Exp")
bar$variable<- factor(bar$variable)
barplot<- ggplot(bar, aes(variable, Exp))+
  geom_bar(stat = "identity", fill = "steelblue")+
  scale_y_continuous(expand = c(0,0))+
  theme_classic(base_line_size = 0.75)+
  theme(panel.background = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_text(color = "black", size = 12))+
  labs(title = "Explained variation (%)", size = 12)
barplot

R语言ggplot2 | 绘制随机森林重要性+相关性热图-LMLPHP

六组随机森林分析的解释量条形图已经完成,我们可以计算并绘制相关性热图部分。

计算数据集的相关性

使用cor()函数直接计算矩阵的相关性,相关矩阵的第14行,1到13列正是我们需要的香浓指数与13个环境因子的相关性分析结果。把它整理在一个表格内,并整理成绘图用的tidy数据,并对环境因子进行排序

注: cor()默认使用Pearson算法,如果对数据正态性不自信的话,建议修改为Spearman算法。

# 打包6个分类的随机森林模型结果
rf_list <- results[, "RF_permuted"]
# 修改6个分类模型的名称,改为对应的分类
names(rf_list) <- ecosystems
# 生成变量重要性的值,如果p值小于0.05则表示对应变量解释量
important_vars <- do.call(rbind, lapply(seq_along(results[, "RF_permuted"]), function(i) data.frame(ENV = row.names(importance(rf_list[[i]])), 
                                                            Important = ifelse(importance(rf_list[[i]])[, 2] < 0.05, 
                                                                               importance(rf_list[[i]])[, 1], NA),
                                                            Eco = rep(names(rf_list[i]), each = nrow(importance(rf_list[[i]]))))))
# 合并数据
circle <- data.frame(ENV = colnames(ENV)[3:15]) %>%
  left_join(important_vars) %>% 
  melt(id = c("ENV", "Eco"), value.name = "Importance"); circle
# Joining, by = "ENV"
# ENV       Eco  variable Importance
# 1  SOC     Whole Important         NA
# 2  SOC Grassland Important         NA
# 3  SOC    Forest Important  13.227866
# 4  SOC  Cropland Important         NA
# 5  SOC   Wetland Important         NA
# 6  SOC    Desert Important         NA
# 7   pH     Whole Important         NA
# 8   pH Grassland Important         NA
# 9   pH    Forest Important         NA
# 10  pH  Cropland Important         NA
# 11  pH   Wetland Important         NA
# 12  pH    Desert Important         NA
# 13 CEC     Whole Important         NA
# 14 CEC Grassland Important         NA
# 15 CEC    Forest Important         NA
# 16 CEC  Cropland Important   6.204274
# 17 CEC   Wetland Important         NA
# 18 CEC    Desert Important         NA
# 19  TP     Whole Important         NA
# 20  TP Grassland Important         NA
# 21  TP    Forest Important         NA
# 22  TP  Cropland Important         NA
# 23  TP   Wetland Important         NA
# 24  TP    Desert Important         NA
# 25  TK     Whole Important         NA
# 26  TK Grassland Important         NA
# 27  TK    Forest Important         NA
# 28  TK  Cropland Important   6.969350
# 29  TK   Wetland Important         NA
# 30  TK    Desert Important   6.783699
# 31  AK     Whole Important         NA
# 32  AK Grassland Important         NA
# 33  AK    Forest Important         NA
# 34  AK  Cropland Important   7.491691
# 35  AK   Wetland Important   7.208659
# 36  AK    Desert Important   6.601446
# 37  TN     Whole Important         NA
# 38  TN Grassland Important         NA
# 39  TN    Forest Important   6.953556
# 40  TN  Cropland Important         NA
# 41  TN   Wetland Important         NA
# 42  TN    Desert Important         NA
# 43 NO3     Whole Important   5.075570
# 44 NO3 Grassland Important         NA
# 45 NO3    Forest Important         NA
# 46 NO3  Cropland Important         NA
# 47 NO3   Wetland Important         NA
# 48 NO3    Desert Important         NA
# 49 NH4     Whole Important   4.594003
# 50 NH4 Grassland Important         NA
# 51 NH4    Forest Important   6.911031
# 52 NH4  Cropland Important         NA
# 53 NH4   Wetland Important         NA
# 54 NH4    Desert Important         NA
# 55 DOC     Whole Important  14.319739
# 56 DOC Grassland Important  18.475196
# 57 DOC    Forest Important  13.514009
# 58 DOC  Cropland Important         NA
# 59 DOC   Wetland Important         NA
# 60 DOC    Desert Important   5.634245
# 61 MAT     Whole Important         NA
# 62 MAT Grassland Important         NA
# 63 MAT    Forest Important         NA
# 64 MAT  Cropland Important         NA
# 65 MAT   Wetland Important         NA
# 66 MAT    Desert Important         NA
# 67  SD     Whole Important         NA
# 68  SD Grassland Important         NA
# 69  SD    Forest Important   6.683399
# 70  SD  Cropland Important   5.815041
# 71  SD   Wetland Important         NA
# 72  SD    Desert Important         NA
# 73  Pl     Whole Important  26.440467
# 74  Pl Grassland Important  12.635259
# 75  Pl    Forest Important         NA
# 76  Pl  Cropland Important  12.873442
# 77  Pl   Wetland Important   7.492399
# 78  Pl    Desert Important  15.139708

circle$ENV<- factor(circle$ENV, rev(colnames(ENV)[3:15]))

# 划分不同生态系统的数据集
df_eco <- split(ENV[, -1:-2], ENV$Ecosystem)
names(df_eco)

# 计算不同生态系统的相关性
r <- data.frame(ENV = colnames(ENV)[3:15],
               Whole = (cor(ENV[, -1:-2]))[14, 1:13],
               Grassland = (cor(df_eco$Grassland))[14, 1:13],
               Forest = (cor(df_eco$Forest))[14, 1:13],
               Cropland = (cor(df_eco$Cropland))[14, 1:13],
               Wetland = (cor(df_eco$Wetland))[14, 1:13],
               Desert = (cor(df_eco$Desert))[14, 1:13])%>%
  melt(id = "ENV", value.name = "Correlation");r
# ENV  variable  Correlation
# 1  SOC     Whole  0.190880826
# 2   pH     Whole -0.047988531
# 3  CEC     Whole  0.375623831
# 4   TP     Whole  0.204011032
# 5   TK     Whole  0.282682786
# 6   AK     Whole -0.238701400
# 7   TN     Whole  0.205713695
# 8  NO3     Whole -0.213326858
# 9  NH4     Whole -0.076868083
# 10 DOC     Whole  0.417261347
# 11 MAT     Whole -0.336919922
# 12  SD     Whole -0.272103426
# 13  Pl     Whole  0.590955861
# 14 SOC Grassland  0.389609962
# 15  pH Grassland -0.292803759
# 16 CEC Grassland  0.466440818
# 17  TP Grassland  0.240020091
# 18  TK Grassland  0.250093957
# 19  AK Grassland  0.295479703
# 20  TN Grassland  0.286305242
# 21 NO3 Grassland  0.060433644
# 22 NH4 Grassland -0.179338677
# 23 DOC Grassland  0.449023819
# 24 MAT Grassland -0.445949001
# 25  SD Grassland -0.511629899
# 26  Pl Grassland  0.657737041
# 27 SOC    Forest  0.419911131
# 28  pH    Forest -0.259555477
# 29 CEC    Forest  0.441782003
# 30  TP    Forest -0.043325378
# 31  TK    Forest  0.184701577
# 32  AK    Forest  0.314240260
# 33  TN    Forest  0.394795298
# 34 NO3    Forest  0.191051827
# 35 NH4    Forest -0.062053889
# 36 DOC    Forest  0.472024615
# 37 MAT    Forest -0.246051709
# 38  SD    Forest -0.427179574
# 39  Pl    Forest  0.445715297
# 40 SOC  Cropland  0.190504600
# 41  pH  Cropland  0.079739140
# 42 CEC  Cropland  0.372432772
# 43  TP  Cropland  0.173902720
# 44  TK  Cropland  0.339013659
# 45  AK  Cropland -0.567700216
# 46  TN  Cropland  0.179234867
# 47 NO3  Cropland -0.426516867
# 48 NH4  Cropland  0.047660104
# 49 DOC  Cropland  0.431862312
# 50 MAT  Cropland -0.388846551
# 51  SD  Cropland -0.129670246
# 52  Pl  Cropland  0.693855669
# 53 SOC   Wetland  0.062759256
# 54  pH   Wetland -0.230794196
# 55 CEC   Wetland  0.292344133
# 56  TP   Wetland  0.170356678
# 57  TK   Wetland  0.265791237
# 58  AK   Wetland  0.323268430
# 59  TN   Wetland -0.004741411
# 60 NO3   Wetland  0.172574329
# 61 NH4   Wetland -0.160782185
# 62 DOC   Wetland  0.436793024
# 63 MAT   Wetland -0.386445255
# 64  SD   Wetland -0.189241202
# 65  Pl   Wetland  0.343211982
# 66 SOC    Desert  0.008834840
# 67  pH    Desert  0.298885289
# 68 CEC    Desert  0.155105965
# 69  TP    Desert  0.253336024
# 70  TK    Desert  0.485088124
# 71  AK    Desert -0.554681227
# 72  TN    Desert  0.173607814
# 73 NO3    Desert -0.256845001
# 74 NH4    Desert -0.144874726
# 75 DOC    Desert  0.443678302
# 76 MAT    Desert -0.263961749
# 77  SD    Desert -0.160875488
# 78  Pl    Desert  0.648853903

r$ENV <- factor(r$ENV, levels = rev(colnames(ENV)[3:15]))

# 合并相关性与随机森林重要性的结果
r <- r %>% left_join(circle[c("ENV", "Importance")], by = "ENV");head(r)
# ENV variable Correlation Importance
# 1 SOC    Whole   0.1908808         NA
# 2 SOC    Whole   0.1908808         NA
# 3 SOC    Whole   0.1908808   13.22787
# 4 SOC    Whole   0.1908808         NA
# 5 SOC    Whole   0.1908808         NA
# 6 SOC    Whole   0.1908808         NA

热图可视化

heatmap <- ggplot()+
  geom_tile(data = r, aes(x = variable, y = ENV, fill = Correlation))+
  scale_fill_gradientn(colors = c('#2D6DB1', 'white', '#DC1623'),
                       limit = c(-1, 1))+
  geom_point(data = circle, aes(x = Eco, y = ENV,
                                size = Importance), shape = 21)+
  theme_bw()+
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, color = "black",
                                   size = 12, vjust = 0.6),
        axis.text.y = element_text(color = 'black', size = 12),
        legend.title = element_text(size = 10))+
  labs(y = '', x = '')
heatmap

R语言ggplot2 | 绘制随机森林重要性+相关性热图-LMLPHP

合并随机森林重要性和热图

# barplot + heatmap + 
#   plot_layout(ncol = 1, heights = c(1, 6))
barplot %>%
  aplot::insert_bottom(heatmap, height = 6)
# graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)

R语言ggplot2 | 绘制随机森林重要性+相关性热图-LMLPHP
从结果来看,可以说是完美复刻!

附上所有代码

rm(list = ls())
# 在R中绘制随机森林模型中变量重要性和相关性的热图

library(vegan)
library(randomForest)
library(rfUtilities)
library(rfPermute)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(export)
library(reshape2)

Micro <- openxlsx::read.xlsx("OTU1.xlsx", 1)
ENV <- openxlsx::read.xlsx("ENV1.xlsx", 1);head(ENV)
# Sample Ecosystem       SOC       pH       CEC        TP       TK       AK        TN        NO3        NH4      DOC
# 1     A1 Grassland  27.05671 7.904906 15.323426 0.9808826 14.75433 180.5585 2.1198671 125.335557  0.3171696 118.5123
# 2     A2 Grassland  25.76547 8.573666 18.104304 0.8527650 14.69093 166.8692 2.4470941 107.296478  2.4062394 124.6120
# 3     A3    Forest  12.33270 8.277955  8.297831 0.4298910 12.94322 121.5713 0.7090047   6.961546  2.1797286 129.7396
# 4     A4    Forest  16.63684 7.150481 15.632651 0.3711634 13.53909 194.0584 1.0868895  23.967231 29.2321159 129.3390
# 5     A5  Cropland 131.72950 7.548969 30.291618 0.6691908 14.19519 144.4154 5.4975148  38.032274 17.6463368 106.5602
# 6     A6  Cropland 123.75436 7.840254 34.357811 0.5828902 14.51699 387.8284 7.9689639  59.413622 12.7478292 105.8368
# MAT       SD       Pl
# 1 2.0162632 1.855115 274.0981
# 2 2.1286678 2.370189 369.1070
# 3 3.5848455 2.037255 263.5001
# 4 2.6846039 2.026378 346.7367
# 5 0.5787488 2.380261 333.7921
# 6 0.6756787 1.590068 240.1557
# 计算香浓指数,并把它合并在环境变量数据中
ENV$shannon <- diversity(t(Micro), index = "shannon")

# 生成一个总的与先前的不同生态系统合并成一个新的数据集
ENV_total <- ENV
ENV_total$Ecosystem <- "Whole"

ENV_new <- rbind(ENV_total, ENV)

# 确定不同生态系统类型的数量
unique(ENV_new$Ecosystem)
# [1] "Whole"     "Grassland" "Forest"    "Cropland"  "Wetland"   "Desert" 
# # 使用随机森林计算各生态系统中环境生物因子对微生物香浓指数的重要性
# set.seed(1)#设置种子,确保结果一致
# RF_total <- randomForest(shannon ~ . , ENV[, -1:-2], importance = T)
# RFs_total <- rfPermute(shannon ~ . , ENV[, -1:-2])
# RF_total;importance(RFs_total)[,1:2]
# # 这里的%Var explained就是我们画柱状图时柱子的高度,大家可以整理下来
# # importance函数调取了每个变量对shannon指数的重要性,用它们定义热图中空心圆的大小

# 构建自定义函数进行并行运算
library(foreach)
library(doParallel)

# 设置核心数
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
registerDoParallel(cl)

# 获取生态系统列表
ecosystems <- unique(ENV_new$Ecosystem)

# 设置输出结果名称
options <- list()
options$results <- ecosystems

# 定义一个函数,对每个生态系统运行随机森林分析
run_single_rf <- function(data, eco) {
  if (!"Ecosystem" %in% names(data)) {
    stop("Error: 'Ecosystem' variable not found in data.")
  }
  if (!(eco %in% unique(data$Ecosystem))) {
    stop("Error: Invalid Ecosystem value.")
  }
  # 对当前生态系统的数据进行子集
  eco_data <- subset(data, Ecosystem == eco)[,-1:-2]
  
  # 设置种子
  set.seed(1)
  
  # 进行随机森林分析和置换检验
  RF <- randomForest(shannon~., eco_data, importance = T)
  RF_r2 <- rf.significance(RF, eco_data, nperm=99, ntree=501)$Rsquare
  RF_permuted <- rfPermute(shannon~., eco_data, nperm=99, ntree=501)
  
  # 返回结果
  list(ecosystem = eco, RF = RF, RF_r2 = RF_r2, RF_permuted = RF_permuted)
}

results <- foreach(i = ecosystems, .combine = rbind) %dopar% {
  
  library(randomForest)
  library(rfUtilities)
  library(rfPermute)
  
  # 运行单个随机森林分析
  run_single_rf(data = ENV_new, eco = i)
  
}

# 停止并行处理
stopCluster(cl)
registerDoSEQ()

# 确认结果
head(results)
# ecosystem   RF                      RF_r2     RF_permuted
# result.1 "Whole"     randomForest.formula,18 0.5722844 rfPermute,6
# result.2 "Grassland" randomForest.formula,18 0.6213995 rfPermute,6
# result.3 "Forest"    randomForest.formula,18 0.4347413 rfPermute,6
# result.4 "Cropland"  randomForest.formula,18 0.4695682 rfPermute,6
# result.5 "Wetland"   randomForest.formula,18 0.1822965 rfPermute,6
# result.6 "Desert"    randomForest.formula,18 0.5186782 rfPermute,6

results[, "RF_permuted"][[1]]
# An rfPermute model
# 
# Type of random forest: regression 
# Number of trees: 501 
# No. of variables tried at each split: 4 
# No. of permutation replicates: 100 
# Start time: 2023-04-18 22:56:38 
# End time: 2023-04-18 22:57:05 
# Run time: 26.8 secs 

# 绘制柱状图
names <- c("ecosystem", "RF_r2")
bar <- as.data.frame(results[,names])
bar$ecosystem <- unlist(bar$ecosystem)
bar$RF_r2 <- unlist(bar$RF_r2)
str(bar)
# 'data.frame':	6 obs. of  2 variables:
# $ ecosystem: chr  "Whole" "Grassland" "Forest" "Cropland" ...
# $ RF_r2    : num  0.572 0.621 0.435 0.47 0.182 ...
colnames(bar) <- c("variable", "Exp")
bar$variable<- factor(bar$variable)
barplot<- ggplot(bar, aes(variable, Exp))+
  geom_bar(stat = "identity", fill = "steelblue")+
  scale_y_continuous(expand = c(0,0))+
  theme_classic(base_line_size = 0.75)+
  theme(panel.background = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_text(color = "black", size = 12))+
  labs(title = "Explained variation (%)", size = 12)
barplot

# 打包6个分类的随机森林模型结果
rf_list <- results[, "RF_permuted"]
# 修改6个分类模型的名称,改为对应的分类
names(rf_list) <- ecosystems
# 生成变量重要性的值,如果p值小于0.05则表示对应变量解释量
important_vars <- do.call(rbind, lapply(seq_along(results[, "RF_permuted"]), function(i) data.frame(ENV = row.names(importance(rf_list[[i]])), 
                                                            Important = ifelse(importance(rf_list[[i]])[, 2] < 0.05, 
                                                                               importance(rf_list[[i]])[, 1], NA),
                                                            Eco = rep(names(rf_list[i]), each = nrow(importance(rf_list[[i]]))))))
# 合并数据
circle <- data.frame(ENV = colnames(ENV)[3:15]) %>%
  left_join(important_vars) %>% 
  melt(id = c("ENV", "Eco"), value.name = "Importance"); circle
# Joining, by = "ENV"
# ENV       Eco  variable Importance
# 1  SOC     Whole Important         NA
# 2  SOC Grassland Important         NA
# 3  SOC    Forest Important  13.227866
# 4  SOC  Cropland Important         NA
# 5  SOC   Wetland Important         NA
# 6  SOC    Desert Important         NA
# 7   pH     Whole Important         NA
# 8   pH Grassland Important         NA
# 9   pH    Forest Important         NA
# 10  pH  Cropland Important         NA
# 11  pH   Wetland Important         NA
# 12  pH    Desert Important         NA
# 13 CEC     Whole Important         NA
# 14 CEC Grassland Important         NA
# 15 CEC    Forest Important         NA
# 16 CEC  Cropland Important   6.204274
# 17 CEC   Wetland Important         NA
# 18 CEC    Desert Important         NA
# 19  TP     Whole Important         NA
# 20  TP Grassland Important         NA
# 21  TP    Forest Important         NA
# 22  TP  Cropland Important         NA
# 23  TP   Wetland Important         NA
# 24  TP    Desert Important         NA
# 25  TK     Whole Important         NA
# 26  TK Grassland Important         NA
# 27  TK    Forest Important         NA
# 28  TK  Cropland Important   6.969350
# 29  TK   Wetland Important         NA
# 30  TK    Desert Important   6.783699
# 31  AK     Whole Important         NA
# 32  AK Grassland Important         NA
# 33  AK    Forest Important         NA
# 34  AK  Cropland Important   7.491691
# 35  AK   Wetland Important   7.208659
# 36  AK    Desert Important   6.601446
# 37  TN     Whole Important         NA
# 38  TN Grassland Important         NA
# 39  TN    Forest Important   6.953556
# 40  TN  Cropland Important         NA
# 41  TN   Wetland Important         NA
# 42  TN    Desert Important         NA
# 43 NO3     Whole Important   5.075570
# 44 NO3 Grassland Important         NA
# 45 NO3    Forest Important         NA
# 46 NO3  Cropland Important         NA
# 47 NO3   Wetland Important         NA
# 48 NO3    Desert Important         NA
# 49 NH4     Whole Important   4.594003
# 50 NH4 Grassland Important         NA
# 51 NH4    Forest Important   6.911031
# 52 NH4  Cropland Important         NA
# 53 NH4   Wetland Important         NA
# 54 NH4    Desert Important         NA
# 55 DOC     Whole Important  14.319739
# 56 DOC Grassland Important  18.475196
# 57 DOC    Forest Important  13.514009
# 58 DOC  Cropland Important         NA
# 59 DOC   Wetland Important         NA
# 60 DOC    Desert Important   5.634245
# 61 MAT     Whole Important         NA
# 62 MAT Grassland Important         NA
# 63 MAT    Forest Important         NA
# 64 MAT  Cropland Important         NA
# 65 MAT   Wetland Important         NA
# 66 MAT    Desert Important         NA
# 67  SD     Whole Important         NA
# 68  SD Grassland Important         NA
# 69  SD    Forest Important   6.683399
# 70  SD  Cropland Important   5.815041
# 71  SD   Wetland Important         NA
# 72  SD    Desert Important         NA
# 73  Pl     Whole Important  26.440467
# 74  Pl Grassland Important  12.635259
# 75  Pl    Forest Important         NA
# 76  Pl  Cropland Important  12.873442
# 77  Pl   Wetland Important   7.492399
# 78  Pl    Desert Important  15.139708

circle$ENV<- factor(circle$ENV, rev(colnames(ENV)[3:15]))

# 划分不同生态系统的数据集
df_eco <- split(ENV[, -1:-2], ENV$Ecosystem)
names(df_eco)

# 计算不同生态系统的相关性
r <- NULL
r <- data.frame(ENV = colnames(ENV)[3:15],
               Whole = (cor(ENV[, -1:-2]))[14, 1:13],
               Grassland = (cor(df_eco$Grassland))[14, 1:13],
               Forest = (cor(df_eco$Forest))[14, 1:13],
               Cropland = (cor(df_eco$Cropland))[14, 1:13],
               Wetland = (cor(df_eco$Wetland))[14, 1:13],
               Desert = (cor(df_eco$Desert))[14, 1:13])%>%
  melt(id = "ENV", value.name = "Correlation");r
# ENV  variable  Correlation
# 1  SOC     Whole  0.190880826
# 2   pH     Whole -0.047988531
# 3  CEC     Whole  0.375623831
# 4   TP     Whole  0.204011032
# 5   TK     Whole  0.282682786
# 6   AK     Whole -0.238701400
# 7   TN     Whole  0.205713695
# 8  NO3     Whole -0.213326858
# 9  NH4     Whole -0.076868083
# 10 DOC     Whole  0.417261347
# 11 MAT     Whole -0.336919922
# 12  SD     Whole -0.272103426
# 13  Pl     Whole  0.590955861
# 14 SOC Grassland  0.389609962
# 15  pH Grassland -0.292803759
# 16 CEC Grassland  0.466440818
# 17  TP Grassland  0.240020091
# 18  TK Grassland  0.250093957
# 19  AK Grassland  0.295479703
# 20  TN Grassland  0.286305242
# 21 NO3 Grassland  0.060433644
# 22 NH4 Grassland -0.179338677
# 23 DOC Grassland  0.449023819
# 24 MAT Grassland -0.445949001
# 25  SD Grassland -0.511629899
# 26  Pl Grassland  0.657737041
# 27 SOC    Forest  0.419911131
# 28  pH    Forest -0.259555477
# 29 CEC    Forest  0.441782003
# 30  TP    Forest -0.043325378
# 31  TK    Forest  0.184701577
# 32  AK    Forest  0.314240260
# 33  TN    Forest  0.394795298
# 34 NO3    Forest  0.191051827
# 35 NH4    Forest -0.062053889
# 36 DOC    Forest  0.472024615
# 37 MAT    Forest -0.246051709
# 38  SD    Forest -0.427179574
# 39  Pl    Forest  0.445715297
# 40 SOC  Cropland  0.190504600
# 41  pH  Cropland  0.079739140
# 42 CEC  Cropland  0.372432772
# 43  TP  Cropland  0.173902720
# 44  TK  Cropland  0.339013659
# 45  AK  Cropland -0.567700216
# 46  TN  Cropland  0.179234867
# 47 NO3  Cropland -0.426516867
# 48 NH4  Cropland  0.047660104
# 49 DOC  Cropland  0.431862312
# 50 MAT  Cropland -0.388846551
# 51  SD  Cropland -0.129670246
# 52  Pl  Cropland  0.693855669
# 53 SOC   Wetland  0.062759256
# 54  pH   Wetland -0.230794196
# 55 CEC   Wetland  0.292344133
# 56  TP   Wetland  0.170356678
# 57  TK   Wetland  0.265791237
# 58  AK   Wetland  0.323268430
# 59  TN   Wetland -0.004741411
# 60 NO3   Wetland  0.172574329
# 61 NH4   Wetland -0.160782185
# 62 DOC   Wetland  0.436793024
# 63 MAT   Wetland -0.386445255
# 64  SD   Wetland -0.189241202
# 65  Pl   Wetland  0.343211982
# 66 SOC    Desert  0.008834840
# 67  pH    Desert  0.298885289
# 68 CEC    Desert  0.155105965
# 69  TP    Desert  0.253336024
# 70  TK    Desert  0.485088124
# 71  AK    Desert -0.554681227
# 72  TN    Desert  0.173607814
# 73 NO3    Desert -0.256845001
# 74 NH4    Desert -0.144874726
# 75 DOC    Desert  0.443678302
# 76 MAT    Desert -0.263961749
# 77  SD    Desert -0.160875488
# 78  Pl    Desert  0.648853903
r$ENV <- factor(r$ENV, levels = rev(colnames(ENV)[3:15]))

# 合并相关性与随机森林重要性的结果
r <- r %>% left_join(circle[c("ENV", "Importance")], by = "ENV");head(r)
# ENV variable Correlation Importance
# 1 SOC    Whole   0.1908808         NA
# 2 SOC    Whole   0.1908808         NA
# 3 SOC    Whole   0.1908808   13.22787
# 4 SOC    Whole   0.1908808         NA
# 5 SOC    Whole   0.1908808         NA
# 6 SOC    Whole   0.1908808         NA

heatmap <- ggplot()+
  geom_tile(data = r, aes(x = variable, y = ENV, fill = Correlation))+
  scale_fill_gradientn(colors = c('#2D6DB1', 'white', '#DC1623'),
                       limit = c(-1, 1))+
  geom_point(data = circle, aes(x = Eco, y = ENV,
                                size = Importance), shape = 21)+
  theme_bw()+
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, color = "black",
                                   size = 12, vjust = 0.6),
        axis.text.y = element_text(color = 'black', size = 12),
        legend.title = element_text(size = 10))+
  labs(y = '', x = '')
heatmap

# barplot + heatmap +
#   plot_layout(ncol = 1, heights = c(1, 6))
# graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)


# barplot + heatmap + 
#   plot_layout(ncol = 1, heights = c(1, 6))
barplot %>%
  aplot::insert_bottom(heatmap, height = 6)
graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)
04-20 17:09