使用igraph采样不同大小的子图

本文介绍了使用igraph采样不同大小的子图的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个igraph对象mygraph，具有约10,000个节点和约145,000个边，并且需要从该图创建许多子图，但大小不同.我需要的是根据确定的大小(从5个节点到500个节点)创建子图，其中每个子图中的所有节点均已连接.我需要为每个大小创建约1,000个子图(即，大小5的1000个子图，大小6的1000个子图，依此类推)，然后根据不同的节点属性为每个图计算一些值.我有一些代码，但是所有的计算都需要很长时间.我曾想过使用graphlets函数来获取不同的大小，但是每次在计算机上运行它时，由于内存问题而崩溃.

I have an igraph object mygraph with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs from this graph but with different sizes.What I need is to create subgraphs from a determined size (from 5 nodes to 500 nodes) where all the nodes are connected in each subgraph. I need to create ~1,000 subgraphs for each size (i.e, 1000 subgraphs for size5, 1000 for size 6, and so on), and then calculate some values for each graph according to different node attributes.I have some code but it takes a long time to do all the calculations. I thought in using the graphlets function in order to get the different sizes but every time I run it on my computer it crash due to memory issues.

这是我正在使用的代码:

Here is the code I am using:

第一步是创建一个函数，以创建不同大小的子图并进行所需的计算.

First step was to create a function to create the subgraphs of different sizes and do the calculations needed.

random_network<-function(size,G){
     score_fun<-function(g){
          subsum <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
           subsum
           }

      genes.idx <- V(G)$name
      perm <- c()
      while(length(perm)<1000){
           seed<-sample(genes.idx,1)
           while( length(seed)<size ){
                tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
                tmp.neigh <- setdiff(tmp.neigh, seed)
                if( length(tmp.neigh)>0 )
                seed<-c(seed,sample(tmp.neigh,1)) else break
            }
      if( length(seed)==size )
      perm <- c(perm,score_fun(induced.subgraph(G,seed)))
      }
      perm
     }

第二步是将该功能应用于实际图形

Second step was to apply the function to the actual graph

 ### generate some example data
 library(igraph)
 my_graph <- erdos.renyi.game(10000, 0.0003)
 V(my_graph)$name <- 1:vcount(my_graph)
 V(my_graph)$weight <- rnorm(10000)
 V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)

 ### Run the code to get the subgraphs from different size and do calculations based on nodes
 genesets.length<- seq(5:500)
 genesets.length.null.dis <- list()
 for(k in 5:max(genesets.length){
     genesets.length.null.dis[[as.character(k)]] <- random_network(size=k,G=my_graph)
  }

推荐答案

使用Rcpp软件包是一种比Base R进一步提高代码速度的方法.考虑以下完整操作的Rcpp实现.输入以下内容:

One approach to speed up your code further than what's possible in base R would be to use the Rcpp package. Consider the following Rcpp implementation of the full operation. It takes as input the following:

valid:足够大的组件中所有节点的索引
el，deg，firstPos:图形的边缘列表的表示形式(节点i的邻居是el[firstPos[i]]，el[firstPos[i]+1]，...，el[firstPos[i]+deg[i]-1]).
size:要采样的子图大小
nrep:重复次数
weights:存储在V(G)$weight
RWRNodeweight:存储在V(G)$RWRNodeweight

valid: The indices of all nodes that are in a large-enough component
el, deg, firstPos: A representation of the graph's edge list (node i's neighbors are el[firstPos[i]], el[firstPos[i]+1], ..., el[firstPos[i]+deg[i]-1]).
size: The subgraph size to sample
nrep: The number of repetitions
weights: The edge weights stored in V(G)$weight
RWRNodeweight: The edge weights stored in V(G)$RWRNodeweight

library(Rcpp)
cppFunction(
"NumericVector scores(IntegerVector valid, IntegerVector el, IntegerVector deg,
                      IntegerVector firstPos, const int size, const int nrep,
                      NumericVector weights, NumericVector RWRNodeweight) {
  const int n = deg.size();
  std::vector<bool> used(n, false);
  std::vector<bool> neigh(n, false);
  std::vector<int> neighList;
  std::vector<double> scores(nrep);
  for (int outerIter=0; outerIter < nrep; ++outerIter) {
    // Initialize variables
    std::fill(used.begin(), used.end(), false);
    std::fill(neigh.begin(), neigh.end(), false);
    neighList.clear();

    // Random first node
    int recent = valid[rand() % valid.size()];
    used[recent] = true;
    double wrSum = weights[recent] * RWRNodeweight[recent];
    double rrSum = RWRNodeweight[recent] * RWRNodeweight[recent];

    // Each additional node
    for (int idx=1; idx < size; ++idx) {
      // Add neighbors of recent
      for (int p=firstPos[recent]; p < firstPos[recent] + deg[recent]; ++p) {
        if (!neigh[el[p]] && !used[el[p]]) {
          neigh[el[p]] = true;
          neighList.push_back(el[p]);
        }
      }

      // Compute new node to add from all neighbors
      int newPos = rand() % neighList.size();
      recent = neighList[newPos];
      used[recent] = true;
      wrSum += weights[recent] * RWRNodeweight[recent];
      rrSum += RWRNodeweight[recent] * RWRNodeweight[recent];

      // Remove from neighList
      neighList[newPos] = neighList[neighList.size() - 1];
      neighList.pop_back();
    }

    // Compute score from wrSum and rrSum
    scores[outerIter] = wrSum / sqrt(rrSum);
  }
  return NumericVector(scores.begin(), scores.end());
}
")

现在我们在基数R中需要做的就是为scores生成参数，这很容易做到:

Now all that we need to do in base R is generate the arguments for scores, which can be done pretty easily:

josilber.rcpp <- function(size, num.rep, G) {
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  # Construct an edge list representation for use in the Rcpp code
  el <- get.edgelist(G, names=FALSE) - 1
  el <- rbind(el, el[,2:1])
  el <- el[order(el[,1]),]
  deg <- degree(G)
  first.pos <- c(0, cumsum(head(deg, -1)))

  # Run the proper number of replications
  scores(valid-1, el[,2], deg, first.pos, size, num.rep,
         as.numeric(V(G)$weight), as.numeric(V(G)$RWRNodeweight))
}

与原始代码和迄今为止我们看到的所有igraph解决方案相比，执行1000次复制的时间非常快(请注意，对于大多数基准测试，我已经针对以下测试了原始的josilber和random_network函数) 1次复制而不是1000次复制，因为测试1000次会花费非常长的时间):

The time to perform 1000 replications is blazing fast compared to the original code and all igraph solutions we've seen so far (note that for much of this benchmarking I tested the original josilber and random_network functions for 1 replication instead of 1000 because testing for 1000 would take a prohibitively long time):

大小= 10:0.06秒(比我先前建议的josilber函数高1200倍，比原始random_network函数高4000倍)
大小= 100:0.08秒(比我以前建议的josilber函数高8700倍，比原始random_network函数高162000倍)
大小= 1000:0.13秒(比我以前建议的josilber函数快32000倍，比原始random_network函数快 2040万倍)
大小= 5000:0.32秒(比我以前建议的josilber函数快68000倍，比原始random_network函数快 2.9亿倍)

Size=10: 0.06 seconds (a 1200x speedup over my previously proposed josilber function and a 4000x speedup over the original random_network function)
Size=100: 0.08 seconds (a 8700x speedup over my previously proposed josilber function and a 162000x speedup over the original random_network function)
Size=1000: 0.13 seconds (a 32000x speedup over my previously proposed josilber function and a 20.4 million times speedup over the original random_network function)
Size=5000: 0.32 seconds (a 68000x speedup over my previously proposed josilber function and a 290 million times speedup over the original random_network function)

简而言之，Rcpp可能使得对于从5到500的每个大小计算1000个重复项是可行的(此操作可能总共花费大约1分钟)，而为每个大小计算1000个重复项将太慢了使用到目前为止已经提出的纯R代码来确定大小.

In short, Rcpp probably makes it feasible to compute 1000 replicates for each size from 5 to 500 (this operation will probably take about 1 minute in total), while it would be prohibitively slow to compute the 1000 replicates for each of these sizes using the pure R code that's been proposed so far.

这篇关于使用igraph采样不同大小的子图的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！