Question

I'm new to Stack Overflow, even though I've been reading what you guys post and answer for quite some time (I just didn't have the time to join while working on my Master's).

TL;DR: I used the script presented here to process a dataset structured like the one below to get edges for a network. It worked but took way too long to process (24h for 2k rows). Any tips for a beginner R user on making it faster?

In my last research project I ended up with a data.frame much like:

ID | Trait 1 | Trait 2 | Trait 3 | Trait 4 |  Trait 5
01 |  TRUE   |   TRUE  |  Photo  |   City  |  Portrait
02 |  FALSE  |   TRUE  |  Draw   |  Child  |  Portrait
03 |  TRUE   |  FALSE  |  Photo  |   Misc  |  Landscape
.
.
.

And this went on for some 2k rows. The intent was to build a network where each ID is a node, and the sum of common traits between 2 IDs would make up a weighted edge, i.e. ID 01 would have a weight 2 edge to both ID 2 and 3, while ID 2 would not have an edge to ID 3.

To work this out I used the following script that runs through each row comparing each column value to add weight (each match = +1), ignoring rows already compared (being an undirected network, it was unnecessary to match both ways):

Key: Source = ID to compare to; Target = ID being compared; Weight = Sum of matching cells/traits.

findEdges <- function(){
    input <- read.csv("nodes.csv",header=TRUE,stringsAsFactors=FALSE,sep=";")
    edges <- read.csv("edges.csv",header=TRUE,stringsAsFactors=FALSE,skip=1,colClasses=c("integer","integer","integer"),col.names=c("Source","Target","Weight"))
    for(i in 1:nrow(input)){ #row to be compared: Source
        for(j in 1:nrow(input)){ #rows that will compare to: Target
            weight <- 0
            if( i >= j ){
            } else {
                for(k in 1:ncol(input)){ #column by column comparison
                    col <- k
                    if(input[i,k] == input[j,k]){ #edge weight modifier
                        weight <- weight+1
                        }
                }
                print(c("source= ",i,"target= ",j,"weight= ",weight)) #visual feedback of running script
                newRow <- data.frame(Source=i,Target=j,Weight=weight) #create row for compared pair
                edges <- rbind(edges,newRow) # add edge row to data frame
            }
        }
    }
    write.csv(edges,"edges.csv") #write data frame to csv file
}
findEdges()

which worked just fine and gave me the weighted edgelist I needed. Each row of the edgelist would be presented as:

Source | Target | Weight
  01   |   02   |   2
  01   |   03   |   2

and so on...

However, this script took almost 24h to process the entire dataset (2k rows, 5 columns except ID), and while that was not an issue before, I think it would be nice to check for some tips on a better/faster way to achieve the same results.

Recommended answer

One approach would be to process each column separately, generating a pairwise similarity matrix between the rows. For instance, let's pretend we're operating on a single column:

col <- c(1, 1, 2, 3, 2, 4)
outer(col, col, "==") * 1
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    0    0    0    0
# [2,]    1    1    0    0    0    0
# [3,]    0    0    1    0    1    0
# [4,]    0    0    0    1    0    0
# [5,]    0    0    1    0    1    0
# [6,]    0    0    0    0    0    1

The outer function performs our operator (==) between each pair of elements, returning the matrix (the *1 is just to convert TRUE/FALSE to 0/1). One nice aspect is that this is a vectorized operator so it will work very quickly compared to an approach involving a for loop.
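To make it concrete what `outer` is replacing, here is a minimal sketch that compares it with an explicit double loop on a character column. The vector `traits` and the helper `loop_equal` are made up for illustration and are not part of the original data or script.

```r
# Hypothetical trait column, used only for illustration
traits <- c("Photo", "Draw", "Photo", "Paint")

# Explicit double loop: fill an n x n matrix one cell at a time
loop_equal <- function(x) {
  n <- length(x)
  m <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      m[i, j] <- as.numeric(x[i] == x[j])
    }
  }
  m
}

# Vectorized equivalent: a single call compares every pair of elements
vec_equal <- outer(traits, traits, "==") * 1

# Check that both approaches produce the same 0/1 matrix
all.equal(loop_equal(traits), vec_equal)
```

Both versions do the same n x n comparisons, but `outer` performs them in one vectorized call instead of interpreting the loop body once per cell.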

Now, it's clear that all we need to do is get a similarity matrix for each column and add them all up.

(dat <- data.frame(ID=c(1, 2, 3), T1=c(F, F, T), T2=c(T, T, F), T3=c("Photo", "Draw", "Photo"), T4=c("City", "Child", "Misc"), T5=c("Portrait", "Portrait", "Landscape")))
#   ID    T1    T2    T3    T4        T5
# 1  1 FALSE  TRUE Photo  City  Portrait
# 2  2 FALSE  TRUE  Draw Child  Portrait
# 3  3  TRUE FALSE Photo  Misc Landscape
(res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[,x], dat[,x], "=="))))
#      [,1] [,2] [,3]
# [1,]    5    3    1
# [2,]    3    5    0
# [3,]    1    0    5

This has identified that each row has all 5 columns in common with itself. Further, rows 1 and 2 have 3 elements in common, rows 1 and 3 have 1 element in common, and rows 2 and 3 have no elements in common.
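As a sanity check, one entry of the summed matrix can be compared against a direct column-by-column count, in the spirit of the inner loop from the question. This is only a sketch: `direct_12` is a name introduced here, and `stringsAsFactors = FALSE` is added so the example behaves the same across R versions.

```r
dat <- data.frame(ID = c(1, 2, 3),
                  T1 = c(FALSE, FALSE, TRUE),
                  T2 = c(TRUE, TRUE, FALSE),
                  T3 = c("Photo", "Draw", "Photo"),
                  T4 = c("City", "Child", "Misc"),
                  T5 = c("Portrait", "Portrait", "Landscape"),
                  stringsAsFactors = FALSE)

# Sum of the per-column equality matrices (ID column skipped)
res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[, x], dat[, x], "==")))

# Direct count of matching traits between rows 1 and 2
direct_12 <- sum(vapply(2:ncol(dat),
                        function(k) dat[1, k] == dat[2, k],
                        logical(1)))

direct_12 == res[1, 2]              # the matrix entry agrees with the direct count
isSymmetric(res)                    # matching is undirected, so res is symmetric
all(diag(res) == ncol(dat) - 1)     # every row matches itself on all trait columns
```

Because the matrix is symmetric, only the cells above the diagonal carry distinct information for an undirected network.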

You can easily convert at the end from wide to long representation for the graph (here I've filtered out self-links and edges with source id > target id):

subset(cbind(expand.grid(Source=dat$ID, Target=dat$ID), Weight=as.vector(res)),
       Source < Target)
#   Source Target Weight
# 4      1      2      3
# 7      1      3      1
# 8      2      3      0
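The question treats pairs with no shared trait as having no edge, so the zero-weight rows may be unwanted. One possible variation, sketched here with `upper.tri` instead of `expand.grid`, picks out the cells above the diagonal and then drops the zero weights. The matrix is retyped by hand from the output above, and the names `ids`, `idx` and `edges` are introduced only for this example.

```r
# Similarity matrix from the three-row example, entered by hand
res <- matrix(c(5, 3, 1,
                3, 5, 0,
                1, 0, 5), nrow = 3, byrow = TRUE)
ids <- c(1, 2, 3)

# Row/column positions of the cells strictly above the diagonal
idx <- which(upper.tri(res), arr.ind = TRUE)

# Long representation: one row per unordered pair of IDs
edges <- data.frame(Source = ids[idx[, "row"]],
                    Target = ids[idx[, "col"]],
                    Weight = res[idx])

# Drop pairs that share no trait, since they are not edges in the network
edges <- edges[edges$Weight > 0, ]
edges
```

Whether zero-weight pairs should be kept depends on what the downstream network tool expects, so the filter is optional.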

Benchmarking shows that the vectorized outer function gives us a big advantage over the for loop:

set.seed(144)
big.dat <- data.frame(ID=1:100, A=sample(letters, 100, replace=T), B=sample(letters, 100, replace=T), C=sample(1:10, 100, replace=T))
OP <- function(dat) {
  edges <- data.frame(Source=c(), Target=c(), Weight=c())
  for (i in 1:nrow(dat)) {
    for (j in 1:nrow(dat)) {
      if (i < j) {
        weight <- 0
        for (k in 2:ncol(dat)) {
          if (dat[i,k] == dat[j,k]) {
            weight <- weight + 1
          }
        }
        edges <- rbind(edges, data.frame(Source=i, Target=j, Weight=weight))
      }
    }
  }
  edges
}
josilber <- function(dat) {
  res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[,x], dat[,x], "==")))
  ret <- subset(cbind(expand.grid(Source=dat$ID, Target=dat$ID), Weight=as.vector(res)), Source < Target)
  # Changes to exactly match OP's output
  ret <- ret[order(ret$Source, ret$Target),]
  row.names(ret) <- NULL
  ret
}
all.equal(OP(big.dat), josilber(big.dat))
# [1] TRUE
library(microbenchmark)
microbenchmark(OP(big.dat), josilber(big.dat), times=10)
# Unit: milliseconds
#               expr         min          lq        mean      median          uq         max neval
#        OP(big.dat) 5931.354448 6062.872595 6137.497152 6076.736039 6175.002149 6519.977217    10
#  josilber(big.dat)    5.882283    5.914646    6.316981    5.978082    6.368297    8.801991    10

We achieved about a 1000x speedup for the example with 100 rows using the vectorized approach.
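For a dataset closer in size to the one in the question (about 2k rows and 5 trait columns), a rough sketch like the one below could be used to time the vectorized approach on your own machine. The data frame `sim.dat` is synthetic and the function name `edge_list` is made up here; no timing is quoted because it depends on the hardware. Note that each `outer` call builds an n x n matrix, so memory use grows quadratically with the number of rows.

```r
set.seed(1)
n <- 2000

# Synthetic stand-in for the real data: 2 logical and 3 character trait columns
sim.dat <- data.frame(ID = 1:n,
                      T1 = sample(c(TRUE, FALSE), n, replace = TRUE),
                      T2 = sample(c(TRUE, FALSE), n, replace = TRUE),
                      T3 = sample(c("Photo", "Draw"), n, replace = TRUE),
                      T4 = sample(c("City", "Child", "Misc"), n, replace = TRUE),
                      T5 = sample(c("Portrait", "Landscape"), n, replace = TRUE),
                      stringsAsFactors = FALSE)

# Same idea as the answer: sum per-column equality matrices, then go long
edge_list <- function(dat) {
  res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[, x], dat[, x], "==")))
  ret <- cbind(expand.grid(Source = dat$ID, Target = dat$ID), Weight = as.vector(res))
  ret[ret$Source < ret$Target, ]   # keep each unordered pair once, no self-links
}

timing <- system.time(edges <- edge_list(sim.dat))
timing

# One row per unordered pair of IDs
nrow(edges) == n * (n - 1) / 2
```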
