本文介绍了(快速)成对比较其元素具有"a/b"的矩阵列.格式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的字符矩阵很大(15000 x 150),格式如下:

I have a big character matrix (15000 x 150), and with the following format:

       A     B       C     D
[1,] "0/0" "0/1"   "0/0" "1/1"
[2,] "1/1" "1/1"   "0/1" "0/1"
[3,] "1/2" "0/3"   "1/1" "2/2"
[4,] "0/0" "0/0"   "2/2" "0/0"
[5,] "0/0" "0/0"   "0/0" "0/0"

我需要在列之间进行成对比较,并获得其中行的比例

I need to do pairwise comparison between columns and get the proportion of rows where

  • '/'分隔的两个字符串都不相等(编码为0);
  • 只有一个用'/'分隔的字符串是相等的(编码为1);
  • 两个用'/'分隔的字符串都相等(编码为2).
  • neither string separated by '/' is equal (coded as 0);
  • only one string separated by '/' is equal (coded as 1);
  • both strings separated by '/' are equal (coded as 2).

上述5 x 4样本矩阵的预期输出为

The expected output for the above sample 5 x 4 matrix is

          0    1    2
A    B    0.2  0.2  0.6
A    C    0.2  0.4  0.4
A    D    0.2  0.4  0.4
B    C    0.4  0.4  0.2
B    D    0.2  0.4  0.4
C    D    0.6  0.0  0.4

我尝试使用pmatch,但是无法进行成对比较以获得上述输出.任何帮助表示赞赏.

I have tried using pmatch, however not able to do pairwise comparison to get the above output. any help is appreciated.

修订后的问题

是否可以在两对之间排除值"0/0"以获得比例?即当比较A和B时,排除A = B = 0/0时得到其余部分的比例吗?

Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?

推荐答案

这是我到目前为止可以提供的:

This is what I could provide so far:

fun1 <- function (S) {
  n <- ncol(S)
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  k <- 1L
  for (j in 1:(n - 1)) {
    x <- scan(text = S[, j], what = integer(), sep = "/", quiet = TRUE)
    for (i in (j + 1):n) {
      y <- scan(text = S[, i], what = integer(), sep = "/", quiet = TRUE)
      count <- tabulate(.colSums(x == y, 2L, length(x) / 2L) + 1L)
      z[k, ] <- count / sum(count)
      k <- k + 1L
      }
    }
  z
  }

它看起来很糟糕,因为它有一个用R编写的双循环嵌套,但是最内部的内核通过使用scan.colSumstabulate极其高效.迭代的总次数为choose(ncol(S), 2),对于150列矩阵而言,迭代次数不多.如果需要,我可以用Rcpp版本替换fun1.

It looks bad as it has a double loop nest written in R, but the innermost kernel is extremely efficient by using scan, .colSums and tabulate. The total number of iterations is choose(ncol(S), 2), not too many for your 150-column matrix. I can replace fun1 by an Rcpp version if you want.

## your data
S <- structure(c("0/0", "1/1", "1/2", "0/0", "0/0", "0/1", "1/1",
"0/3", "0/0", "0/0", "0/0", "0/1", "1/1", "2/2", "0/0", "1/1",
"0/1", "2/2", "0/0", "0/0"), .Dim = c(5L, 4L), .Dimnames = list(
NULL, c("A", "B", "C", "D")))

fun1(S)
#      0   1   2
#A&B 0.2 0.2 0.6
#A&C 0.2 0.4 0.4
#A&D 0.2 0.4 0.4
#B&C 0.4 0.4 0.2
#B&D 0.2 0.4 0.4
#C&D 0.6 0.0 0.4


性能

哈,当我在15000 x 150矩阵上实际测试函数时,我发现:

Ha, when I actually test my function on a 15000 x 150 matrix I found that:

  1. 我可以将scan从循环嵌套中移出以加快速度,也就是说,我可以一次性将字符矩阵扫描为整数矩阵;
  2. scan(text = blabla)花费了很多时间,而scan(file = blabla)却很快,因此值得从文本文件中读取数据;
  3. 使用文本文件对文件格式很敏感,因此编写健壮的代码很棘手.
  1. I could move scan out of the loop nest for speedup, that is, I could scan the character matrix into an integer matrix in one go;
  2. scan(text = blabla) takes forever, while scan(file = blabla) is fast, so it could be worth reading data from a text file;
  3. working with a text file is sensitive to the format of the file so writing a robust code is tricky.

我生成了具有文件访问权限的版本fun2,以及使用Rcpp进行循环嵌套的版本fun3.事实证明:

I produced a version fun2 with file access, and a version fun3 using Rcpp for the loop nest. It turns out that:

  1. 从文件中读取确实更好(但是我们必须以"/"分隔格式提供文件);
  2. 循环的Rcpp实现很有意义.

我回来了并将它们张贴在这里(请参阅版本2 ),我看到了user20650的(以strsplit开头).启动时,我从选项中排除了strsplit,因为我认为使用字符串操作可能很慢.是的,它很慢,但仍然比scan快.因此,我使用strsplitfun5以及Rcpp编写了fun4(请参阅版本3 ).分析表明,执行时间的60%花费在strsplit中,因此它确实是性能杀手.然后,我用一个更简单的C ++实现替换了strsplitunlistas.integermatrix.它产生10倍的提升!!好吧,如果您仔细考虑,这是合理的.通过使用C库<stdlib.h>中的atoi(或strtol),我们可以将字符串直接转换为整数,从而消除了所有字符串操作!

I came back and posted them here (see revision 2), and I saw user20650's starting with strsplit. I excluded strsplit from my option when I started, because I think operation with string can be slow. Yes, it is slow, but still faster than scan. So I wrote a fun4 using strsplit and a corresponding fun5 with Rcpp (see revision 3). Profiling says that 60% of the execution time is spent in strsplit so it is indeed a performance killer. Then I replaced strsplit, unlist, as.integer and matrix with a single, simpler C++ implementation. It yields a 10x boost!! Well, this is reasonable if you think about it carefully. By using atoi (or strtol) from C library <stdlib.h>, we can directly translate strings into integers, so all string operations are eliminated!

长话短说,我只提供最终的,最快的版本.

Long story short, I only provide the final, fastest version.

library(Rcpp)

cppFunction("IntegerMatrix getInt (CharacterMatrix Char) {
  int m = Char.nrow(), n = Char.ncol();
  IntegerMatrix Int(2 * m, n);
  char *s1, *s2;
  int i, *iptr = &Int(0, 0);
  for (i = 0; i < m * n; i++) {
    s1 = (char *)Char[i]; s2 = s1;
    while(*s2 != '/') s2++; *iptr++ = atoi(s1);
    s2++; *iptr++ = atoi(s2);
    }
  return Int;
  }")

cppFunction('NumericMatrix pairwise(NumericMatrix z, IntegerMatrix Int) {
  int m = Int.nrow() / 2, n = Int.ncol();
  int i, j, k, *x, *y, count[3], *end; bool b1 = 0, b2 = 0;
  double M = 1 / (double)m;
  for (k = 0, j = 0; j < (n - 1); j++) {
    end = &Int(2 * m, j);
    for (i = j + 1; i < n; i++, k++) {
      x = &Int(0, j); y = &Int(0, i);
      count[0] = 0; count[1] = 0; count[2] = 0;
      for (; x < end; x += 2, y += 2) {
        b1 = (x[0] == y[0]);
        b2 = (x[1] == y[1]);
        count[(int)b1 + (int)b2]++;
        }
      z(k, 0) = (double)count[0] * M;
      z(k, 1) = (double)count[1] * M;
      z(k, 2) = (double)count[2] * M;
      }
    }
  return z;
  }')

fun7 <- function (S) {
  ## separate rows using Rcpp; `Int` is an integer matrix
  n <- ncol(S)
  Int <- getInt(S)
  m <- nrow(Int) / 2
  ## initialize the resulting matrix `z`
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  ## use Rcpp for pairwise summary
  pairwise(z, Int)
  }

让我们生成一个随机的15000 x 150矩阵并尝试一下.

Let's generate a random 15000 x 150 matrix and try it.

sim <- function (m, n) {
  matrix(sample(c("0/0", "0/1", "1/0", "1/1"), m * n, TRUE), m, n,
         dimnames = list(NULL, 1:n))
  }

S <- sim(15000, 150)
system.time(oo <- fun7(S))
#   user  system elapsed
#  1.324   0.000   1.325

哦,这很快就亮了!

这种适应在C/C ++级别上很简单.只是一个附加的if测试.

This kind of adaptation is straightforward at C / C++ level. Just an addition if test.

## a new C++ function `pairwise_exclude00`
cppFunction('NumericMatrix pairwise_exclude00(NumericMatrix z, IntegerMatrix Int) {
  int m = Int.nrow() / 2, n = Int.ncol();
  int i, j, k, *x, *y, count[3], size, *end;
  bool b1 = 0, b2 = 0, exclude = 0;
  double M;
  for (k = 0, j = 0; j < (n - 1); j++) {
    end = &Int(2 * m, j);
    for (i = j + 1; i < n; i++, k++) {
      x = &Int(0, j); y = &Int(0, i);
      count[0] = 0; count[1] = 0; count[2] = 0; size = 0;
      for (; x < end; x += 2, y += 2) {
        b1 = (x[0] == y[0]);
        b2 = (x[1] == y[1]);
        exclude = (x[0] == 0) & (x[1] == 0) & b1 & b2;
        if (!exclude) {
          count[(int)b1 + (int)b2]++;
          size++;
          }
        }
      M = 1 / (double)size;
      z(k, 0) = (double)count[0] * M;
      z(k, 1) = (double)count[1] * M;
      z(k, 2) = (double)count[2] * M;
      }
    }
  return z;
  }')

## re-define `fun7` with a new logical argument `exclude00`
fun7 <- function (S, exclude00) {
  ## separate rows using Rcpp; `Int` is an integer matrix
  n <- ncol(S)
  Int <- getInt(S)
  m <- nrow(Int) / 2
  ## initialize the resulting matrix `z`
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  ## use Rcpp for pairwise summary
  if (exclude00) pairwise_exclude00(z, Int)
  else pairwise(z, Int)
  }

使用问题中的示例S:

fun7(S, TRUE)
#            0         1         2
#A&B 0.3333333 0.3333333 0.3333333
#A&C 0.3333333 0.6666667 0.0000000
#A&D 0.3333333 0.6666667 0.0000000
#B&C 0.5000000 0.5000000 0.0000000
#B&D 0.3333333 0.6666667 0.0000000
#C&D 0.7500000 0.0000000 0.2500000

这篇关于(快速)成对比较其元素具有"a/b"的矩阵列.格式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-14 11:23
查看更多