Computing pairwise Hamming distances between all rows of two integer matrices/data frames

Problem description

I have two data frames: df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of Hamming distance.

I used the e1071 package to compute the Hamming distance. For example, the Hamming distance between two vectors x and y can be computed as:

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)

y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)

library(e1071)  # provides hamming.distance()

# one column of 32 bits per integer
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)

distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))

The resulting distance is 25. However, I need to do this for all rows of df1 and df2. The trivial approach is a nested double loop, which looks terribly slow.

Any ideas on how to do this more efficiently? In the end I need to append to df2:

a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the second lowest distance;
a column with the second lowest distance.

Thanks.

Recommended answer

Fast Hamming distance between two integer vectors of equal length

As I said in my comment, we can do:

hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))

to compute the Hamming distance between two integer vectors x and y of equal length. This uses only base R, yet it is more efficient than e1071::hamming.distance because it is vectorized!

For the example x and y in your post, this gives 25. (My other answer shows what to do if we want pairwise Hamming distances.)
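
As a quick sanity check of the XOR-and-count idea, here is a minimal sketch on tiny inputs whose bit patterns are easy to verify by hand. The input values are made up for illustration; hmd0 is repeated from above so the snippet runs on its own.

```r
# hmd0 as defined above, repeated so this snippet is self-contained
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# 5 = 101 and 3 = 011 in binary: they differ in two bit positions
hmd0(5L, 3L)
# [1] 2

# vectors are compared element by element and the bit differences are summed:
# 1 vs 0 differs in one bit, 7 vs 7 in none
hmd0(c(1L, 7L), c(0L, 7L))
# [1] 1
```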

Fast Hamming distance between a matrix and a vector

If we want to compute the Hamming distance between a single y and multiple xs, i.e., between a vector and a matrix, we can use the following function.

hmd <- function(x,y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return
    return (sum(as.logical(xor(rawx,rawy))))
    } else if (nx < ny) {
    ## pivoting
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
    }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny  ## number of cycles
    return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
    }
  }

Note:

hmd performs the computation column-wise, which is designed to be CPU cache friendly. So if we want a row-wise computation, we should transpose the matrix first;
there is no explicit loop here; instead, we use tapply().
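
To make the column-wise layout concrete, the following sketch computes the same kind of per-column distances with colSums() instead of tapply(). It is an alternative formulation for illustration only, not the hmd function itself, and the toy matrix x and vector y are made up.

```r
# Hypothetical toy inputs: each COLUMN of x is one record to compare with y.
x <- matrix(c(1L, 2L, 3L, 4L, 0L, 0L), nrow = 2)  # columns: (1,2), (3,4), (0,0)
y <- c(1L, 2L)

# intToBits() flattens x column by column, 32 bits per integer;
# xor() recycles the shorter bit vector of y across the columns of x.
bits <- as.logical(xor(intToBits(x), intToBits(y)))

# Reshape so that each column holds the differing bits for one column of x,
# then count the TRUEs per column.
diff_bits <- matrix(bits, nrow = 32L * length(y))
col_dist <- colSums(diff_bits)
col_dist
# [1] 0 3 2
```

If the records are stored as rows instead, transposing first (t(x)) puts them into this column layout.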

Fast Hamming distance between two matrices/data frames

This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between every row of df1 and each row of df2. The argument p is an integer giving the number of results to retain: p = 3 keeps the 3 smallest distances together with their row ids in df1.

foo <- function(df1, df2, p) {
  ## check p: at most nrow(df1) reference rows can be ranked
  if (p > nrow(df1)) p <- nrow(df1)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  k <- 1:p
  for (i in 1:n) {
    distance <- hmd(xt, yt[,i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    k <- k + p
    }
  ## recode "id" and "d" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  list(id = id, d = d)
  }

Note:

the transposition is done at the beginning, for the reason given earlier;
a for loop is used here. This is actually efficient, because a considerable amount of computation is done in each iteration. It is also more elegant than the *apply family, since we ask for multiple outputs (row id id and distance d).

Experiment

This part uses a small dataset to test and demonstrate our functions.

Some toy data:

set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols

Test hmd first (needs transposition):

hmd(t(as.matrix(df1)), df2[1, ])  ## df1 & first row of df2
# [1] 2 4 6 2 4

Test foo:

foo(df1, df2, p = 2)

# $id
#   min.1 min.2
# 1    1    4
# 2    2    3
# 3    5    2

# $d
#   mindist.1 mindist.2
# 1         2         2
# 2         1         3
# 3         1         3
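
One way to sanity check results like these is to compare them against a brute-force double loop on a case small enough to verify by hand. The sketch below is such a baseline; pair_dist, m1 and m2 are hypothetical names and values chosen for illustration, not part of the answer's code.

```r
# Hamming distance between two equal-length integer vectors (same idea as hmd0)
pair_dist <- function(a, b) sum(as.logical(xor(intToBits(a), intToBits(b))))

m1 <- matrix(c(1L, 2L,
               3L, 4L), nrow = 2, byrow = TRUE)  # reference rows
m2 <- matrix(c(1L, 2L,
               0L, 0L), nrow = 2, byrow = TRUE)  # new rows

# D[i, j] = distance between row i of m2 and row j of m1
D <- matrix(0L, nrow = nrow(m2), ncol = nrow(m1))
for (i in seq_len(nrow(m2))) {
  for (j in seq_len(nrow(m1))) {
    D[i, j] <- pair_dist(m2[i, ], m1[j, ])
  }
}
D
#      [,1] [,2]
# [1,]    0    3
# [2,]    2    3

# row id in m1 of the closest match for each row of m2
apply(D, 1, which.min)
# [1] 1 1
```

This is slow on large inputs, which is exactly why it is only useful as a reference for small tests.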

If you want to append some columns to df2, you know what to do, right?
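
A minimal sketch of that last step, assuming the result has the list shape returned by foo (two data frames, id and d). The values in df2_demo and res are placeholders invented for illustration, and the new column names are arbitrary.

```r
# placeholder stand-in for df2 (3 rows); values are made up
df2_demo <- data.frame(V1 = c(10L, 20L, 30L), V2 = c(40L, 50L, 60L))

# placeholder stand-in for the list returned by foo(df1, df2, p = 2)
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

# append best / second-best row ids and distances in the order asked for;
# columns are picked by position, so the exact names in res do not matter
out <- cbind(df2_demo,
             best_id     = res$id[[1]],
             best_dist   = res$d[[1]],
             second_id   = res$id[[2]],
             second_dist = res$d[[2]])
names(out)
```

The result keeps the original columns of df2_demo and adds the four requested columns after them.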
