Problem description
I have two data frames: df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second-best) matching row in df1 in terms of Hamming distance.
I used the e1071 package to compute the Hamming distance. For example, the Hamming distance between two vectors x and y can be computed as follows:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
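library(e1071)  ## provides hamming.distance()
## each column of xm / ym holds the 32 bits of one integer
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(seq_len(ncol(xm)), function(i) hamming.distance(xm[, i], ym[, i])))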
The resulting distance is 25. However, I need to do this for all rows of df1 and df2. The naive approach needs a nested double loop and is terribly slow.
Any ideas on how to do this more efficiently? In the end I need to append to df2:
- a column with the row id from df1 that gives the lowest distance;
- a column with the lowest distance;
- a column with the row id from df1 that gives the second-lowest distance;
- a column with the second-lowest distance.
Thanks.
Recommended answer
Fast computation of Hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
This computes the Hamming distance between two integer vectors x and y of equal length. It uses only base R, yet it is more efficient than e1071::hamming.distance because it is vectorized: intToBits() converts the whole vector at once, so no loop over the elements is needed.
For the example x and y in your post, this gives 25. (My other answer shows what to do if we want pairwise Hamming distances.)
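To make the bit counting concrete, here is a minimal sketch on three small integers. The helper hmd_check is a hypothetical cross-check written for this illustration (it is not part of the original answer); it XORs each pair with bitwXor() and counts the set bits one integer at a time.

```r
## Vectorised Hamming distance from the answer (base R only).
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

## Hypothetical cross-check: XOR each pair, then count set bits per integer.
hmd_check <- function(x, y) {
  z <- bitwXor(as.integer(x), as.integer(y))
  sum(vapply(z, function(v) sum(as.integer(intToBits(v))), integer(1)))
}

a <- c(1L, 2L, 8L)
b <- c(3L, 2L, 7L)

## 1 XOR 3 = 2 (1 bit), 2 XOR 2 = 0 (0 bits), 8 XOR 7 = 15 (4 bits)
hmd0(a, b)       # 5
hmd_check(a, b)  # 5
```

Both functions should agree on any pair of equal-length integer vectors; hmd0 simply avoids the per-element loop.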
Fast Hamming distance between a matrix and a vector
If we want to compute the Hamming distance between a single y and multiple xs, i.e., the Hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x, y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return
    return(sum(as.logical(xor(rawx, rawy))))
  } else if (nx < ny) {
    ## pivoting: make rawx the longer one
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
  }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny  ## number of cycles
    ## rawy is recycled against rawx; sum the differing bits within each cycle
    return(unname(tapply(as.logical(xor(rawx, rawy)), rep(1:nc, each = ny), sum)))
  }
}
Note that:
- hmd performs the computation column-wise. It is designed to be CPU cache friendly, so if we want a row-wise computation, we should transpose the matrix first;
- there is no explicit loop here; instead, we use tapply().
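The column-wise recycling can be seen on a tiny matrix. The sketch below is not the answer's hmd: it is a hypothetical variant, hmd_cols, that reshapes the XOR result into one matrix column per cycle and uses colSums() instead of tapply(). It assumes x is an integer matrix whose number of rows equals length(y).

```r
## Hypothetical variant of the column-wise idea: x is a matrix whose columns
## are compared against the vector y (nrow(x) must equal length(y)).
hmd_cols <- function(x, y) {
  ## intToBits() walks x in column-major order, 32 bits per element,
  ## so every block of 32 * length(y) bits belongs to one column of x.
  diff_bits <- as.logical(xor(intToBits(x), intToBits(y)))  # y's bits are recycled
  colSums(matrix(diff_bits, nrow = 32 * length(y)))
}

x <- matrix(c(1L, 2L,   # column 1: identical to y
              3L, 2L,   # column 2: 3 XOR 1 = 2, one differing bit
              0L, 0L),  # column 3: 0 XOR 1 and 0 XOR 2, two differing bits
            nrow = 2)
y <- c(1L, 2L)

hmd_cols(x, y)  # 0 1 2
```

To compare the rows of a data frame against y, transpose first, e.g. hmd_cols(t(as.matrix(df)), y), which is the same convention hmd uses.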
Fast Hamming distance computation between two matrices / data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2 and computes the distance between every row of df1 and each row of df2. The argument p is an integer giving how many results to retain: p = 3 keeps the 3 smallest distances together with their row ids in df1.
foo <- function(df1, df2, p) {
  ## check p: we cannot keep more matches than there are reference rows in df1
  if (p > nrow(df1)) p <- nrow(df1)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  k <- 1:p
  for (i in 1:n) {
    distance <- hmd(xt, yt[, i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    k <- k + p
  }
  ## recode "id" and "d" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  list(id = id, d = d)
}
Note that:
- the transposition is done at the beginning, for the reason given earlier;
- a for loop is used here. This is actually efficient, because a considerable amount of computation is done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (the row ids id and the distances d).
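The selection step inside the loop, order(distance)[1:p], can be checked in isolation. The sketch below uses the distance vector shown in the hmd test of this answer; the variable names are chosen for illustration only.

```r
## Distances from one row of df2 to the five rows of df1
## (the values printed by the hmd test in this answer).
distance <- c(2, 4, 6, 2, 4)
p <- 2

## order() returns the indices that sort the vector; ties keep their
## original order, so the first row with distance 2 comes first.
minp <- order(distance)[seq_len(p)]

minp            # 1 4  (row ids in df1)
distance[minp]  # 2 2  (the p smallest distances)
```

Because order() sorts the whole vector, this costs O(n log n) per row of df2; for small p relative to nrow(df1) a partial sort could be cheaper, but that is an optimization beyond what the answer does.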
Experiment
This part uses a small dataset to test and demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
#   min.1 min.2
# 1     1     4
# 2     2     3
# 3     5     2
#
# $d
#   mindist.1 mindist.2
# 1         2         2
# 2         1         3
# 3         1         3
If you want to append some columns to df2, you know what to do, right?
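One possible way to do the appending, sketched under assumptions: res below is a hand-built stand-in with the same shape as the list returned by foo (its numbers are copied from the printed output above), and the df2 values are placeholders rather than the ones produced by set.seed(0).

```r
## Placeholder df2 with 3 rows (values are illustrative only).
df2 <- data.frame(V1 = c(1L, 2L, 3L), V2 = c(4L, 5L, 6L))

## Stand-in for the result of foo(df1, df2, p = 2): a list of two data frames.
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

## Row i of res$id and res$d corresponds to row i of df2, so a plain cbind works.
out <- cbind(df2, res$id, res$d)
names(out)  # "V1" "V2" "min.1" "min.2" "mindist.1" "mindist.2"
```

With the real functions this would read out <- cbind(df2, res$id, res$d) after res <- foo(df1, df2, p = 2), giving the four columns requested in the question.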