Problem description
I'm trying to calculate the weighted Euclidean distance (squared) between two data frames that have the same number of columns (variables) and different numbers of rows (observations).
The formula is:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each term of the sum by a specific weight (lambda).
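Written as a formula, the quantity computed by the line above is the following. The symbols are introduced here only for illustration: $x_m$ is row $m$ of DATA1, $y_i$ is row $i$ of DATA2, $p$ is the number of columns, and $\lambda_k$ is the weight of column $k$.

```latex
\mathrm{DIST}_{m,i} \;=\; \sum_{k=1}^{p} \lambda_k \,\bigl(x_{m,k} - y_{i,k}\bigr)^2
```

With a scalar weight of 1, every $\lambda_k = 1$ and this reduces to the ordinary squared Euclidean distance.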
The code provided below runs correctly, but when I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Profiling with library(profvis) and profvis({ my code }) showed that this specific part of the code takes up about 80% of the processing time.
I have read a lot about how to reduce processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case because of the weight lambda.
Can someone help me reduce the processing time of this code?
More information about the code and the structure of the data can be found in the comments of the code below.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
  lambda <- as.numeric(unlist(lambda))  # assign the result; the bare call had no effect
  lambda <- t(as.matrix(lambda))        # 1 x 50 row matrix, indexed below as lambda[1, ]
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2)
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) {
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
    }
    if (length(lambda) > 1) {
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
  }
}
After all the suggestions, I combined the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1). The final solution took only one minute to run, whereas the original took an hour and a half, as part of a larger program that contains this calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1) {
  lambda <- as.numeric(unlist(lambda))  # assign the result; the bare call had no effect
  lambda <- t(as.matrix(lambda))
  library(Rcpp)
  cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
    int n_x = x.nrow();
    int n_y = y.nrow();
    NumericMatrix DIST(n_x, n_y);
    // loop over every pair of rows (i from x, j from y)
    for (int i = 0; i < n_x; i++) {
      for (int j = 0; j < n_y; j++) {
        double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
        DIST(i, j) = d;
      }
    }
    return DIST;
  }')
  DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)
}
if (length(lambda) == 1) {
  DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
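The scalar-weight branch above relies on expanding the square: sum((x - y)^2) = sum(x^2) + sum(y^2) - 2 * sum(x * y). The same expansion also holds when each term carries a weight, so the weighted case can be written in base R without a loop. The following is only a sketch under the question's assumptions (numeric matrices with the same number of columns, one weight per column or a single scalar); the helper name weighted_sq_dist is hypothetical and not part of any package.

```r
# Same shapes as in the question: 600 x 50 and 8 x 50, with 50 weights.
set.seed(1)
DATA1  <- matrix(rexp(30000, rate = 0.1), ncol = 50)
DATA2  <- matrix(rexp(400,   rate = 0.1), ncol = 50)
lambda <- runif(n = 50, min = 0, max = 10)   # plain numeric vector, not a 1 x 50 matrix

# Hypothetical helper: weighted squared Euclidean distance between all row pairs.
weighted_sq_dist <- function(x, y, w) {
  w  <- rep_len(as.numeric(w), ncol(x))  # a scalar weight is recycled to every column
  xx <- drop(x^2 %*% w)                  # sum_k w[k] * x[m, k]^2, one value per row of x
  yy <- drop(y^2 %*% w)                  # sum_k w[k] * y[i, k]^2, one value per row of y
  xy <- x %*% (w * t(y))                 # sum_k w[k] * x[m, k] * y[i, k]
  outer(xx, yy, `+`) - 2 * xy
}

DIST <- weighted_sq_dist(DATA1, DATA2, lambda)

# Reference: the double loop from the question, for comparison only.
ref <- matrix(NA_real_, nrow(DATA1), nrow(DATA2))
for (m in seq_len(nrow(DATA1))) {
  for (i in seq_len(nrow(DATA2))) {
    ref[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2 * lambda)
  }
}
all.equal(DIST, ref)
```

One caveat of the expanded form: because it subtracts large, nearly equal numbers, rows that are almost identical can yield tiny negative values from rounding, so clamping with pmax(DIST, 0) may be needed before taking square roots.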
Recommended answer
Here is an alternative using Rcpp, just to have this approach documented. In a file called euclidean.cpp I have:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
  int n_x = x.nrow();
  int n_y = y.nrow();
  NumericMatrix out(n_x, n_y);
  // loop over every pair of rows (i from x, j from y)
  for (int i = 0; i < n_x; i++) {
    for (int j = 0; j < n_y; j++) {
      double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
      out(i, j) = d;
    }
  }
  return out;
}
In R, I have:
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
library(microbenchmark)
# Assumes nrows1, nrows2 and the 1 x 50 matrix form of lambda from the question's code.
microbenchmark(
  Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
  other = {
    DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)  # was ncol = ncols, which is undefined
    for (m in 1:nrows1) {
      for (i in 1:nrows2) {
        if (length(lambda) == 1) {
          DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
        }
        if (length(lambda) > 1) {
          DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
        }
      }
    }
  },
  times = 100
)
You can see that the Rcpp version is considerably faster (median of about 0.56 ms versus 37.5 ms, roughly 67 times):
Unit: microseconds
     expr       min        lq       mean    median         uq        max neval
 Rcpp_way   446.769   492.308   656.9849   562.667   846.9745   1169.231   100
    other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898   100