Problem description
I have two data frames: one has 0.8 million rows of X and Y coordinates, the other has 70,000 rows of X and Y coordinates. I am looking for the logic and the R code to associate each point in the first data frame with its closest point in the second. Is there a standard package for this?
I am currently running a nested for loop, but it is very slow: it iterates 0.8 million × 70,000 times (about 5.6 × 10^10 distance computations), which is very time consuming.
Recommended answer
I found a faster way to get the expected result using the data.table library:
library(data.table)
time0 <- Sys.time()
Here is some random data:
df1 <- data.table(x = runif(8e5), y = runif(8e5))
df2 <- data.table(x = runif(7e4), y = runif(7e4))
Assuming (x, y) are coordinates in an orthonormal coordinate system, you can compute the squared distance as follows:
dist <- function(a, b){
  # squared distance from (a, b) to every point of df2
  dt <- data.table((df2$x-a)^2+(df2$y-b)^2)
  # row index in df2 of the closest point
  return(which.min(dt$V1))
}
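Squared distances are enough here because the square root is monotonic, so the point that minimises the squared distance also minimises the distance itself. A minimal check on a tiny, made-up set of points (the names `pts` and `query` are illustrative and do not appear in the original answer):

```r
# Three candidate points and one query point (illustrative values only)
pts   <- data.frame(x = c(0, 3, 1), y = c(0, 4, 1))
query <- c(x = 1, y = 2)

# Squared Euclidean distance from the query to every candidate
sq_dist <- (pts$x - query[["x"]])^2 + (pts$y - query[["y"]])^2

# The nearest candidate is the same with or without the square root
nearest_sq   <- which.min(sq_dist)
nearest_sqrt <- which.min(sqrt(sq_dist))
```

Skipping `sqrt()` saves one vectorised operation per query point, which adds up over 0.8 million calls.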
You can now apply this function to your data to get the expected result:
results <- df1[, j = list(Closest = dist(x, y)), by = 1:nrow(df1)]
time1 <- Sys.time()
print(time1 - time0)
It took me around 30 minutes to get the result on a slow computer.
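Part of that run time comes from calling the distance function once per row. One possible variation, sketched here in base R and not timed or benchmarked in the original answer, is to build the squared-distance matrix for a chunk of query points at a time. The helper name `nearest_index` and the `chunk_size` parameter are hypothetical, and the sketch assumes the coordinates contain no missing values:

```r
# Hypothetical helper: for every row of `query`, return the row index of the
# closest point in `ref`. `query` is processed in chunks so that the
# intermediate distance matrix stays small.
nearest_index <- function(query, ref, chunk_size = 200L) {
  n <- nrow(query)
  if (n == 0L) return(integer(0))
  out <- integer(n)
  for (start in seq(1L, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, n)
    # Rows = query points of this chunk, columns = all reference points
    d2 <- outer(query$x[idx], ref$x, "-")^2 +
          outer(query$y[idx], ref$y, "-")^2
    # Column of the smallest squared distance in each row
    out[idx] <- apply(d2, 1L, which.min)
  }
  out
}

# Small random example (sizes chosen only to keep the run short)
set.seed(1)
q <- data.frame(x = runif(2000), y = runif(2000))
r <- data.frame(x = runif(500),  y = runif(500))
closest <- nearest_index(q, r)
```

Each chunk allocates matrices of `chunk_size * nrow(ref)` doubles, so `chunk_size` should be kept small when the reference set has 70,000 rows. Whether this beats the per-row version depends on the machine and would need to be measured.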
As requested, I also tried several other solutions, using sapply or adply from the plyr package. I tested these solutions on smaller data frames so that the benchmark runs faster.
library(data.table)
library(plyr)
library(microbenchmark)
########################
## Test 1: data.table ##
########################
dt1 <- data.table(x = runif(1e4), y = runif(1e4))
dt2 <- data.table(x = runif(5e3), y = runif(5e3))
dist1 <- function(a, b){
dt <- data.table((dt2$x-a)^2+(dt2$y-b)^2)
return(which.min(dt$V1))}
results1 <- function() return(dt1[, j = list(Closest = dist1(x, y)), by = 1:nrow(dt1)])
###################
## Test 2: adply ##
###################
df1 <- data.frame(x = runif(1e4), y = runif(1e4))
df2 <- data.frame(x = runif(5e3), y = runif(5e3))
dist2 <- function(df){
dt <- data.table((df2$x-df$x)^2+(df2$y-df$y)^2)
return(which.min(dt$V1))}
results2 <- function() return(adply(.data = df1, .margins = 1, .fun = dist2))
####################
## Test 3: sapply ##
####################
df1 <- data.frame(x = runif(1e4), y = runif(1e4))
df2 <- data.frame(x = runif(5e3), y = runif(5e3))
dist2 <- function(df){
dt <- data.table((df2$x-df$x)^2+(df2$y-df$y)^2)
return(which.min(dt$V1))}
results3 <- function() return(sapply(1:nrow(df1), function(x) return(dist2(df1[x,]))))
microbenchmark(results1(), results2(), results3(), times = 20)
#Unit: seconds
#       expr      min       lq     mean   median       uq      max neval
# results1() 4.046063 4.117177 4.401397 4.218234 4.538186 5.724824    20
# results2() 5.503518 5.679844 5.992497 5.886135 6.041192 7.283477    20
# results3() 4.718865 4.883286 5.131345 4.949300 5.231807 6.262914    20
The first solution is the fastest of the three, with a median of about 4.2 s against 4.9 s for sapply and 5.9 s for adply. This is even more true for a larger dataset.
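On the question of a standard package: all three variants above still compare every point with every candidate. A k-d tree search avoids that, and the RANN package exposes one through `nn2()`. The sketch below is not part of the original answer and was not benchmarked there; it assumes RANN is installed, and the names `pts1`, `pts2`, `brute` and `agree` are illustrative:

```r
# Assumes the RANN package is available: install.packages("RANN")
set.seed(42)
pts1 <- data.frame(x = runif(1000), y = runif(1000))  # points to match
pts2 <- data.frame(x = runif(200),  y = runif(200))   # candidate neighbours

# Brute-force reference, same logic as the distance function in the answer
brute <- vapply(seq_len(nrow(pts1)), function(i) {
  which.min((pts2$x - pts1$x[i])^2 + (pts2$y - pts1$y[i])^2)
}, integer(1))

if (requireNamespace("RANN", quietly = TRUE)) {
  # nn2() builds a k-d tree on `data` and looks up each row of `query`
  nn <- RANN::nn2(data = as.matrix(pts2), query = as.matrix(pts1), k = 1)
  closest <- as.integer(nn$nn.idx[, 1])  # row index into pts2
  agree <- mean(closest == brute)        # share of rows where both methods agree
}
```

`nn$nn.dists` holds the matching Euclidean distances, so the result can be joined back onto the first data frame together with the index.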