问题描述
我正在尝试从大距离矩阵创建 dist
对象.我在使用 stats::as.dist
时内存不足.例如,我在当前机器上有大约 128 Gb 可用,但 as.dist
在处理 73,000 x 73,000 矩阵(约 42Gb)时内存不足.鉴于最终的 dist 对象应该小于矩阵大小的一半(即它是下三角形,存储为向量)在我看来应该可以在一种更节省内存的方法 - 前提是我们小心不要创建大型中间对象,而只是将输入的相关元素直接复制到输出.
I'm trying to create dist
objects from large distance matrices. I'm running out of memory using stats::as.dist
. For example, I have around 128 Gb available on current machine but as.dist
runs out of memory when processing a 73,000 x 73,000 matrix (approx 42Gb). Given that the final dist
object should be less than half the size of the matrix (i.e. it is the lower triangle, stored as a vector) it seems to me that it should be possible to do this calculation in a more memory-efficient way - provided we are careful not to create large intermediate objects and just copy the relevant elements of the input directly to the output.
查看 getS3method('as.dist', 'default')
的代码,我看到它使用 ans <- m[row(m) >col(m)]
需要创建与输入具有相同维度的 row
和 col
矩阵.
Looking at the code for getS3method('as.dist', 'default')
, I see that it does the calculation using ans <- m[row(m) > col(m)]
which requires the creation of row
and col
matrices of the same dimensionality as the input.
我想我可以使用 此处 中的算法对此进行改进,以生成下三角形的索引.这是我使用这种方法的尝试.
I thought I might be able to improve on this using the algorithm from here to generate the indexes of the lower triangle. Here is my attempt using this method.
as.dist.new = function(dm, diag = FALSE, upper = FALSE) {
n = dim(dm)[1]
stopifnot(is.matrix(dm))
stopifnot(dim(dm)[2] == n)
k = 1:((n^2 - n)/2)
j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
idx = cbind(i,j)
remove(i,j,k)
gc()
d = dm[idx]
class(d) <- "dist"
attr(d, "Size") <- n
attr(d, "call") <- match.call()
attr(d, "Diag") <- diag
attr(d, "Upper") <- upper
d
}
这适用于较小的矩阵.这是一个简单的例子:
This works fine on smaller matrices. Here's a simple example:
N = 10
dm = matrix(runif(N*N), N, N)
diag(dm) = 0
x = as.dist(dm)
y = as.dist.new(dm)
然而,如果我们创建一个更大的距离矩阵,它会遇到与 as.dist
相同的内存问题.
However, if we create a larger distance matrix it runs into the same memory issues as as.dist
.
例如此版本在我的系统上崩溃:
E.g. this version crashes on my system:
N = 73000
dm = matrix(runif(N*N), N, N)
gc()
diag(dm) = 0
gc()
as.dist.new(dm)
有人建议如何更有效地执行此操作吗?欢迎使用 R 或 Rcpp 解决方案.NB 看着 this answer 到一个相关的问题(从 2 列位置数据生成全距离矩阵)似乎它使用 RcppArmadillo
可能可以做到这一点,但我没有使用它的经验.
Does anyone have a suggestion how to perform this operation more efficiently? R or Rcpp solutions welcome. NB looking at this answer to a related problem (generating the full distance matrix from 2-column location data) it seems that it may be possible to do this using RcppArmadillo
, but I've got no experience of using that.
推荐答案
我目前的解决方案是直接从 lat 和 lon 向量计算 dist 对象,根本不生成中间距离矩阵.在大型矩阵上,这比传统"矩阵快数百倍.geosphere::mdist()
后跟 stats::as.dist()
的管道,并且只使用存储最终 dist 对象所需的内存.
My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::mdist()
followed by stats::as.dist()
and uses only as much memory as required to store the final dist object.
以下 Rcpp 源基于使用 此处 中的函数来计算 c++ 中的正弦距离,并进行了改编@Alexis 方法在 C++ 中遍历下三角元素.
The following Rcpp source is based on using the functions from here to calculate haversine distance in c++, together with an adaptation of @Alexis method to iterate through the lower triangle elements in c++.
#include <Rcpp.h>
using namespace Rcpp;
double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
double d;
double dlat = latt - latf;
double dlon = lont - lonf;
d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
if(d > 1 && d <= tolerance){
d = 1;
}
return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}
double toRadians(double deg){
return deg * 0.01745329251; // PI / 180;
}
//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat,
Rcpp::NumericVector lon,
double tolerance = 10000000000.0) {
std::size_t nlat = lat.size();
std::size_t nlon = lon.size();
if (nlat != nlon) throw std::range_error("lat and lon different lengths");
if (nlat < 2) throw std::range_error("Need at least 2 points");
std::size_t size = nlat * (nlat - 1) / 2;
NumericVector ans(size);
std::size_t k = 0;
double latf;
double latt;
double lonf;
double lont;
for (std::size_t j = 0; j < (nlat-1); j++) {
for (std::size_t i = j + 1; i < nlat; i++) {
latf = toRadians(lat[i]);
lonf = toRadians(lon[i]);
latt = toRadians(lat[j]);
lont = toRadians(lon[j]);
ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
}
}
return ans;
}
/*** R
as_dist = function(lat, lon, tolerance = 10000000000.0) {
dd = calc_dist(lat, lon, tolerance)
attr(dd, "class") = "dist"
attr(dd, "Size") = length(lat)
attr(dd, "call") = match.call()
attr(dd, "Diag") = FALSE
attr(dd, "Upper") = FALSE
return(dd)
}
*/
这篇关于从距离矩阵创建 dist 对象的内存高效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!