Problem description
I want to calculate the average geographical distance between the houses in each province.
Suppose I have the following data.
df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
                  house = c(1, 2, 3, 4, 5, 6),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7),
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))
Using the geosphere library I can find the distance between two houses. For instance:
library(geosphere)
distm(c(df1$lon[1], df1$lat[1]), c(df1$lon[2], df1$lat[2]), fun = distHaversine)
#11429.1
How do I calculate the distance between all the houses in each province and gather the mean distance per province?
The original data set has millions of observations per province, so performance is an issue here, too.
Recommended answer
My initial idea was to look at the source code of distHaversine and replicate it in a function that I would use with proxy. That would work like this (note that lon is expected to be the first column):
library(geosphere)
library(dplyr)
library(proxy)
df1 <- data.frame(province = as.integer(c(1, 1, 1, 2, 2, 2)),
                  house = as.integer(c(1, 2, 3, 4, 5, 6)),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7),
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))
custom_haversine <- function(x, y) {
  toRad <- pi / 180
  diff <- (y - x) * toRad
  dLon <- diff[1L]
  dLat <- diff[2L]
  a <- sin(dLat / 2) ^ 2 + cos(x[2L] * toRad) * cos(y[2L] * toRad) * sin(dLon / 2) ^ 2
  a <- min(a, 1)
  # return
  2 * atan2(sqrt(a), sqrt(1 - a)) * 6378137
}
pr_DB$set_entry(FUN=custom_haversine, names="haversine", loop=TRUE, distance=TRUE)
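# group_modify() returns a grouped tibble; since dplyr 0.8.1, group_map() returns a list instead
average_dist <- df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_modify(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="haversine"))))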
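For reference, the quantity that custom_haversine computes for two points can be written out as below. The symbols are shorthand introduced here, not taken from geosphere: λ1, λ2 are the longitudes and φ1, φ2 the latitudes, both converted to radians, and R is the radius hard-coded in the function.

```latex
\begin{aligned}
a &= \min\!\left(1,\ \sin^2\!\frac{\varphi_2 - \varphi_1}{2}
     + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right),
     \qquad R = 6378137\ \text{m}
\end{aligned}
```

The min with 1 mirrors the a <- min(a, 1) line, which guards against rounding pushing a slightly above 1 before the square root of 1 - a is taken.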
However, if you're expecting millions of rows per province, proxy probably won't be able to allocate the intermediate (lower triangular of the) matrices. So I ported the code to C++ and added multi-threading as a bonus:
EDIT: it turns out the s2d helper was far from optimal; this version now uses the formulas given here.
EDIT2: I just found out about RcppThread, and it can be used to detect user interrupts.
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel,RcppThread)]]
#include <algorithm> // max
#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <vector>
#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace std;
using namespace Rcpp;
using namespace RcppParallel;
// single to double indices for lower triangular of matrices without diagonal
void s2d(const size_t id, const size_t nrow, size_t& i, size_t& j) {
    j = nrow - 2 - static_cast<size_t>(sqrt(-8 * id + 4 * nrow * (nrow - 1) - 7) / 2 - 0.5);
    i = id + j + 1 - nrow * (nrow - 1) / 2 + (nrow - j) * ((nrow - j) - 1) / 2;
}
class HaversineCalculator : public Worker
{
public:
    // n is the total number of pairs; it is a double because the count
    // exceeds the range of int once there are more than 65,536 rows
    HaversineCalculator(const NumericVector& lon,
                        const NumericVector& lat,
                        double& avg,
                        const double n)
        : lon_(lon)
        , lat_(lat)
        , avg_(avg)
        , n_(n)
        , cos_lat_(lon.length())
    {
        // terms for distance calculation
        for (size_t i = 0; i < cos_lat_.size(); i++) {
            cos_lat_[i] = cos(lat_[i] * 3.1415926535897 / 180);
        }
    }
    void operator()(size_t begin, size_t end) {
        // for Kahan summation
        double sum = 0;
        double c = 0;
        double to_rad = 3.1415926535897 / 180;
        size_t i, j;
        for (size_t ind = begin; ind < end; ind++) {
            // check for a user interrupt every 1e5 pairs
            if (RcppThread::isInterrupted(ind % static_cast<int>(1e5) == 0)) return;
            s2d(ind, lon_.length(), i, j);
            // haversine distance
            double d_lon = (lon_[j] - lon_[i]) * to_rad;
            double d_lat = (lat_[j] - lat_[i]) * to_rad;
            double d_hav = pow(sin(d_lat / 2), 2) + cos_lat_[i] * cos_lat_[j] * pow(sin(d_lon / 2), 2);
            if (d_hav > 1) d_hav = 1;
            d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;
            // the average part
            d_hav /= n_;
            // Kahan sum step
            double y = d_hav - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        // add this chunk's partial average to the shared result
        mutex_.lock();
        avg_ += sum;
        mutex_.unlock();
    }
private:
    const RVector<double> lon_;
    const RVector<double> lat_;
    double& avg_;
    const double n_;
    tthread::mutex mutex_;
    vector<double> cos_lat_;
};
// [[Rcpp::export]]
double avg_haversine(const DataFrame& input, const int nthreads) {
    NumericVector lon = input["lon"];
    NumericVector lat = input["lat"];
    double avg = 0;
    // number of pairs, kept in size_t: 1e5 rows already give 4,999,950,000 pairs,
    // which does not fit in an int
    const size_t nrow = lon.length();
    const size_t size = nrow * (nrow - 1) / 2;
    HaversineCalculator hc(lon, lat, avg, static_cast<double>(size));
    // grain size of at least 1, also for small inputs
    const size_t grain = std::max<size_t>(1, size / nthreads / 10);
    RcppParallel::parallelFor(0, size, hc, grain);
    RcppThread::checkUserInterrupt();
    return avg;
}
This code won't allocate any intermediate matrix; it simply calculates the distance for each pair of what would be the lower triangle and accumulates the values for an average at the end. See here for the Kahan summation part.
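To make the two ideas in that paragraph concrete, here is a minimal single-threaded sketch in plain C++ (no Rcpp), assuming the same haversine formula and radius as above. The names haversine_m, pair_from_index and mean_pairwise_haversine are hypothetical helpers invented for this illustration; the index formula is the same kind of closed form that s2d uses, written here with signed 64-bit integers.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Haversine distance in metres between two (lon, lat) points given in degrees.
// The radius 6378137 m is the one used in the code above.
double haversine_m(double lon1, double lat1, double lon2, double lat2) {
    const double to_rad = std::acos(-1.0) / 180.0;
    const double d_lon = (lon2 - lon1) * to_rad;
    const double d_lat = (lat2 - lat1) * to_rad;
    double a = std::pow(std::sin(d_lat / 2), 2) +
               std::cos(lat1 * to_rad) * std::cos(lat2 * to_rad) *
                   std::pow(std::sin(d_lon / 2), 2);
    if (a > 1) a = 1;
    return 2 * std::atan2(std::sqrt(a), std::sqrt(1 - a)) * 6378137.0;
}

// Map a linear index k in [0, n*(n-1)/2) to the pair (row, col) with row < col,
// walking the triangle without the diagonal row by row. No matrix is stored.
std::pair<std::int64_t, std::int64_t> pair_from_index(std::int64_t k, std::int64_t n) {
    const double disc = static_cast<double>(4 * n * (n - 1) - 8 * k - 7);
    const std::int64_t row =
        n - 2 - static_cast<std::int64_t>(std::floor(std::sqrt(disc) / 2.0 - 0.5));
    const std::int64_t col =
        k + row + 1 - n * (n - 1) / 2 + (n - row) * (n - row - 1) / 2;
    return {row, col};
}

// Mean pairwise distance; assumes lon and lat have the same length.
// Each term is divided by the pair count first and then added with
// Kahan (compensated) summation, so rounding errors do not pile up.
double mean_pairwise_haversine(const std::vector<double>& lon,
                               const std::vector<double>& lat) {
    const std::int64_t n = static_cast<std::int64_t>(lon.size());
    const std::int64_t n_pairs = n * (n - 1) / 2;
    if (n_pairs == 0) return 0.0;
    double sum = 0.0;
    double c = 0.0; // running compensation for lost low-order bits
    for (std::int64_t k = 0; k < n_pairs; ++k) {
        const auto p = pair_from_index(k, n);
        const double term =
            haversine_m(lon[p.first], lat[p.first], lon[p.second], lat[p.second]) /
            static_cast<double>(n_pairs);
        const double y = term - c;
        const double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```

Called with the three province 1 houses from the question, mean_pairwise_haversine should agree with a plain double loop over the three pairs. The multi-threaded version differs mainly in that each worker runs this loop over its own slice of k.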
If you save that code in, say, haversine.cpp, then you can do the following:
library(dplyr)
library(Rcpp)
library(RcppParallel)
library(RcppThread)
sourceCpp("haversine.cpp")
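# group_modify() returns a grouped tibble; since dplyr 0.8.1, group_map() returns a list instead
df1 %>%
  group_by(province) %>%
  group_modify(~ data.frame(avg=avg_haversine(.x, parallel::detectCores())))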
# A tibble: 2 x 2
# Groups:   province [2]
  province     avg
     <int>   <dbl>
1        1  15379.
2        2 793612.
Here's a sanity check too:
pr_DB$set_entry(FUN=geosphere::distHaversine, names="distHaversine", loop=TRUE, distance=TRUE)
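# group_modify() returns a grouped tibble; since dplyr 0.8.1, group_map() returns a list instead
df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_modify(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="distHaversine"))))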
A word of caution, though:
df <- data.frame(lon=runif(1e3, -90, 90), lat=runif(1e3, -90, 90))
system.time(proxy::dist(df, method="distHaversine"))
   user  system elapsed
 34.353   0.005  34.394
system.time(proxy::dist(df, method="haversine"))
   user  system elapsed
  0.789   0.020   0.809
system.time(avg_haversine(df, 4L))
   user  system elapsed
  0.054   0.000   0.014
df <- data.frame(lon=runif(1e5, -90, 90), lat=runif(1e5, -90, 90))
system.time(avg_haversine(df, 4L))
   user  system elapsed
 73.861   0.238  19.670
You'll probably have to wait quite a while if you have millions of rows (see EDIT2 above about detecting a user interrupt during a long run).
Depending on your actual data and how many cores your computer has, you may very well end up waiting days for the calculation to finish. This problem has quadratic complexity (per province, so to speak). This expression from avg_haversine:
lon.length() * (lon.length() - 1) / 2
gives the number of (haversine) distance calculations that must be performed. For N rows that is roughly N^2 / 2 calculations, so if the number of rows increases by a factor of k, the number of calculations increases by a factor of roughly k^2.
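To put numbers on that growth, here is a small standalone C++ sketch. pair_count and exceeds_int32 are hypothetical helper names used only for this illustration; the arithmetic is the n * (n - 1) / 2 count from the answer.

```cpp
#include <cstdint>
#include <limits>

// Number of unordered pairs among n rows: n * (n - 1) / 2.
std::uint64_t pair_count(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// True if the pair count for n rows no longer fits in a 32-bit signed int.
bool exceeds_int32(std::uint64_t n) {
    return pair_count(n) >
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

With these, 1,000 rows give 499,500 pairs, 2,000 rows give 1,999,000 (about four times as many), and 100,000 rows give 4,999,950,000. That last figure is larger than a 32-bit int can hold, so the pair count is best kept in a 64-bit type once a province has more than 65,536 rows.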
There is no way to optimize this: you can't calculate the exact average of N numbers without actually computing each number first, and you'll have a hard time finding something faster than multi-threaded C++ code. So you'll either have to wait it out or throw more cores at the problem, either with a single machine or with many machines working together. Otherwise you can't solve this problem.
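If the work is spread over several cores or machines, one simple scheme is to cut the range of pair indices into contiguous chunks and give each worker one chunk. The sketch below shows only that splitting step; split_range is a hypothetical helper, and it assumes every worker has access to the full lon and lat vectors of the province.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Split [0, total) into `parts` contiguous half-open ranges [begin, end)
// whose lengths differ by at most one.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
split_range(std::uint64_t total, std::uint64_t parts) {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> out;
    if (parts == 0) return out;
    const std::uint64_t base = total / parts;
    const std::uint64_t extra = total % parts; // first `extra` chunks get one more
    std::uint64_t begin = 0;
    for (std::uint64_t p = 0; p < parts; ++p) {
        const std::uint64_t len = base + (p < extra ? 1 : 0);
        out.emplace_back(begin, begin + len);
        begin += len;
    }
    return out;
}
```

Because each distance is divided by the total number of pairs before it is accumulated, the partial results of the chunks can simply be added together to obtain the province average.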