Problem description
I am trying to calculate geodesic distance from a dataframe which consists of four columns of latitude and longitude data with around 3 million rows. I used the apply lambda method to do it but it took 18 minutes to finish the task. Is there a way to use Vectorization with NumPy arrays to speed up the calculation? Thank you for answering.
My code using apply and lambda method:
from geopy import distance
df['geo_dist'] = df.apply(lambda x: distance.distance(
    (x['start_latitude'], x['start_longitude']),
    (x['end_latitude'], x['end_longitude'])).miles, axis=1)
Update:
I am trying this code but it gives me the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Appreciate if anyone can help.
df['geo_dist'] = distance.distance(
    (df['start_latitude'].values, df['start_longitude'].values),
    (df['end_latitude'].values, df['end_longitude'].values)).miles
Recommended answer
I think you might consider using geopandas for this. It's an extension of pandas (and therefore numpy) designed to do these types of calculations very quickly.
Specifically, it has a method for calculating the distance between sets of points in a GeoSeries, which can be a column of a GeoDataFrame. I'm fairly certain that this method is vectorized internally, so it avoids the per-row Python overhead of apply.
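It should look something like this, where you convert your data frame to a GeoDataFrame with (at least) two GeoSeries columns that you can use for the origin and destination points. This should return a pandas Series of distances. Note that the result is a planar distance in the units of the CRS (degrees for EPSG:4326), not a geodesic distance in miles: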
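import geopandas as gpd

# df is the data frame from the question, with start/end latitude and longitude columns
origin = gpd.points_from_xy(df.start_longitude, df.start_latitude, crs="EPSG:4326")
destination = gpd.points_from_xy(df.end_longitude, df.end_latitude, crs="EPSG:4326")
# The origin points become the active geometry column
gdf = gpd.GeoDataFrame(df, geometry=origin)
gdf["destination_geometry"] = gpd.GeoSeries(destination, index=df.index)
# Element-wise planar distance in CRS units (degrees here), not geodesic miles
distances = gdf.geometry.distance(gdf.destination_geometry)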
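If an approximation is acceptable, the distance can also be computed directly on NumPy arrays, which is what the question asks about. The sketch below uses the haversine formula. This is not what geopy does: it treats the Earth as a sphere, so it can differ from geopy's ellipsoidal geodesic by up to roughly 0.5%. The function name haversine_miles and the radius constant are assumptions made for this illustration.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical model)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between arrays of points given in degrees."""
    # Convert all inputs from degrees to radians in one pass
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term, clipped to [0, 1] to guard against rounding error
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    a = np.clip(a, 0.0, 1.0)
    return 2.0 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Small demonstration on two point pairs
start_lat = np.array([0.0, 10.0])
start_lon = np.array([0.0, 20.0])
end_lat = np.array([0.0, 10.0])
end_lon = np.array([90.0, 20.0])
print(haversine_miles(start_lat, start_lon, end_lat, end_lon))
```

With the data frame from the question, the call would be haversine_miles(df['start_latitude'].values, df['start_longitude'].values, df['end_latitude'].values, df['end_longitude'].values), assigned to df['geo_dist']. Because every step is an array operation, all rows are processed in one pass instead of one Python call per row.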