Problem description
I am using ELKI to mine some geospatial data (lat, long pairs), and I am concerned about using the right data types and algorithms. In the parameterizer of my algorithm, I tried to replace the default distance function with a geographic one (LngLatDistanceFunction, as I am using x,y data), as below:
params.addParameter (DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class);
However, the results are quite surprising: it creates clusters of a single repeated point, as in the example below:
(2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN)]
Whereas if I use a non-geo distance (for instance, Manhattan):
params.addParameter (DISTANCE_FUNCTION_ID, geo.minkowski.ManhattanDistanceFunction.class);
the output is much more reasonable.
I wonder if there is something wrong with my code.
I am running the algorithm directly on the db, like this:
Clustering<Model> result = dbscan.run(db);
I then iterate over the results in a loop while constructing the convex hulls:
for (de.lmu.ifi.dbs.elki.data.Cluster<?> cl : result.getAllClusters()) {
    if (!cl.isNoise()) {
        // Collect the coordinates of every member of this cluster.
        Coordinate[] ptList = new Coordinate[cl.size()];
        int ct = 0;
        for (DBIDIter iter = cl.getIDs().iter(); iter.valid(); iter.advance()) {
            ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
            ++ct;
        }
        GeoPolygon poly = getBoundaryFromCoordinates(ptList);
        // Compare strings with equals(); == only compares object references.
        if ("Polygon".equals(poly.getCoordinates().getGeometryType())) {
            out.write(poly.coordinates.toText() + "\n");
        }
    }
}
To map each ID to a point, I use a hash map that I initialize when reading the database. I am including this code because I suspect that I may be doing something wrong with the structures that I pass to, or read from, the algorithm. Thank you in advance for any comments that could help me solve this. I find ELKI a very efficient and sophisticated library, but I have trouble finding examples that illustrate simple cases like mine.
Recommended answer
What is your epsilon value?
Geographic distance is in meters in ELKI (if I recall correctly); Manhattan distance would be in degrees of latitude plus degrees of longitude. These two live on very different scales, so you need to choose a different epsilon value for each.
In your previous questions, you used epsilon=0.008. For geodetic distance, 0.008 meters = 8 millimeters.
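To make the scale difference concrete, here is a minimal sketch in plain Java that compares the two units for the same pair of points. It does not use ELKI: the class name, the helper methods, and the spherical Earth radius are assumptions made for illustration, and ELKI's own geodetic distance may use a different Earth model.

```java
public class EpsilonScale {
    // Mean Earth radius in meters; a spherical approximation assumed for this sketch.
    static final double EARTH_RADIUS_M = 6371008.8;

    // Great-circle (haversine) distance in meters between two (lng, lat) points in degrees.
    static double haversineMeters(double lng1, double lat1, double lng2, double lat2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // Manhattan distance computed directly on the coordinates, so the result is in degrees.
    static double manhattanDegrees(double lng1, double lat1, double lng2, double lat2) {
        return Math.abs(lng2 - lng1) + Math.abs(lat2 - lat1);
    }

    public static void main(String[] args) {
        // The point from the question, and a copy shifted by 0.008 degrees of latitude.
        double lng = 2.17199922, lat = 41.38190043;
        double shiftedLat = lat + 0.008;

        System.out.printf("Manhattan: %f degrees%n",
                manhattanDegrees(lng, lat, lng, shiftedLat));
        System.out.printf("Haversine: %f meters%n",
                haversineMeters(lng, lat, lng, shiftedLat));
    }
}
```

Under these assumptions, a shift of 0.008 degrees of latitude comes out at roughly 890 meters, so the same numeric epsilon describes very different neighborhoods depending on which distance function is in use.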
At epsilon = 8 millimeters, I am not surprised that the clusters you get consist only of duplicated coordinates. Is there any chance that the coordinates above really do occur multiple times in your data set?
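One way to check for repeated coordinates before clustering is to count exact (lng, lat) pairs. The following is a rough sketch in plain Java; the class name, the method, and the array-of-pairs input layout are hypothetical and would need adapting to however you load your data.

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateCoordinates {
    // Counts how often each exact (lng, lat) pair occurs.
    // Keys are built from the two values, so only bit-identical coordinates match.
    static Map<String, Integer> countDuplicates(double[][] points) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            counts.merge(p[0] + "," + p[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input: the point from the question three times, plus one other point.
        double[][] points = {
            {2.17199922, 41.38190043},
            {2.17199922, 41.38190043},
            {2.17199922, 41.38190043},
            {2.17300000, 41.38200000},
        };
        // Report only the coordinates that occur more than once.
        for (Map.Entry<String, Integer> e : countDuplicates(points).entrySet()) {
            if (e.getValue() > 1) {
                System.out.println(e.getKey() + " occurs " + e.getValue() + " times");
            }
        }
    }
}
```

If many points turn out to be exact duplicates, clusters made of one repeated coordinate are the expected outcome at a millimeter-scale epsilon.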