两个表之间最近点的唯一分配

本文介绍了两个表之间最近点的唯一分配的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在安装了 PostGis 2.2.0 的 Postgres 9.5 数据库中，我有两张包含几何数据(点)的表，我想将一张表中的点分配给来自另一张表的点，但我不希望 buildings.gid 被分配两次.一旦分配了一个 buildings.gid，就不应将其分配给另一个 pvanlagen.buildid.

表定义

建筑物:

创建表 public.buildings (gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),osm_id 字符变化(11)，名称字符变化(48)，键入字符变化(16)，几何几何(MultiPolygon，4326)，质心几何(点，4326)，gembez 字符变化(50)，宝石名称字符变化(50)，krsbez 字符变化(50)，krsname 字符变化(50)，pv 布尔值，gr数字，电容数字，安装日期，pvid 数字，分布数字，CONSTRAINT building_pkey PRIMARY KEY (gid));创建索引 build_centroid_gix在公共建筑上使用要点(st_transform(质心，31467))；创建索引 building_geom_idx在公共建筑上使用要点(几何)；

pvanlagen:

创建表 public.pvanlagen (gid integer NOT NULL DEFAULT nextval('pv_bis2010_bayern_wgs84_gid_seq'::regclass),tso 字符变化(254)，tso_number 数字(10,0)，system_ope 字符变化(254)，system_key 字符变化(254)，位置字符变化(254)，postal_cod 数字(10,0)，街道特征变化(254)，容量数字，电压字符变化(254)，energy_sou 字符变化(254)，开始日期，end_operat 字符变化(254)，id 数字(10,0)，kkz 数字(10,0)，几何几何(点，4326)，gembez 字符变化(50)，宝石名称字符变化(50)，krsbez 字符变化(50)，krsname 字符变化(50)，buildid 数字，分布数字，反式布尔值，约束 pv_bis2010_bayern_wgs84_pkey 主键(gid)，约束 pvanlagen_buildid_fkey 外键(buildid)参考 public.buildings (gid) MATCH SIMPLE更新时不采取行动 删除时不采取行动，约束 pvanlagen_buildid_uni 唯一(buildid));创建索引 pv_bis2010_bayern_wgs84_geom_idx在 public.pvanlagen使用要点(几何)；

查询

我的想法是在 buildings 表中添加一个 boolean 列 pv，当 buildings.gid 已分配:

更新 pvanlagenSET buildid=buildings.gid, dist='50'从建筑物哪里 buildid 为空AND building.pv 为 NULL和 pvanlagen.gemname=buildings.gemnameAND ST_Distance(ST_Transform(pvanlagen.geom,31467),ST_Transform(buildings.centroid,31467))

我在 buildings 中测试了 50 行，但申请所有这些花费的时间太长.我有 3.200.000 个建筑物和 260.000 个 PV.

应分配最近建筑物的gid.如果在平局的情况下，分配哪个 gid 无关紧要.如果我们需要制定规则，我们可以取gid较低的建筑物.

50 米是作为一个限制.我使用 ST_Distance() 因为它返回最小距离，应该在 50 米以内.后来我多次提出，直到每个PV Anlage都被分配.

建筑物和 PV 被分配到它们各自的区域 (gemname).这应该会使分配更便宜，因为我知道最近的建筑物必须在同一区域内(gemname).

我在以下反馈后尝试了此查询:

更新 pvanlagen p1SET buildid = building.gid, dist = 建筑物.dist从 (选择 DISTINCT ON (b.gid)p.id、b.gid、b.dist::numeric从  (选择 ID，ST_Transform(几何，31467)来自 pvanlagenWHERE buildid IS NULL - 尚未分配) p, 横向 (选择 b.gid, ST_Distance(ST_Transform(p1.geom, 31467), ST_Transform(b.centroid, 31467)) AS dist从建筑物 b左加入 pvanlagen p1 ON p1.buildid = b.gid其中 p1.buildid 为空AND b.gemname = p1.gemnameORDER BY ST_Transform(p1.geom, 31467) <->ST_Transform(b.centroid, 31467)限制 1) bORDER BY b.gid, b.dist, p.id -- 决胜局) x, 建筑物其中 p1.id = x.id;

但它返回 0 行在 234 毫秒的执行时间内受到影响. 我哪里错了?

解决方案

`表架构`

要执行您的规则，只需声明 pvanlagen.buildid

`简单的解决方案`

一个简单的解决方案是:

更新 pvanlagen p1SET buildid = sub.b_gid, dist = sub.dist -- 实际距离从  (选择 DISTINCT ON (b_gid)pv_gid, b_gid, dist从  (选择 gid 作为 pv_gid, ST_Transform(geom, 31467) 作为 geom31467来自 pvanlagenWHERE buildid IS NULL - 尚未分配) p, 横向 (选择 b.gid 作为 b_gid，圆形(ST_Distance(p.geom31467, ST_Transform(b.centroid, 31467))::numeric, 2) AS dist -- 见下文从建筑物 bLEFT JOIN pvanlagen p1 ON p1.buildid = b.gid - 也没有分配......WHERE p1.buildid 为 NULL -- ... 然而-- AND p.gemname = b.gemname -- 性能不需要，见下文由 p.geom31467 订购 <->ST_Transform(b.centroid, 31467)限制 1) bORDER BY b_gid, dist, pv_gid -- 决胜局) 子其中 p1.gid = sub.pv_gid;

我使用 DISTINCT ON (b_gid) 将每个建筑物精确地减少到一个行，选择距离最短的 PV.详情:

选择每个 GROUP BY 组中的第一行?

对于任何离多个 PV 最近的建筑物，仅分配最近的 PV.如果两个同样接近，则 PK 列 gid(别名 pv_gid)用作决胜局.在这种情况下，一些 PV 会从更新中删除并保持未分配.重复查询，直到分配所有 PV.

不过，这仍然是一个简单的算法.看我上面的图表，这将建筑物 4 分配给 PV 4，将建筑物 5 分配给 PV 5，而整体而言，4-5 和 5-4 可能是一个更好的解决方案......

`Aside:dist 列的类型`

当前您使用 numeric 为它.您的原始查询分配了一个常量 integer，在 numeric 中没有意义.

在我的新查询中 ST_Distance() 返回实际以米为单位的距离双精度.如果我们简单地分配，我们会在 numeric 数据类型中得到 15 个左右的小数位，并且该数字并不是 that 确切的开头.我严重怀疑你想浪费存储空间.

我宁愿从计算中保存原始的双精度.或者，更好，根据需要进行舍入.如果米数足够精确，只需转换并保存一个 integer (自动将数字四舍五入).或先乘以 100 以节省 cm:

(ST_Distance(...) * 100)::int

In my Postgres 9.5 database with PostGis 2.2.0 installed, I have two tables with geometric data (points) and I want to assign points from one table to the points from the other table, but I don't want a buildings.gid to be assigned twice. As soon as one buildings.gid is assigned, it should not be assigned to another pvanlagen.buildid.

`Table definitions`

buildings:

CREATE TABLE public.buildings (
  gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
  osm_id character varying(11),
  name character varying(48),
  type character varying(16),
  geom geometry(MultiPolygon,4326),
  centroid geometry(Point,4326),
  gembez character varying(50),
  gemname character varying(50),
  krsbez character varying(50),
  krsname character varying(50),
  pv boolean,
  gr numeric,
  capac numeric,
  instdate date,
  pvid numeric,
  dist numeric,
  CONSTRAINT buildings_pkey PRIMARY KEY (gid)
);

CREATE INDEX build_centroid_gix
  ON public.buildings
  USING gist
  (st_transform(centroid, 31467));

CREATE INDEX buildings_geom_idx
  ON public.buildings
  USING gist
  (geom);

pvanlagen:

CREATE TABLE public.pvanlagen (
  gid integer NOT NULL DEFAULT nextval('pv_bis2010_bayern_wgs84_gid_seq'::regclass),
  tso character varying(254),
  tso_number numeric(10,0),
  system_ope character varying(254),
  system_key character varying(254),
  location character varying(254),
  postal_cod numeric(10,0),
  street character varying(254),
  capacity numeric,
  voltage_le character varying(254),
  energy_sou character varying(254),
  beginning_ date,
  end_operat character varying(254),
  id numeric(10,0),
  kkz numeric(10,0),
  geom geometry(Point,4326),
  gembez character varying(50),
  gemname character varying(50),
  krsbez character varying(50),
  krsname character varying(50),
  buildid numeric,
  dist numeric,
  trans boolean,
  CONSTRAINT pv_bis2010_bayern_wgs84_pkey PRIMARY KEY (gid),
  CONSTRAINT pvanlagen_buildid_fkey FOREIGN KEY (buildid)
      REFERENCES public.buildings (gid) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid)
);

CREATE INDEX pv_bis2010_bayern_wgs84_geom_idx
  ON public.pvanlagen
  USING gist
  (geom);

`Query`

My idea was to add a boolean column pv in the buildings table, which is set when a buildings.gid was assigned:

UPDATE pvanlagen
SET buildid=buildings.gid, dist='50'
FROM buildings
WHERE buildid IS NULL
AND buildings.pv is NULL
AND pvanlagen.gemname=buildings.gemname
AND ST_Distance(ST_Transform(pvanlagen.geom,31467)
               ,ST_Transform(buildings.centroid,31467))<50;

UPDATE buildings
SET pv=true
FROM pvanlagen
WHERE buildings.gid=pvanlagen.buildid;

I tested for 50 rows in buildings but it takes too long to apply for all of them. I have 3.200.000 buildings and 260.000 PV.

The gid of the closest building shall be assigned. If In case of ties, it should not matter which gid is assigned. If we need to frame a rule, we can take the building with the lower gid.

50 meters was meant to work as a limit. I used ST_Distance() because it returns the minimum distance, which should be within 50 meters. Later I raised it multiple times, until every PV Anlage was assigned.

Buildings and PV are assigned to their respective regions (gemname). This should make the assignment cheaper, since I know the nearest building must be within the same region (gemname).

I tried this query after feedback below:

UPDATE pvanlagen p1
SET    buildid = buildings.gid
 , dist = buildings.dist
FROM (
   SELECT DISTINCT ON (b.gid)
          p.id, b.gid, b.dist::numeric
   FROM  (
      SELECT id, ST_Transform(geom, 31467)
      FROM   pvanlagen
      WHERE  buildid IS NULL  -- not assigned yet
      ) p
        , LATERAL (
      SELECT b.gid, ST_Distance(ST_Transform(p1.geom, 31467), ST_Transform(b.centroid, 31467)) AS dist
      FROM   buildings      b
      LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid
      WHERE  p1.buildid IS NULL
      AND    b.gemname = p1.gemname
      ORDER  BY ST_Transform(p1.geom, 31467) <-> ST_Transform(b.centroid, 31467)
      LIMIT  1
            ) b
       ORDER  BY b.gid, b.dist, p.id  -- tie breaker
       ) x, buildings
 WHERE   p1.id = x.id;

But it returns with 0 rows affected in 234 ms execution time. Where am I going wrong?

解决方案

`Table schema`

To enforce your rule simply declare pvanlagen.buildid UNIQUE:

ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid);

building.gid is the PK, as your update revealed. To also enforce referential integrity add a FOREIGN KEY constraint to buildings.gid.

You have implemented both by now. But it would be more efficient to run the big UPDATE below before you add these constraints.

There is a lot more that should be improved in your table definition. For one, buildings.gid as well as pvanlagen.buildid should be type integer (or possibly bigint if you burn a lot of PK values). numeric is expensive nonsense.

Let's focus on the core problem:

`Basic Query to find closest building`

The case is not as simple as it may seem. It's a "nearest neighbour" problem, with the additional complication of unique assignment.

This query finds the nearest one building for each PV (short for PV Anlage - row in pvanlagen), where neither is assigned, yet:

SELECT pv_gid, b_gid, dist
FROM  (
   SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
   FROM   pvanlagen
   WHERE  buildid IS NULL  -- not assigned yet
   ) p
     , LATERAL (
   SELECT b.gid AS b_gid
        , round(ST_Distance(p.geom31467
                      , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist  -- see below
   FROM   buildings b
   LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid  -- also not assigned ...
   WHERE  p1.buildid IS NULL                       -- ... yet
   -- AND    p.gemname = b.gemname                 -- not needed for performance, see below
   ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
   LIMIT  1
   ) b;

To make this query fast, you need a spatial, functional GiST index on buildings to make it much faster:

CREATE INDEX build_centroid_gix ON buildings USING gist (ST_Transform(centroid, 31467));

Not sure why you don't

Related answers with more explanation:

Spatial query on large table with multiple self joins performing slow
How do I query all rows within a 5-mile radius of my coordinates?

Further reading:

http://workshops.boundlessgeo.com/postgis-intro/knn.html
http://www.postgresonline.com/journal/archives/306-KNN-GIST-with-a-Lateral-twist-Coming-soon-to-a-database-near-you.html

With the index in place, we don't need to restrict matches to the same gemname for performance. Only do this if it's an actual rule to enforced. If it has to be observed at all times, include the column in the FK constraint:

Restrict foreign key relationship to rows of related subtypes

`Remaining Problem`

We can use the above query it in an UPDATE statement. Each PV is only used once, but more than one PV might still find the same building to be closest. You only allow one PV per building. So how would you resolve that?

In other words, how would you assign objects here?

`Simple solution`

One simple solution would be:

UPDATE pvanlagen p1
SET    buildid = sub.b_gid
     , dist    = sub.dist  -- actual distance
FROM  (
   SELECT DISTINCT ON (b_gid)
          pv_gid, b_gid, dist
   FROM  (
      SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
      FROM   pvanlagen
      WHERE  buildid IS NULL  -- not assigned yet
      ) p
        , LATERAL (
      SELECT b.gid AS b_gid
           , round(ST_Distance(p.geom31467
                         , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist  -- see below
      FROM   buildings      b
      LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid  -- also not assigned ...
      WHERE  p1.buildid IS NULL                       -- ... yet
      -- AND    p.gemname = b.gemname                 -- not needed for performance, see below
      ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
      LIMIT  1
      ) b
   ORDER  BY b_gid, dist, pv_gid  -- tie breaker
   ) sub
WHERE   p1.gid = sub.pv_gid;

I use DISTINCT ON (b_gid) to reduce to exactly one row per building, picking the PV with the shortest distance. Details:

Select first row in each GROUP BY group?

For any building that is closest for more one PV, only the closest PV is assigned. The PK column gid (alias pv_gid) serves as tiebreaker if two are equally close. In such a case, some PV are dropped from the update and remain unassigned. Repeat the query until all PV are assigned.

This is still a simplistic algorithm, though. Looking at my diagram above, this assigns building 4 to PV 4 and building 5 to PV 5, while 4-5 and 5-4 would probably be a better solution overall ...

`Aside: type for dist column`

Currently you use numeric for it. your original query assigned a constant integer, no point making in numeric.

In my new query ST_Distance() returns the actual distance in meters as double precision. If we simply assign that we get 15 or so fractional digits in the numeric data type, and the number is not that exact to begin with. I seriously doubt you want to waste the storage.

I would rather save the original double precision from the calculation. or, better yet, round as needed. If meters are exact enough, just cast to and save an integer (rounding the number automatically). Or multiply with 100 first to save cm:

(ST_Distance(...) * 100)::int

这篇关于两个表之间最近点的唯一分配的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！