问题描述
我有一个IP地址表和一个IP地址范围表(起始ip,结束ip),我想将它们合并在一起.我已经可以通过以下查询来完成这项工作:
I have a table of IP addresses and a table of IP address ranges (start ip, end ip) that I'd like to join together. I've been able to make this work with the following query:
SELECT * FROM `ips` i
JOIN `ranges` a
ON NET.SAFE_IP_FROM_STRING(i.ip)
BETWEEN NET.SAFE_IP_FROM_STRING(a.start_ip)
AND NET.SAFE_IP_FROM_STRING(a.end_ip)
我遇到的问题是它的缩放比例非常差.要完成10个IP,大约需要8秒,而100个IP需要30秒,而1000个需要几分钟.我希望能够进行数千万行的操作. (我曾尝试将NET.SAFE_IP_FROM_STRING
的输出写入ranges表,但这只会使速度提高10%左右,并且对缩放没有帮助).
The problem I'm having is that it scales really badly. To do it for 10 IPs takes around 8 seconds, 100 takes 30 seconds, and 1000 takes a few minutes. I'd like to be able to do this for tens of millions of rows. (I have tried writing the output of NET.SAFE_IP_FROM_STRING
to the ranges table, but it only speeds things up by around 10%, and doesn't help with scaling).
范围不重叠,因此对于输入表中的每一行,我希望输出表中的行为0或1. LATERAL JOIN
可以让我做到这一点,并且几乎可以肯定,它可以加快速度,但是我不认为BigQuery支持它们.还有其他方法可以使此查询更快且可扩展吗?
The ranges don't overlap, so for every row in the input table I expect 0 or 1 rows in the output table. A LATERAL JOIN
would let me do that and almost certainly speed things up, but I don't think BigQuery supports them. Is there any other way to make this query faster and scalable?
推荐答案
在 https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html (已在Felipe的答案中链接到了我)结合在一起的东西非常快,而且扩展性非常好.正如Felipe提到的那样,诀窍是对前缀(我使用/16)进行直接连接,然后使用之间进行过滤.我正在对范围进行预处理,以将大于/16的任何内容拆分为多个块.然后,我使用此查询覆盖该表,该查询将添加一些其他字段:
After reviewing the article at https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html that was linked to in Felipe's answer I was able to put something together that is incredibly fast and scales really well. As Felipe alluded to, the trick is to do a direct join on a prefix (I went with /16), and then filter with a between. I'm pre-processing the ranges to split anything larger than a /16 into multiple blocks. I then overwrite the table with this query, which adds some additional fields:
SELECT *,
NET.SAFE_IP_FROM_STRING(start_ip) AS start_b,
NET.SAFE_IP_FROM_STRING(end_ip) AS end_b,
NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(start_ip), 16) as prefix
然后,联接查询如下所示:
The join query then looks something like this:
SELECT * FROM `ips` i
JOIN `ranges` a
ON a.prefix = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 16)
WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_b AND a.end_b
现在,在计费等级1上,将1000万个IP添加到100万个范围只需不到30秒!
Joining 10 million IPs to 1 million ranges now takes less than 30 seconds at billing tier 1!
这篇关于在BigQuery中有效加入IP范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!