Problem description
I have loaded my application logs into BigQuery and I need to determine the country for each IP address in those logs.
I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.
An ideal query would be an OUTER JOIN with a range filter; however, BQ supports only = in join conditions. So the query does an INNER JOIN and handles missing values on each side of the JOIN.
I have amended my original query so it could run on the Wikipedia public data set.
Can someone please help me make this run faster?
SELECT id, client_ip, client_ip_code, B.Country_Name AS Country_Name
FROM
  (SELECT id, contributor_ip AS client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
   FROM [publicdata:samples.wikipedia] LIMIT 1000) AS A1
JOIN
  (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
   FROM
     -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
     -- All ranges of valid IPs:
     (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])
     -- Missing ranges below From_IP
     ,(SELECT
         PriorRangeEndIP + 1 From_IP_Code,
         From_IP_Code - 1 AS To_IP_Code,
         'NA' AS Country_Name
       FROM
         -- Use the LAG function to find the prior valid range
         (SELECT
            From_IP_Code,
            To_IP_Code, Country_Name,
            LAG(To_IP_Code, 1, INTEGER(0))
              OVER(ORDER BY From_IP_Code ASC) PriorRangeEndIP
          FROM [QA_DATASET.GeoIP]) A
       -- If the gap from the prior valid range is > 1, then it is a gap to fill
       WHERE From_IP_Code > PriorRangeEndIP + 1)
     -- Missing ranges above the max To_IP
     ,(SELECT MAX(To_IP_Code) + 1 AS From_IP_Code, INTEGER(4311810304) AS To_IP_Code, 'NA' AS Country_Name
       FROM [QA_DATASET.GeoIP])
  ) AS B
ON A1.ONE = B.ONE -- fake join condition to work around joins allowing only =
WHERE
  -- Join condition where a valid IP exists on the left
  A1.client_ip_code >= B.From_IP_Code
  AND A1.client_ip_code <= B.To_IP_Code
  OR (A1.client_ip_code IS NULL
      AND B.From_IP_Code = 1) -- where there is no valid IP on the left (contributor_ip)
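For readers who find the three-way union hard to follow, here is a minimal Python sketch of what the right-hand side of the join builds: the valid ranges, an 'NA' range for every hole between them, and one 'NA' range after the last valid range. The function name `fill_gaps` and the sample ranges are hypothetical; only the 'NA' label, the LAG default of 0 and the upper bound 4311810304 come from the query.

```python
# Sketch of the gap-filling logic, not the query itself.
# Each range is a tuple (from_ip_code, to_ip_code, country_name).

MAX_IP_CODE = 4311810304  # upper bound copied from the query


def fill_gaps(valid, upper=MAX_IP_CODE):
    """Return the valid ranges plus 'NA' ranges for every uncovered interval."""
    result = []
    prior_end = 0  # same default the query passes to LAG(To_IP_Code, 1, INTEGER(0))
    for start, end, country in sorted(valid):
        # A gap exists when this range starts more than 1 past the prior range.
        if start > prior_end + 1:
            result.append((prior_end + 1, start - 1, "NA"))
        result.append((start, end, country))
        prior_end = end
    # Gap at the end of the set: from MAX(To_IP_Code) + 1 up to the upper bound.
    max_end = max((end for _, end, _ in valid), default=0)
    result.append((max_end + 1, upper, "NA"))
    return result


# Hypothetical sample data: the country codes are placeholders.
example = [(10, 20, "AA"), (21, 30, "BB"), (50, 60, "CC")]
for row in fill_gaps(example):
    print(row)
```

In this sample, ranges 10-20 and 21-30 are adjacent, so no filler is produced between them, while 31-49 becomes an 'NA' range.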
Cleaned up version of this answer at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html
Let me tidy the original query:
SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    1 AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    1000
  ) AS A1
LEFT JOIN (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    1 AS One
  FROM
    -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- missing ranges below From_IP
    FROM (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        LAG(To_IP_Code, 1, INTEGER(0)) OVER(
          ORDER BY From_IP_Code ASC) PriorRangeEndIP -- use the LAG function to find the prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP]) A
    WHERE
      From_IP_Code>PriorRangeEndIP+1) -- if the gap from the prior valid range is > 1, then it is a gap to fill
    ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER(4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- missing ranges above the max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
  ) AS B
ON A1.ONE=B.ONE -- fake JOIN condition to work around joins allowing only =
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition where a valid IP exists on the left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1) -- where there is no valid IP on the left (contributor_ip)
That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?
Some tricks I found:
- Skip NULLs. If there is no IP address in a log, don't try to match it.
- Reduce the combinations. Instead of JOINing every left-side record with every right-side record, join only the 39.x.x.x records on the left side with the 39.x.x.x records on the right side. Only a few (3 or 4) rules cover multiple ranges, so it would be easy to add a couple of rules to the geolite table to cover these gaps.
So I'm changing:
- Changing 1 AS One to INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One (twice: once on each side of the JOIN, using From_IP_Code on the right side).
- Adding a 'WHERE contributor_ip IS NOT NULL'.
And now it runs in 3 seconds! 5% of the IPs could not be geolocated, probably because of the gaps described above (easy fix).
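To make the bucketing trick concrete, here is a small Python sketch of the same idea outside BigQuery: the join key is the first octet of the address (the IP code divided by 256*256*256), so each IP is compared only against ranges whose start falls in the same bucket. The helper names and the sample ranges and country codes are hypothetical; the standard-library `ipaddress` module stands in for PARSE_IP.

```python
import ipaddress
from collections import defaultdict

BUCKET_SIZE = 256 * 256 * 256  # same divisor as in the query


def ip_code(ip):
    """Integer form of a dotted IPv4 string (stand-in for PARSE_IP)."""
    return int(ipaddress.IPv4Address(ip))


def lookup_by_bucket(ips, ranges):
    """Match each IP only against ranges whose From code shares its first octet."""
    by_bucket = defaultdict(list)
    for start, end, country in ranges:
        # Like the query, the right side is keyed on From_IP_Code only.
        by_bucket[start // BUCKET_SIZE].append((start, end, country))
    found = {}
    for ip in ips:
        code = ip_code(ip)
        found[ip] = None  # None means "could not be geolocated"
        for start, end, country in by_bucket.get(code // BUCKET_SIZE, []):
            if start <= code <= end:
                found[ip] = country
                break
    return found


# Hypothetical ranges: "BB" spans two buckets (40.x.x.x and 41.x.x.x).
ranges = [
    (ip_code("39.0.0.0"), ip_code("39.255.255.255"), "AA"),
    (ip_code("40.0.0.0"), ip_code("41.255.255.255"), "BB"),
]
print(lookup_by_bucket(["39.1.2.3", "40.5.5.5", "41.0.0.1"], ranges))
```

In this sample, 41.0.0.1 lies inside the "BB" range but is not matched, because that range is keyed only under bucket 40. This is the kind of miss that ranges covering more than one first octet produce.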
Now, how about going from LIMIT 1000 to LIMIT 300000? How long will it take?
37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right-side table into a static one: once computed it doesn't change at all, since it's just an expansion of the basic rules. Then you can use JOIN EACH.
SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    300000
  ) AS A1
JOIN (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    INTEGER(From_IP_Code/(256*256*256)) AS One
  FROM
    -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- missing ranges below From_IP
    FROM (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        LAG(To_IP_Code, 1, INTEGER(0)) OVER(
          ORDER BY From_IP_Code ASC) PriorRangeEndIP -- use the LAG function to find the prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP]) A
    WHERE
      From_IP_Code>PriorRangeEndIP+1) -- if the gap from the prior valid range is > 1, then it is a gap to fill
    ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER(4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- missing ranges above the max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
  ) AS B
ON A1.ONE=B.ONE -- JOIN on the first octet, since joins allow only =
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition where a valid IP exists on the left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1) -- where there is no valid IP on the left (contributor_ip)
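One possible way to prepare such a static right-side table, sketched in Python rather than BigQuery SQL: split every range that spans several first-octet buckets into one row per bucket, so that an equality join on the bucket no longer misses IPs in the later buckets. This is only my reading of "adding a couple of rules" to cover multi-bucket ranges, not code from the answer; the function name `split_by_bucket` and the sample data are hypothetical.

```python
BUCKET_SIZE = 256 * 256 * 256  # same divisor as in the query


def split_by_bucket(ranges):
    """Expand (from, to, country) ranges into rows that each stay in one bucket.

    Each output row is (from_ip_code, to_ip_code, country_name, bucket).
    """
    rows = []
    for start, end, country in ranges:
        first = start // BUCKET_SIZE
        last = end // BUCKET_SIZE
        for b in range(first, last + 1):
            lo = max(start, b * BUCKET_SIZE)            # clip to the bucket start
            hi = min(end, (b + 1) * BUCKET_SIZE - 1)    # clip to the bucket end
            rows.append((lo, hi, country, b))
    return rows


# Hypothetical range covering 40.0.0.0 through 41.255.255.255.
sample = [(40 * BUCKET_SIZE, 42 * BUCKET_SIZE - 1, "BB")]
for row in split_by_bucket(sample):
    print(row)
```

The expanded rows could then be loaded once as a table and joined on the bucket column, since the expansion only changes when the GeoIP data itself is refreshed.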