我已将应用程序日志加载到BigQuery中,并且需要根据这些日志中的IP地址来计算国家/地区。

我已经在表和从MaxMind下载的GeoIP映射表之间编写了一个联接查询。

理想的查询是带有范围过滤器的OUTER JOIN,但是BQ在连接条件中仅支持=
因此,查询执行INNER JOIN并处理JOIN每一侧的缺失值。

我已经修改了原始查询,以便可以在Wikipedia公共(public)数据集上运行。

有人可以帮我加快运行速度吗?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
    (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
    FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN
    (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
    FROM

        -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
        -- all Ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

        -- Missing rages lower from From_IP
        ,(SELECT
            PriorRangeEndIP + 1 From_IP_Code,
            From_IP_Code - 1 AS To_IP_Code,
            'NA' AS Country_Name
        FROM

            -- use of LAG function to find prior valid range
            (SELECT
                From_IP_Code,
                To_IP_Code, Country_Name,
                LAG(To_IP_Code, 1, INTEGER(0))
                OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP
            FROM [QA_DATASET.GeoIP]) A

            -- If gap from prior valid range is > 1 than its a gap to fill
            WHERE From_IP_Code > PriorRangeEndIP + 1)

        -- Missing rages higher tan Max To_IP
        ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
        FROM [QA_DATASET.GeoIP])
    ) AS B
ON A1.ONE = B.ONE    -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
    A1.client_ip_code >= B.From_IP_Code
    AND A1.client_ip_code <= B.To_IP_Code
    OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip

最佳答案

2019,答案大大改善了:

  • https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
  • #standardSQL
    # replace with your source of IP addresses
    # here I'm using the same Wikipedia set from the previous article
    WITH source_of_ip_addresses AS (
      SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
      FROM `publicdata.samples.wikipedia`
      WHERE contributor_ip IS NOT null
      GROUP BY 1
    )
    SELECT country_name, SUM(c) c
    FROM (
      SELECT ip, country_name, c
      FROM (
        SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
        FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
        WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
      )
      JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
      USING (network_bin, mask)
    )
    GROUP BY 1
    ORDER BY 2 DESC
    

    在以下位置清理了此答案的版本:
    http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

    让我整理一下原始查询:
    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        1 AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        1000
        ) AS A1
    LEFT JOIN
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        1 AS One
      FROM
        --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
        from(
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name
            ,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP])A
        WHERE
         From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
          ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER (4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages higher tan Max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
    

    这是一个漫长的查询! (和一个非常有趣的)。它会在14秒内运行。我们如何优化它?

    我发现了一些技巧:
  • 跳过NULL。如果日志中没有IP地址,请不要尝试匹配它。
  • 减少组合。与仅将左侧的39.x.x.x记录与右侧的39.x.x.x记录联接在一起,而不是将每个左侧的记录与每个右侧的记录联接在一起。只有极少数(3或4)条规则涵盖多个范围。在geolite表上添加几个规则以添加规则来弥补这些差距将很容易。

  • 所以我在改变:
  • 1 AS OneINTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One(两次)。
  • 添加“WHERE contributor_ip不为空”。

  • 现在,它可以在3秒钟内运行! 5%的ip无法定位,可能是由于所描述的差距(易于修复)。

    现在,如何从LIMIT 1000变为LIMIT300000。需要多长时间?

    37秒!比描述的25分钟要好得多。如果您想进一步提高,我建议将右侧 table 变成一个静态 table ,因为一旦计算,它根本不会改变,这只是基本规则的扩展。然后,您可以使用JOIN EACH。
    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        300000
        ) AS A1
    JOIN
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        INTEGER(From_IP_Code/(256*256*256)) AS One
      FROM
        --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
        from(
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name
            ,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP])A
        WHERE
         From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
          ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER (4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages higher tan Max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
    

    10-01 00:40
    查看更多