本文介绍了在Google BigQuery中做一个交叉枢纽的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经问过有关在X轴上进行多级聚合查询的先前问题:

我能够在第二个查询中构建Y查询-

 使用AS(SELECT family_id来自`patents-public-data.patents.publications`GROUP BY family_id按计数排序(1)DESC极限2),BAS(选择family_id,优先日期,COUNT(1)priority_date_count来自`patents-public-data.patents.publications`WHERE family_id IN(从A中选择family_id)GROUP BY family_id,priority_date),C AS(选择family_id,优先日期,priority_date_count,DENSE_RANK()OVER(PARTITION BY family_id ORDER BY priority_date_count DESC)AS priority_date_rank从B)选择family_id,优先日期,priority_date_count来自C在哪里priority_date_rank< = 2 

但是,我不太确定如何在单个查询或两个查询中将它们合并在一起.

解决方案

下面是BigQuery标准SQL的演示,仅是该方法的演示,并不假装100%表示请求的逻辑

 与A_X AS(从`patents-public-data.patents.publications`中选择country_codeGROUP BY country_code ORDER BY COUNT(1)DESC LIMIT 2),B_X AS(SELECT country_code,application_kind,COUNT(1)application_kind_count从`patents-public-data.patents.publications`中的country_code输入(从A_X中选择country_code)GROUP BY国家/地区代码,application_kind),C_X AS(SELECT country_code,application_kind,application_kind_count,DENSE_RANK()OVER(按国家(地区)代码排序或按application_kind_count DESC排序)AS application_kind_rank来自B_X),X AS(选择国家/地区代码,application_kind,application_kind_count从C_X WHERE application_kind_rank< = 2),A_Y AS(从`patents-public-data.patents.publications`中选择family_id加入X使用(country_code,application_kind)GROUP BY family_id按计数排序(1)上限2),B_Y AS(SELECT family_id,priority_date,COUNT(1)priority_date_count从`patents-public-data.patents.publications`中的family_id输入(从A_Y选择SELECT family_id)GROUP BY family_id,priority_date),C_Y AS(SELECT family_id,priority_date,priority_date_count,DENSE_RANK()OVER(PARTITION BY family_id ORDER BY priority_date_count DESC)AS pos_date来自B_Y),Y AS(SELECT family_id,priority_date,pos_date,DENSE_RANK()OVER(ORDER BY family_id)pos_family从C_Y WHERE pos_date< = 2)SELECT country_code,application_kind,COUNTIF(pos_family = 1 AND pos_date = 1)`family1_date1`,COUNTIF(pos_family = 1 AND pos_date = 2)`family1_date2`,COUNTIF(pos_family = 2 AND pos_date = 1)`family2_date1`,COUNTIF(pos_family = 2 AND pos_date = 2)`family2_date2`来自`patents-public-data.patents.publications`加入您的使用情况(family_id,priority_date)WHERE country_code IN(从X中选择country_code)AND application_kind IN(从x中选择application_kind)GROUP BY国家/地区代码,application_kind 

结果是

很明显,由于相交逻辑,上面有零个数字

I have asked a previous question about doing a multi-level aggregation query on the X-axis here: Get the top patent countries, codes in a BQ public dataset.

Here is how the query (copied from the accepted answer works) to get:

Top 2 Countries by Count, and within those countries, top 2 Codes by Count

WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2  

And I get something like:

country_code     application_kind   count
JP               A                  125
JP               U                  124
CN               A                  118
CN               U                  101

Now I would like to add the following pivot on the y-axis: to get the following:

  • X: Top 2 Countries by Count, and within those countries, top 2 Codes by Count
  • Y: Top 2 family_id by Count, Top 2 priority_date by Count

The final results would then look like:

I am able to build the Y-query in a second query --

WITH A AS (
  SELECT family_id
  FROM `patents-public-data.patents.publications`
  GROUP BY family_id
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    family_id,
    priority_date,
    COUNT(1) priority_date_count
  FROM `patents-public-data.patents.publications`
  WHERE family_id IN (SELECT family_id FROM A)
  GROUP BY family_id, priority_date
), C AS (
  SELECT
    family_id,
    priority_date,
    priority_date_count,
    DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS priority_date_rank
  FROM B
)
SELECT
  family_id,
  priority_date,
  priority_date_count
FROM C
WHERE priority_date_rank <= 2 

However, I am not quite sure how to merge them together, in a single query or in two.

解决方案

Below is for BigQuery Standard SQL and is just demo of the approach and not pretending to be 100% representing requested logic

WITH A_X AS (
  SELECT country_code FROM `patents-public-data.patents.publications`
  GROUP BY country_code ORDER BY COUNT(1) DESC LIMIT 2
), B_X AS (
  SELECT country_code, application_kind, COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications` WHERE country_code IN (SELECT country_code FROM A_X)
  GROUP BY country_code, application_kind
), C_X AS (
  SELECT country_code, application_kind, application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B_X
), X AS (
  SELECT country_code, application_kind, application_kind_count
  FROM C_X WHERE application_kind_rank <= 2  
), A_Y AS (
  SELECT family_id FROM `patents-public-data.patents.publications` 
  JOIN X USING(country_code, application_kind)
  GROUP BY family_id 
  ORDER BY COUNT(1) DESC LIMIT 2
), B_Y AS (
  SELECT family_id, priority_date, COUNT(1) priority_date_count
  FROM `patents-public-data.patents.publications` WHERE family_id IN (SELECT family_id FROM A_Y) 
  GROUP BY family_id, priority_date
), C_Y AS (
  SELECT family_id, priority_date, priority_date_count,
    DENSE_RANK() OVER(PARTITION BY family_id ORDER BY priority_date_count DESC) AS pos_date
  FROM B_Y
), Y AS (
  SELECT family_id, priority_date, pos_date, DENSE_RANK() OVER(ORDER BY family_id) pos_family
  FROM C_Y WHERE pos_date <= 2 
)
SELECT country_code, application_kind,
  COUNTIF(pos_family = 1 AND pos_date = 1) `family1_date1`,
  COUNTIF(pos_family = 1 AND pos_date = 2) `family1_date2`,
  COUNTIF(pos_family = 2 AND pos_date = 1) `family2_date1`,
  COUNTIF(pos_family = 2 AND pos_date = 2) `family2_date2`
FROM `patents-public-data.patents.publications` 
JOIN Y USING(family_id, priority_date)
WHERE country_code IN (SELECT country_code FROM X)
AND application_kind IN (SELECT application_kind FROM x) 
GROUP BY country_code, application_kind 

the result is

Obviously, there are number of zeroes above because of intersection logic

这篇关于在Google BigQuery中做一个交叉枢纽的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-25 10:01