JOIN EACH and GROUP EACH BY clauses cannot be used on the output of window functions

Problem description





How would you overcome the above restriction?

I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.

When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK. But when running on a larger data set, I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."

In many other cases, switching to GROUP EACH BY does the job. However, as I use window functions, I cannot use EACH.

Any suggestions? Best practices?

Here is a sample query based on the Wikipedia sample data. It shows how often consecutive edits of a title are made by the same contributor or by different contributors. The WHERE condition is just there to limit the response size: if you remove the "B" we get results, and if we add it we get the "use EACH" recommendation.

    SELECT
      title,
      COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
      COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
      COUNT(*) AS total
    FROM (
      SELECT title, contributor_id,
        LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
      FROM [publicdata:samples.wikipedia]
      WHERE REGEXP_MATCH(title, r'^[A,B]') = true
    )
    GROUP BY title
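
To make the intent of the query concrete, here is a minimal Python sketch of the same logic run in memory on a few made-up rows. It is only an illustration and not BigQuery code: the function name `lead_counts` and the sample rows are assumptions introduced here, and SQL NULL is approximated with `None`.

```python
from collections import defaultdict


def lead_counts(rows):
    """Count same/different consecutive contributors per title.

    rows: iterable of (title, contributor_id, timestamp) tuples (hypothetical data).
    """
    # PARTITION BY title
    by_title = defaultdict(list)
    for title, contributor_id, timestamp in rows:
        by_title[title].append((timestamp, contributor_id))

    result = {}
    for title, edits in by_title.items():
        edits.sort(key=lambda e: e[0])  # ORDER BY timestamp
        contributors = [c for _, c in edits]
        # LEAD(contributor_id): the next row's value, None for the last row
        leads = contributors[1:] + [None]
        same = different = 0
        for current, nxt in zip(contributors, leads):
            if current is None or nxt is None:
                continue  # a comparison with NULL counts as neither = nor <>
            if current == nxt:
                same += 1
            else:
                different += 1
        result[title] = {"different": different, "same": same, "total": len(edits)}
    return result


# Made-up sample rows: (title, contributor_id, timestamp)
rows = [
    ("Apple", 1, 10),
    ("Apple", 1, 20),
    ("Apple", 2, 30),
    ("Banana", 3, 5),
    ("Banana", 4, 6),
]
counts = lead_counts(rows)
print(counts)
```

Note that the last edit of each title has no following row, so it contributes to `total` but to neither `same` nor `different`.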
    

Thanks

Solution

I guess your particular use case is different from the sample query, but let me comment on what I can see:

• You found a way to combine GROUP EACH and OVER: surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, that is not where this query's problem lies.
• Let's forget about GROUP and GROUP EACH and look at the core query:

      SELECT title, contributor_id, LEAD(contributor_id)
          OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
      FROM [publicdata:samples.wikipedia]
      WHERE REGEXP_MATCH(title, r'^[A,B]')
      

• This query fails with r'^[A,B]' and works with r'^[A]', which highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits on one machine, because it is not parallelizable. As the result for r'^[A]' shows, one machine can handle a lot of data, though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.

• The workaround I would apply here is exactly what you are doing: do the OVER() with just a fraction of the data.
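
Why splitting the data is safe here can be sketched in Python. Because the window is PARTITION BY title, any split that keeps all rows of a title in the same shard (for example by first letter, as the regex filter does) should give the same LEAD values as a single pass. The helper names `lead_per_title` and `run_in_shards`, the first-letter shard key, and the sample rows are all assumptions made for illustration; this emulates the window function in memory and is not BigQuery code.

```python
from collections import defaultdict


def lead_per_title(rows):
    """Emulate LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp).

    rows: iterable of (title, contributor_id, timestamp) tuples (hypothetical data).
    Returns (title, timestamp, contributor_id, lead_contributor) tuples,
    sorted by title and timestamp so results are easy to compare.
    """
    by_title = defaultdict(list)
    for title, contributor_id, timestamp in rows:
        by_title[title].append((timestamp, contributor_id))

    out = []
    for title, edits in by_title.items():
        edits.sort(key=lambda e: e[0])  # ORDER BY timestamp
        for i, (timestamp, contributor_id) in enumerate(edits):
            # The last row of a partition has no following row, so LEAD is None
            lead = edits[i + 1][1] if i + 1 < len(edits) else None
            out.append((title, timestamp, contributor_id, lead))
    return sorted(out, key=lambda r: (r[0], r[1]))


def run_in_shards(rows, shard_key):
    """Apply lead_per_title to each shard separately and merge the results."""
    shards = defaultdict(list)
    for row in rows:
        # The shard depends only on the title, so a title never spans shards
        shards[shard_key(row[0])].append(row)

    combined = []
    for shard_rows in shards.values():
        combined.extend(lead_per_title(shard_rows))
    return sorted(combined, key=lambda r: (r[0], r[1]))


# Made-up sample rows: (title, contributor_id, timestamp)
rows = [
    ("Apple", 1, 10),
    ("Banana", 3, 5),
    ("Apple", 2, 30),
    ("Apple", 1, 20),
    ("Banana", 4, 6),
]
whole = lead_per_title(rows)
sharded = run_in_shards(rows, shard_key=lambda title: title[0])
print(whole == sharded)
```

The key design point is that the shard key is a function of the partition column only. A split on anything else (say, timestamp ranges) could separate a row from its successor and change the LEAD values at shard boundaries.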

(Btw, let me say I love the sample query... it's an interesting question with an interesting answer!)


10-24 20:24