Question
Any ideas on how to make this query return results on Google BigQuery? I'm getting a "resources exceeded" error. There are about 2B rows in the dataset, and I'm trying to get the artist ID that appears most often for each user_id.
SELECT user_id, artist, COUNT(*) AS count
FROM [legacy20130831.merged_data] AS d
GROUP EACH BY user_id, artist
ORDER BY user_id ASC, count DESC
Answer
An equivalent query on public data that throws the same error:
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC
Compare with the same query plus a limit on the number of results returned. This one works (14 seconds for me):
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC
LIMIT 100
Instead of using a LIMIT, you could go through a fraction of the user_ids at a time. In my case, 1/3 works:
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
WHERE ABS(HASH(actor) % 3) = 0
GROUP EACH BY actor, repository_name
But what you really want is "to get the artist ID that appears the most for each user_id". Let's go further and get that:
SELECT actor, repository_name, count FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) rank
  FROM (
    SELECT actor, repository_name, COUNT(*) AS count
    FROM [githubarchive:github.timeline] AS d
    WHERE ABS(HASH(actor) % 10) = 0
    GROUP EACH BY actor, repository_name
  )
)
WHERE rank = 1
Note that this time I used %10, as it gets me results faster. But you might be wondering: "I want to get my results with one query, not 10."
There are two things you can do for that:
- Union the partitioned shards (in BigQuery's legacy SQL, a comma between tables or subqueries in the FROM clause means a union, not a join).
- If you still exceed resources, you may need to materialize the table: run the original aggregation query and save its results to a new table, then run the RANK()/ROW_NUMBER() step over that table instead of over the in-memory groups.
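The union approach could be sketched like this in the same legacy BigQuery SQL. This is only an illustration of the idea, not a tested query: the three comma-separated subqueries are the hash shards from above, and the comma acts as a union in legacy SQL's FROM clause:

```sql
SELECT actor, repository_name, count FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) rank
  FROM
    -- shard 0
    (SELECT actor, repository_name, COUNT(*) AS count
     FROM [githubarchive:github.timeline]
     WHERE ABS(HASH(actor) % 3) = 0
     GROUP EACH BY actor, repository_name),
    -- shard 1
    (SELECT actor, repository_name, COUNT(*) AS count
     FROM [githubarchive:github.timeline]
     WHERE ABS(HASH(actor) % 3) = 1
     GROUP EACH BY actor, repository_name),
    -- shard 2
    (SELECT actor, repository_name, COUNT(*) AS count
     FROM [githubarchive:github.timeline]
     WHERE ABS(HASH(actor) % 3) = 2
     GROUP EACH BY actor, repository_name)
)
WHERE rank = 1
```

Each shard aggregates a disjoint third of the actors, so the outer ROW_NUMBER() still sees every (actor, repository_name) pair exactly once.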
If you are willing to share your dataset with me, I could provide dataset-specific advice (a lot depends on cardinality).