Problem description
We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?
SELECT sum(score)
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
WHERE creation_date > "2019-01-01"
Takes 1 second, and processes 270.7MB.
The same query, but over the partitioned table:
SELECT sum(score)
FROM `temp.questions_partitioned`
WHERE creation_date > "2019-01-01"
Takes 2 seconds and processes 14.3 MB.
So we see a benefit in MBs processed, but the query is slower.
What's a good strategy to decide when to partition?
(from an email I received today)
Accepted answer
When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file - opening 365 files might be slower than having one huge file.
In this case, the table used for the benchmark has 1.6 GB of data for 2019 (through June). That's 1.6 GB / 180 ≈ 9 MB of data for each daily partition.
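To check per-partition sizes for your own tables, BigQuery exposes them through INFORMATION_SCHEMA - a sketch, assuming the temp.questions_partitioned table created in the benchmark below:

-- List each daily partition with its row count and logical size in MB
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 2), 1) AS logical_mb
FROM `temp.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'questions_partitioned'
ORDER BY partition_id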
For such a low amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead, as sketched below.
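When this answer was written, yearly partitioning required a workaround, but BigQuery has since added native yearly time-unit partitioning. A minimal sketch, assuming creation_date is a TIMESTAMP column (temp.questions_by_year is a hypothetical table name):

-- One partition per year instead of per day
CREATE TABLE `temp.questions_by_year`
PARTITION BY TIMESTAMP_TRUNC(creation_date, YEAR)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`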
Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.
If you want to run your own benchmarks, do this:
CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
vs no partitions, just clustering by date:
CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date
CLUSTER BY creation_date
AS
SELECT *, DATE('2000-01-01') fake_date
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
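The fake_date column above is a workaround: when this answer was written, BigQuery only allowed clustering on partitioned tables. That restriction has since been lifted, so today the same experiment could skip the fake partition - a sketch (temp.questions_clustered_only is a hypothetical table name):

-- Clustering without partitioning, now supported by BigQuery
CREATE TABLE `temp.questions_clustered_only`
CLUSTER BY creation_date
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`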
Then my query over the clustered table would be:
SELECT sum(score)
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"
And it took 0.5 seconds, 17 MB processed.
Comparing:
- Raw table: 1 second, 270.7 MB
- Partitioned: 2 seconds, 14.3 MB
- Clustered: 0.5 seconds, 17 MB
We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.
It's also interesting to look at the execution details for each query on these tables:
Slot time consumed:
- Raw table: 10.683 sec
- Partitioned: 7.308 sec
- Clustered: 0.718 sec
As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table, with multiple years of data, reading 17.7M rows. The query over the partitioned table also had to use a lot of slots - but this was because each slot was assigned a smallish daily partition: 153 parallel workers read 0.9M rows. The clustered query instead was able to use a very low number of slots. The data was well organized to be read by 57 parallel workers, reading 1.12M rows.
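If you'd rather pull these numbers with SQL than read them from the UI, slot time is exposed in the INFORMATION_SCHEMA jobs views - a sketch, assuming your jobs ran in the US region within the last day:

-- Slot time and bytes processed for recent jobs in this project
SELECT
  job_id,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC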
See also:
- https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
- How can I improve the amount of data queried with a partitioned+clustered table?
- How clustering works in BigQuery