This article looks at the question "What's the best balance for deciding when to partition a table in BigQuery?" and walks through an answer that may be a useful reference.

Question


We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?

    SELECT sum(score)
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
    WHERE creation_date > "2019-01-01"

Takes 1 second and processes 270.7 MB.

Same, with partitions:

    SELECT sum(score)
    FROM `temp.questions_partitioned`
    WHERE creation_date > "2019-01-01"

Takes 2 seconds and processes 14.3 MB.

So we see a benefit in MBs processed, but the query is slower.

What's a good strategy to decide when to partition?

(from an email I received today)

Solution

When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file: opening 365 small files can be slower than opening one big one.

In this case, the table used for the benchmark has 1.6 GB of data for 2019 (through June). That's 1.6 GB / 180 ≈ 9 MB of data per daily partition.
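
If you want to check per-partition sizes for your own tables, a sketch like the following works (assuming a BigQuery release where the INFORMATION_SCHEMA.PARTITIONS view is available; `temp` and `questions_partitioned` are the names from this benchmark):

    -- Per-partition row counts and logical sizes for one table
    SELECT partition_id,
           total_rows,
           ROUND(total_logical_bytes / 1e6, 1) AS mb_logical
    FROM `temp.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = 'questions_partitioned'
    ORDER BY partition_id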

For such a small amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead, as in the sketch below.
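
As a sketch of the yearly approach (assuming a BigQuery release that supports yearly time-unit partitioning via TIMESTAMP_TRUNC, and that creation_date is a TIMESTAMP, as the DATE(creation_date) used below suggests; `temp.questions_yearly` is a hypothetical table name):

    -- Hypothetical table: one partition per year instead of per day
    CREATE TABLE `temp.questions_yearly`
    PARTITION BY TIMESTAMP_TRUNC(creation_date, YEAR)
    AS
    SELECT *
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`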

Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.

If you want to run your own benchmarks, do this:

    -- Day-partitioned copy of the questions table
    CREATE TABLE `temp.questions_partitioned`
    PARTITION BY DATE(creation_date)
    AS
    SELECT *
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`

vs no partitions, just clustering by date:

    -- Single fake partition, so clustering alone organizes the data by date
    CREATE TABLE `temp.questions_clustered`
    PARTITION BY fake_date
    CLUSTER BY creation_date
    AS
    SELECT *, DATE('2000-01-01') fake_date
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`

Then my query over the clustered table would be:

    SELECT sum(score)
    FROM `temp.questions_clustered`
    WHERE creation_date > "2019-01-01"

And it took 0.5 seconds, 17 MB processed.

Comparing the three:

• Raw table: 1 sec, 270.7 MB
• Partitioned: 2 sec, 14.3 MB
• Clustered: 0.5 sec, 17 MB

We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.

It's also interesting to look at the execution details for each query on these tables:

Slot time consumed:

• Raw table: 10.683 sec
• Partitioned: 7.308 sec
• Clustered: 0.718 sec

As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table, spanning multiple years of data and reading 17.7M rows. The query over the partitioned table also had to use a lot of slots, but only because each slot was assigned a smallish daily partition; the read used 153 parallel workers over 0.9M rows. The clustered query, by contrast, needed very few slots: the data was well organized enough to be read by just 57 parallel workers over 1.12M rows.
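
If you want to pull these execution details programmatically instead of reading them from the console, a sketch along these lines works (assuming a BigQuery release with the region-qualified INFORMATION_SCHEMA.JOBS_BY_PROJECT view; swap `region-us` for your own location):

    -- Slot time and bytes processed for this project's recent jobs
    SELECT job_id,
           total_slot_ms / 1000 AS slot_seconds,
           total_bytes_processed / 1e6 AS mb_processed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY creation_time DESC
    LIMIT 10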


That concludes this article on the best balance for deciding when to partition a table in BigQuery. We hope the answer above is helpful.
