This article walks through how Spark predicate pushdown, filtering, and partition pruning behave when reading from Azure Data Lake; it may be a useful reference for anyone facing the same question.

Problem Description

I have been reading about Spark predicate pushdown and partition pruning to understand the amount of data that actually gets read, and I have the following doubts related to this.

Suppose I have a dataset with columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String), where the data on disk is partitioned by Year and SchoolName and stored in parquet format in, say, Azure Data Lake Storage.
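
For concreteness, a layout like that would typically come from a partitioned write. The following PySpark sketch is only illustrative; the abfss:// path, storage account, container name, and sample rows are all hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path; container, account, and folder are placeholders.
base_path = "abfss://container@account.dfs.core.windows.net/parquet-folder"

df = spark.createDataFrame(
    [(2019, "XYZ", 43, "Math"), (2018, "ABC", 57, "Physics")],
    ["Year", "SchoolName", "StudentId", "SubjectEnrolled"],
)

# partitionBy creates a Year=.../SchoolName=... directory layout on disk,
# which is what makes partition pruning possible on later reads.
df.write.partitionBy("Year", "SchoolName").parquet(base_path)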

1) If I issue a read spark.read(container).filter(Year=2019, SchoolName="XYZ"):

  • Will partition pruning kick in, so that only a limited number of partitions is read?
  • Or will there be I/O against blob storage, with the data loaded into the Spark cluster and then filtered, i.e. will I have to pay Azure for the I/O on all the other data we don't need?
  • If not, how does the Azure blob file system understand these filters, given that it is not queryable by default?

2) If I issue a read spark.read(container).filter(StudentId = 43):

  • Will Spark still push the filter down to disk and read only the data it needs? Since I haven't partitioned by this column, will it have to inspect every row and filter according to the query?
  • Will I also have to pay Azure for the I/O on all the files the query doesn't need?

Recommended Answer

1) When you use filters on the columns you partitioned on, Spark will skip those files completely and it won't cost you any IO. If you look at your file structure, it's stored as something like:

parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
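
As a minimal PySpark sketch of the read from question 1 (the question's spark.read(container) is pseudocode, and the abfss:// path below is the same hypothetical one as above), inspecting the physical plan is one way to confirm that only the matching directories are scanned:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "abfss://container@account.dfs.core.windows.net/parquet-folder"  # hypothetical

df = spark.read.parquet(base_path)
pruned = df.filter((df.Year == 2019) & (df.SchoolName == "XYZ"))

# The physical plan should list Year and SchoolName under PartitionFilters:
# Spark resolves the matching Year=2019/SchoolName=XYZ directories from the
# file listing alone and never opens parquet files in any other partition.
pruned.explain(True)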

2) When you filter on some column that isn't in your partition, Spark will scan every part file in every folder of that parquet table. Only with pushdown filtering will Spark use the footer of every part file (where min, max and count statistics are stored) to determine whether your search value falls within that range. If it does, Spark reads the file fully. If it does not, Spark skips the whole file, so at the very least it does not cost you a full read.
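
A similarly hedged sketch for question 2: filtering on StudentId, which is not a partition column, with the same hypothetical path. The point is that the predicate shows up as a pushed filter, so the parquet reader can use the footer statistics to decide what it actually has to read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "abfss://container@account.dfs.core.windows.net/parquet-folder"  # hypothetical

students = spark.read.parquet(base_path).filter("StudentId = 43")

# The scan node should report something like
# PushedFilters: [IsNotNull(StudentId), EqualTo(StudentId,43)].
# Every partition directory is still listed, but a part file (or a row group
# inside it) whose min/max footer statistics exclude 43 can be skipped
# without reading its data pages.
students.explain(True)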

That concludes this article on Spark predicate pushdown, filtering, and partition pruning for Azure Data Lake; hopefully the recommended answer is helpful.
