Problem description
I have been reading about Spark predicate pushdown and partition pruning to understand how much data is actually read, and I have the following questions.
Suppose I have a dataset with columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String). The data on disk is partitioned by Year and SchoolName and stored in Parquet format in, say, Azure Data Lake Storage.
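1) If I issue a read such as spark.read.parquet(container).filter("Year = 2019 AND SchoolName = 'XYZ'"):
- Will partition pruning take effect so that only a limited number of partitions are read?
- Will there be I/O on the blob storage, with the data loaded into the Spark cluster and only then filtered? That is, will I have to pay Azure for the I/O of all the other data I don't need?
- If not, how does the Azure blob file system understand these filters, given that it is not queryable by default?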
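2) If I issue a read such as spark.read.parquet(container).filter("StudentId = 43"):
- Will Spark still push the filter down to disk and read only the required data? Since I did not partition by this column, will it have to inspect every row and filter according to the query?
- Will I again have to pay for the I/O of all the files the query does not need?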
Recommended answer
1) When you filter on the columns you partitioned by, Spark skips the files in the non-matching partitions completely, so reading them costs you no I/O. The partition values are encoded in the directory names, which is why Spark can decide from the paths alone. If you look at your file structure, it is stored as something like:
parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
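
To make the idea concrete, here is a minimal plain-Python sketch of what pruning by directory name amounts to. It is only an illustration of the principle, not Spark's actual implementation; the helper names `partition_values` and `prune`, and the extra `SchoolName=ABC` and `Year=2018` paths, are made up for the example.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Collect the key=value directory segments of a partitioned path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the files whose partition directories match every wanted value."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical listing: the first two paths follow the layout above,
# the last two are invented to show what gets skipped.
files = [
    "parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet",
    "parquet-folder/Year=2019/SchoolName=ABC/part1.parquet",
    "parquet-folder/Year=2018/SchoolName=XYZ/part1.parquet",
]

# Equivalent of filtering on Year = 2019 AND SchoolName = 'XYZ'.
selected = prune(files, {"Year": "2019", "SchoolName": "XYZ"})
print(selected)
```

The storage layer never has to understand the filter: the decision is made on the client side from the paths, and only the selected files are ever opened.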
2) When you filter on a column that is not one of your partition columns, Spark has to consider every part file in every folder of that Parquet table. Only if filter pushdown is in effect will Spark use the footer of each part file (where min, max, and count statistics are stored) to determine whether your search value falls within that file's range. If it does, Spark reads the file fully. If it does not, Spark skips the whole file, so at least you do not pay for a full read of it.
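
The footer check described above can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not Spark or Parquet code: the `FooterStats` class, the `may_contain` helper, and all min/max values are hypothetical and only stand in for the statistics a real footer would carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FooterStats:
    """Hypothetical stand-in for the StudentId statistics in a part file's footer."""
    path: str
    min_id: int
    max_id: int


def may_contain(stats: FooterStats, value: int) -> bool:
    """True if the value lies inside the file's [min, max] range."""
    return stats.min_id <= value <= stats.max_id


# Invented statistics for three part files.
footers: List[FooterStats] = [
    FooterStats("Year=2019/SchoolName=XYZ/part1.parquet", 1, 40),
    FooterStats("Year=2019/SchoolName=XYZ/part2.parquet", 41, 90),
    FooterStats("Year=2018/SchoolName=XYZ/part1.parquet", 10, 500),
]

# Equivalent of filtering on StudentId = 43: every footer is inspected,
# but only files whose range covers 43 are read in full.
to_read = [f.path for f in footers if may_contain(f, 43)]
print(to_read)
```

Note that the range check can only rule files out. The third file is read even though it may not contain StudentId 43 at all, because 43 falls between its min and max.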