Question
I write some data in the Parquet format using Spark SQL, where the resulting schema looks like the following:
root
|-- stateLevel: struct (nullable = true)
| |-- count1: integer (nullable = false)
| |-- count2: integer (nullable = false)
| |-- count3: integer (nullable = false)
| |-- count4: integer (nullable = false)
| |-- count5: integer (nullable = false)
|-- countryLevel: struct (nullable = true)
| |-- count1: integer (nullable = false)
| |-- count2: integer (nullable = false)
| |-- count3: integer (nullable = false)
| |-- count4: integer (nullable = false)
| |-- count5: integer (nullable = false)
|-- global: struct (nullable = true)
| |-- count1: integer (nullable = false)
| |-- count2: integer (nullable = false)
| |-- count3: integer (nullable = false)
| |-- count4: integer (nullable = false)
| |-- count5: integer (nullable = false)
I can also transform the same data into a flatter schema that looks like this:
root
|-- stateLevelCount1: integer (nullable = false)
|-- stateLevelCount2: integer (nullable = false)
|-- stateLevelCount3: integer (nullable = false)
|-- stateLevelCount4: integer (nullable = false)
|-- stateLevelCount5: integer (nullable = false)
|-- countryLevelCount1: integer (nullable = false)
|-- countryLevelCount2: integer (nullable = false)
|-- countryLevelCount3: integer (nullable = false)
|-- countryLevelCount4: integer (nullable = false)
|-- countryLevelCount5: integer (nullable = false)
|-- globalCount1: integer (nullable = false)
|-- globalCount2: integer (nullable = false)
|-- globalCount3: integer (nullable = false)
|-- globalCount4: integer (nullable = false)
|-- globalCount5: integer (nullable = false)
Now, when I run a query against the first data set on a column like global.count1, it takes a lot longer than querying globalCount1 in the second data set. Conversely, writing the first data set to Parquet takes much less time than writing the second. I know that my data is stored in a columnar fashion because of Parquet, but I was under the impression that each nested column would also be stored separately. In the first data set, for instance, it seems that the whole 'global' column is stored together, as opposed to the 'global.count1', 'global.count2', etc. values being stored together. Is this expected behavior?
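To make the storage question concrete, here is a minimal plain-Python sketch (no Spark or Parquet involved; `leaf_columns` is an illustrative helper, not a real API) of how Parquet "shreds" nested records: every leaf field becomes its own column, addressed by its dotted path. Under this model the nested and the flat schema both yield fifteen independently stored columns; only the column paths differ.

```python
def leaf_columns(record, prefix=""):
    """Flatten a nested record into {column_path: value} leaf entries,
    the way Parquet addresses columns by their dotted path."""
    cols = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Structs are not stored as a unit; recurse down to leaves.
            cols.update(leaf_columns(value, path))
        else:
            cols[path] = value
    return cols

nested = {
    "stateLevel":   {"count1": 1, "count2": 2},
    "countryLevel": {"count1": 3, "count2": 4},
    "global":       {"count1": 5, "count2": 6},
}

print(sorted(leaf_columns(nested)))
# Six leaf columns: 'countryLevel.count1', ..., 'global.count1', 'global.count2'
```

So at the storage layer both layouts are columnar; any difference in query time comes from the read path and the query planner, not from the 'global' struct being stored as one blob.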
Answer
Interesting. "It takes a lot longer than querying..." Can you please share how much longer? Thanks.
Looking at the code (https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/RecordReaderImplementation.java#L248), it seems that reading from structures might have some overhead. Judging from the Parquet code alone, though, it shouldn't be "a lot longer".
I think the bigger problem is how Spark can push down predicates in such cases. For example, it may not be able to use bloom filters. Can you please share how you query the data in both cases, along with timings? Which versions of Spark, Parquet, Hadoop, etc.?
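The kind of predicate pushdown being alluded to can be sketched in plain Python (the names and structures below are illustrative, not the actual Parquet or Spark API): Parquet keeps min/max statistics per column chunk within each row group, and a reader that can push the predicate down skips every row group whose statistics rule out a match.

```python
# Two hypothetical row groups with per-column min/max statistics.
row_groups = [
    {"globalCount1": {"min": 0,   "max": 99,  "values": list(range(0, 100))}},
    {"globalCount1": {"min": 100, "max": 199, "values": list(range(100, 200))}},
]

def scan_greater_than(groups, column, threshold):
    """Evaluate `column > threshold`, reading only row groups whose
    max statistic shows they can contain a match."""
    matched = []
    for group in groups:
        stats = group[column]
        if stats["max"] > threshold:  # pushdown: the whole group is skipped otherwise
            matched.extend(v for v in stats["values"] if v > threshold)
    return matched

print(len(scan_greater_than(row_groups, "globalCount1", 150)))
```

If Spark fails to push a predicate on a nested path like global.count1 down to the Parquet reader, every row group gets decoded, which alone could explain a large gap between the two layouts.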
Parquet 1.5 had an issue (https://issues.apache.org/jira/browse/PARQUET-61) that in some such cases could cause a 4-5x slowdown.