Spark: is it better to select from Hive or from the files?


Problem description

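I was wondering what people think about reading from Hive versus reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying files themselves, and why?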

Mike

Solution

tl;dr: I would read it straight from the Parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a table of 5 million rows by 100 columns, some timings I recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s
dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s
dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s
dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
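The answer does not say how these timings were taken. As a minimal sketch of one way to collect numbers like them, the block below defines a small wall-clock timer. The object name `Timing`, the helper `time`, and the stand-in workload are all hypothetical and not from the original post.

```scala
object Timing {
  // Hypothetical helper (not from the original answer): runs `block` once
  // and returns its result together with the elapsed wall-clock seconds.
  def time[A](label: String)(block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block // evaluates the by-name argument exactly once
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label --> $seconds%.2fs")
    (result, seconds)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload so the sketch runs without a Spark cluster.
    val (total, _) = time("sum of 1 to 1000000")((1L to 1000000L).sum)
    println(total)
  }
}
```

With Spark, the timed block would need to be an action such as `dffile.count()` or `dfhive.count()`, because transformations are lazy and return without reading any data. A single run also includes planning and any cold-cache effects, so repeated runs give a fairer comparison.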

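Note that these were measured with older versions of Hive and Spark, so I can't say how the gap between the two reading mechanisms may have changed in newer releases.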

