Question
I was just wondering what people's thoughts were on reading from Hive vs. reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?
Mike
Recommended Answer
tl;dr: I would read it straight from the parquet files.
I am using Spark 1.5.2 and Hive 1.2.1. For a 5 million row × 100 column table, some timings I've recorded are:
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")
dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s
dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s
dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
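For context, here is a minimal sketch of how timings like these could be collected, reusing dffile and dfhive from above. The original answer doesn't include its benchmark code, so the timed helper and the column name "col" here are hypothetical placeholders:

import org.apache.spark.sql.functions._

// Hypothetical helper: run a block, force execution, and print wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label --> ${(System.nanoTime() - start) / 1e9}%.2fs")
  result
}

// Each action forces Spark to actually read and process the data;
// "col" stands in for whichever column was aggregated or filtered.
timed("dffile sum(col)")         { dffile.agg(sum("col")).collect() }
timed("dfhive sum(col)")         { dfhive.agg(sum("col")).collect() }
timed("dffile substring(col)")   { dffile.select(substring(col("col"), 0, 10)).collect() }
timed("dfhive substring(col)")   { dfhive.select(substring(col("col"), 0, 10)).collect() }
timed("dffile where(col=value)") { dffile.where(col("col") === "value").collect() }
timed("dfhive where(col=value)") { dfhive.where(col("col") === "value").collect() }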
Note that these timings were taken with an older version of Hive and an older version of Spark, so I can't comment on how much the speed of the two reading mechanisms may have improved since then.