Problem Description
What are the differences between Apache Spark SQLContext and HiveContext?
Some sources say that since HiveContext is a superset of SQLContext, developers should always use HiveContext, which has more features than SQLContext. But the current APIs of the two contexts are mostly the same.
- In which scenarios is SQLContext or HiveContext more useful?
- Is HiveContext only more useful when working with Hive?
- Or is SQLContext all that is needed to implement a Big Data application with Apache Spark?
Solution
Spark 2.0+
Spark 2.0 provides native window functions (SPARK-8641) and features some additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality. Because of that, HiveContext (SparkSession with Hive support) seems to be slightly less important.
Spark < 2.0
Obviously, if you want to work with Hive you have to use HiveContext. Beyond that, the biggest differences as of now (Spark 1.5) are support for window functions and the ability to access Hive UDFs.
Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is really nothing Spark-specific.
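Since window functions are standard SQL rather than anything Spark-specific, the same kind of query can be illustrated with any engine that supports them. The sketch below uses SQLite (3.25+, bundled with recent Python) purely for convenience; the table and column names are made up for the example, and Spark SQL accepts an equivalent query:

```python
import sqlite3

# Standard ANSI SQL window function: rank rows within each partition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (dept TEXT, amount INTEGER);
INSERT INTO sales VALUES ('a', 10), ('a', 30), ('b', 20);
""")
rows = conn.execute("""
SELECT dept, amount,
       RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rnk
FROM sales
ORDER BY dept, rnk
""").fetchall()
print(rows)  # [('a', 30, 1), ('a', 10, 2), ('b', 20, 1)]
conn.close()
```

In Spark 1.x the same `RANK() OVER (PARTITION BY ... ORDER BY ...)` syntax is what requires HiveContext; in 2.0+ it is handled natively.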
Regarding Hive UDFs, it is not a serious issue now, but before Spark 1.5 many SQL functions were expressed using Hive UDFs and required HiveContext to work.
HiveContext also provides a more robust SQL parser. See, for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement
Finally, HiveContext is required to start the Thrift server.
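For reference, the Thrift JDBC/ODBC server ships with the Spark distribution and is launched from its `sbin` directory (paths below assume you are in the root of a standard Spark download; the master URL is just an example):

```shell
# Start the HiveServer2-compatible Thrift server; this requires a
# Spark build with Hive support.
./sbin/start-thriftserver.sh --master local[2]

# Connect with the bundled beeline client:
./bin/beeline -u jdbc:hive2://localhost:10000
```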
The biggest problem with HiveContext is that it comes with large dependencies.