Question
I just got access to Spark 2.0; up until this point I have been using Spark 1.6.1. Can someone please help me set up a SparkSession using pyspark (Python)? I know the Scala examples available online are similar (here), but I was hoping for a direct walkthrough in Python.
My specific case: I am loading avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various pyspark & SQL queries against them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()
I am able to use
mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...
and build dataframes with no issues. But once I start querying the dataframes/temp tables, I keep getting the "java.lang.NullPointerException" error. I think that is indicative of a translational error (e.g. old queries worked in 1.6.1 but need to be tweaked for 2.0). The error occurs regardless of query type. So I am assuming
1.) the sqlContext alias is a bad idea
and
2.) I need to properly set up a sparkSession.
So if someone could show me how this is done, or perhaps explain the discrepancies they know of between the different versions of spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question. I apologize if it is convoluted.
Answer
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
Now, to import a .csv file, you can use
df = spark.read.csv('filename.csv', header=True)
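For the specific case described in the question (reading avro from S3 and running SQL queries against temp tables), a minimal sketch could look like the following. The S3 path, view name, and query are placeholders, and it assumes the com.databricks.spark.avro package mentioned in the question is on the classpath:

from pyspark.sql import SparkSession

# Build a SparkSession with Hive support, replacing the old sqlContext alias
spark = SparkSession.builder.appName('avro-example').enableHiveSupport().getOrCreate()

# Load the avro data (placeholder S3 path)
mydata = spark.read.format("com.databricks.spark.avro").load("s3://bucket/path")

# Register a temporary view and query it through the session
mydata.createOrReplaceTempView("mydata")
spark.sql("SELECT COUNT(*) FROM mydata").show()

In Spark 2.0 the SparkSession subsumes the old SQLContext/HiveContext, so once the session is set up this way, queries issued through spark.sql should behave like the old sqlContext.sql calls.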