Jupyter + EMR + Spark: connecting to an EMR cluster from a Jupyter notebook on a local machine

Question

I am new to PySpark and EMR.
I am trying to access Spark running on an EMR cluster from a Jupyter notebook, but I am running into errors.

I am creating the SparkSession with the following code:

from pyspark.sql import SparkSession

# Run Spark locally, using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Carbon - SingleWell parallelization on Spark") \
    .getOrCreate()

I tried the following to access the remote cluster, but it failed:

spark = SparkSession.builder \
    .master("spark://<remote-emr-ec2-hostname>:7077")\
    .appName("Carbon - SingleWell parallelization on Spark")\
    .getOrCreate()

Error:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Any help resolving this would be much appreciated.

Recommended answer

EMR can provision Jupyter and JupyterHub on the cluster for you as of EMR release 5.14.0.

It is most likely easier to tune those provisioned services with a few extra bootstrap actions than to wire up a local process to talk to the EMR master node.
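As a rough illustration of that approach, here is a minimal sketch that builds a request for boto3's EMR `run_job_flow` call with Spark and JupyterHub installed and an optional bootstrap action. The helper names (`build_emr_request`, `launch_cluster`), the instance types and count, the default IAM role names, the region, and the S3 script path are all assumptions made for illustration; none of them come from the question.

```python
from typing import Any, Dict, Optional


def build_emr_request(
    name: str,
    release_label: str = "emr-5.14.0",
    bootstrap_script_s3_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    Instance types, instance count and role names below are illustrative
    assumptions; adjust them to your own account and workload.
    """
    request: Dict[str, Any] = {
        "Name": name,
        "ReleaseLabel": release_label,
        # Ask EMR to install Spark and JupyterHub on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
        "Instances": {
            "MasterInstanceType": "m4.xlarge",  # assumption
            "SlaveInstanceType": "m4.xlarge",   # assumption
            "InstanceCount": 3,                 # assumption
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumption: default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }
    if bootstrap_script_s3_path:
        # Optional bootstrap action, e.g. a script that installs extra
        # Python packages for the notebooks (hypothetical script).
        request["BootstrapActions"] = [
            {
                "Name": "tune-jupyter",
                "ScriptBootstrapAction": {
                    "Path": bootstrap_script_s3_path,
                    "Args": [],
                },
            }
        ]
    return request


def launch_cluster(request: Dict[str, Any], region_name: str = "us-east-1") -> str:
    """Submit the request to EMR and return the new cluster (job flow) id."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]
```

Calling `launch_cluster(build_emr_request("my-cluster", bootstrap_script_s3_path="s3://<your-bucket>/setup.sh"))` would require valid AWS credentials and an existing script at that path. The notebook then runs on the cluster itself, so no `spark://...:7077` master URL is needed from the local machine.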
