Configuring Spark to work with Jupyter Notebook and Anaconda


Question

I've spent a few days trying to get Spark working with my Jupyter Notebook and Anaconda. Here is what my .bash_profile looks like:

PATH="/my/path/to/anaconda3/bin:$PATH"

export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"

export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"

When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, Spark launches fine in my command-line shell, and sc is not empty. That part seems to work.

When I type pyspark, it launches my Jupyter Notebook fine. But when I create a new Python3 notebook, this error appears:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:

And sc in my Jupyter Notebook is empty.

Can anyone help me solve this?

Just to clarify: there is nothing after the colon at the end of the error. I also tried creating my own start-up file based on this post, which I quote here so you don't have to look it up:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)

I placed it in the ~/.ipython/profile_default/startup/ directory.

When I did this, the error became:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:

Recommended answer

Conda can help correctly manage a lot of dependencies...

Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
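
Before launching anything, it can help to confirm from Python that SPARK_HOME is visible and points at something that looks like a Spark installation. The following is only a minimal sketch and not part of the original answer: the helper name check_spark_home is hypothetical, and the two files it looks for (bin/spark-shell and python/pyspark/shell.py) are simply the paths mentioned in the question, assumed here to exist in any standard Spark distribution.

```python
import os


def check_spark_home(env=None):
    """Return a list of problems found with SPARK_HOME (empty if none).

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    if not os.path.isdir(spark_home):
        return ["SPARK_HOME is not a directory: " + spark_home]

    # Assumption: these paths exist in a standard Spark distribution.
    expected = [
        os.path.join("bin", "spark-shell"),
        os.path.join("python", "pyspark", "shell.py"),
    ]
    problems = []
    for relative_path in expected:
        if not os.path.exists(os.path.join(spark_home, relative_path)):
            problems.append("missing " + relative_path)
    return problems


if __name__ == "__main__":
    for line in check_spark_home() or ["SPARK_HOME looks fine"]:
        print(line)
```

If this reports that SPARK_HOME is not set, the shell that started Python probably has not re-read ~/.bashrc yet; opening a new terminal is usually enough.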

Create a conda environment with all the needed dependencies apart from Spark:

conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0

Activate the environment:

$ source activate findspark-jupyter-openjdk8-py3

Launch a Jupyter Notebook server:

$ jupyter notebook

In your browser, create a new Python3 notebook.
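
As a first cell in the new notebook, a check along these lines can confirm that the kernel is running the interpreter from the conda environment rather than some other Python on the machine. This is a sketch added for illustration, not part of the original answer. The function name describe_interpreter is hypothetical, and it assumes conda exports the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import os
import sys


def describe_interpreter():
    """Collect basic facts about the Python running this kernel."""
    return {
        # Full path of the interpreter; expected to live inside the conda env.
        "executable": sys.executable,
        # Major.minor version, e.g. the 3.5 requested when creating the env.
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Assumption: conda sets this variable on activation.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
    }


for key, value in describe_interpreter().items():
    print(key, "=", value)
```

If the executable path points outside the environment, the notebook server was most likely started before the environment was activated.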

Try calculating pi with the following script (borrowed from this):

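import findspark
findspark.init()  # locate Spark via SPARK_HOME and make pyspark importable

import random

import pyspark

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square; True if it falls inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Count, in parallel, how many samples fall inside the quarter circle.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples  # the hit ratio approximates pi / 4
print(pi)
sc.stop()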
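
To sanity-check the logic of the script independently of Spark, the same Monte Carlo estimate can be run in plain Python with a much smaller sample. This is a sketch for comparison only. The function name estimate_pi and the seed parameter are hypothetical additions, and it runs on a single core, so it says nothing about whether the Spark setup itself works.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable runs
    count = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle of radius 1.
        if x * x + y * y < 1:
            count += 1
    # The quarter circle covers pi / 4 of the square, so scale the ratio by 4.
    return 4 * count / num_samples


print(estimate_pi(100000, seed=0))
```

If the Spark version prints a value close to this one, the SparkContext is distributing and counting the samples as expected.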

