Problem Description
I am running a PySpark script on AWS EC2. It runs very well in a Jupyter notebook, but when I run it in an IPython shell, it gives an import error. It looks so weird! Can anybody help, please? Here's a snippet of the code:
from __future__ import division
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import lower, col, trim, udf, struct, isnan, when
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, FloatType, ArrayType, Row)
from pyspark.sql.functions import lit
import gc
import time
import pandas as pd
from collections import defaultdict
import numpy as np
sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'xyz')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'pqr')
spark = SparkSession.builder.master("local").appName("Users").getOrCreate()
users = pd.read_pickle(candidate_users_path)
sqlCtx = SQLContext(sc)
users = sqlCtx.createDataFrame(users)
users.count()
It gives the error at the import statement (2nd line). The funny part is that it runs beautifully in a Jupyter notebook launched from the same location. And the same import statement works if I just execute it directly in IPython. In my understanding, this EC2 instance acts as both worker and master, so how can the module not be available in the worker?

Py4JJavaError: An error occurred while calling o57.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
  ImportError: cannot import name 'SparkContext'
PYTHONPATH was:
  /home/ubuntu/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip:/home/ubuntu/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/ubuntu/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
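A quick way to compare the interpreter the IPython session runs on with the one Spark will hand to its Python workers (a diagnostic sketch, not part of the original snippet) is:

import os
import sys

# Interpreter running the driver (the IPython session itself)
print("driver interpreter:", sys.executable)
# Interpreter Spark launches for its Python workers; falls back to plain
# "python" on the PATH when the variable is unset
print("PYSPARK_PYTHON    :", os.environ.get("PYSPARK_PYTHON", "python"))

If the two disagree (for example, the shell runs python3 while the workers fall back to an older system python), that is the kind of mismatch the answer below addresses.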
Recommended Answer
I found that the issue was that Spark was using an older version of Python. I added the line below to my bashrc:

alias python=python3
Other lines in my bashrc include:

export SPARK_HOME="/home/ubuntu/spark-2.4.3-bin-hadoop2.7"
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
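If editing bashrc is inconvenient, the worker-side interpreter can, as far as I know, also be pointed at python3 from the script itself, as long as it happens before the SparkContext is created; a minimal sketch, assuming python3 lives at /usr/bin/python3:

import os

# Must run before the SparkContext is constructed, because PySpark reads
# PYSPARK_PYTHON at that point to decide which interpreter the workers use.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

from pyspark import SparkContext
sc = SparkContext(appName="Connect Spark with Redshift")

PYSPARK_DRIVER_PYTHON, by contrast, is only consulted by the pyspark/spark-submit launch scripts, so changing it from inside an already-running script does not affect the current driver; it belongs in bashrc (or spark-env.sh) as shown above.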