Question
I'm trying to plot a Spark dataset using matplotlib after converting it to a pandas DataFrame in AWS EMR JupyterHub.
I'm able to plot in a single cell using matplotlib like below:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)
Now, the above snippet works perfectly fine for me.
After this sample worked, I moved on to plotting my pandas DataFrame from new/multiple cells in AWS EMR JupyterHub like this:
-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()
-Cell 2-
from pyspark.sql.functions import *
sparkDS_groupBy=sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF=sparkDS_groupBy.toPandas()
-Cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(pandasDF)
My code just fails in cell 3 with the following error:
NameError: name 'pandasDF' is not defined
Does anyone know what's going wrong?
Why is the new cell in my JupyterHub notebook not able to recognize a variable from the previous cell?
Does it have something to do with the '%matplotlib inline' magic command (I also tried '%matplotlib notebook', but that failed too)?
PS: I'm using an AWS EMR 5.19 JupyterHub notebook setup for my plotting work.
This error is somewhat similar to, but not a duplicate of, How to make matplotlib work in an AWS EMR Jupyter notebook?
Answer
You'll want to look into the %%spark -o df_name and %%local magics; you can see them by typing %%help in a cell.
Specifically, in your case try:
- Use %%spark -o sparkDS_groupBy at the start of -Cell 2-,
- Start -Cell 3- with %%local,
- And plot sparkDS_groupBy in -Cell 3- instead of pandasDF (see the sketch below).
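Applied to the cells from the question, a minimal sketch could look like the following (this assumes sparkmagic's documented behavior, where %%spark -o mirrors the named Spark DataFrame into the local Python kernel as a pandas DataFrame, so the explicit toPandas() call is no longer needed):
-Cell 2-
%%spark -o sparkDS_groupBy
from pyspark.sql.functions import count
# runs on the cluster; -o copies sparkDS_groupBy to the local kernel as a pandas DataFrame
sparkDS_groupBy = sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
-Cell 3-
%%local
%matplotlib inline
import matplotlib.pyplot as plt
# runs in the local kernel, where sparkDS_groupBy is now a pandas.DataFrame
plt.plot(sparkDS_groupBy['count'])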
For those with less context: you can get charts by doing the following in an EMR Notebook using the PySpark kernel, attached to an EMR cluster of at least release 5.26.0 (which introduced notebook-scoped libraries).
(each code block represents a Cell)
%%help
%%configure -f
{ "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
}}
sc.install_pypi_package("matplotlib")
%%spark -o my_df
# in this cell, my_df is a pyspark.sql.DataFrame
my_df = spark.read.text("s3://.../...")
%%local
%matplotlib inline
import matplotlib.pyplot as plt
# in this cell, my_df is a pandas.DataFrame
plt.plot(my_df)
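Back in the question's setup, an explicit x/y plot of the grouped counts is usually clearer than passing the whole DataFrame to plt.plot. A sketch, assuming the col1/count columns built in -Cell 2- have been mirrored locally via %%spark -o sparkDS_groupBy:
%%local
%matplotlib inline
import matplotlib.pyplot as plt
# bar chart of count per col1 value, using the locally mirrored pandas DataFrame
plt.bar(sparkDS_groupBy['col1'].astype(str), sparkDS_groupBy['count'])
plt.xlabel('col1')
plt.ylabel('count')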