Problem Description
I have just installed PySpark 2.4.5 on my Ubuntu 18.04 laptop, and when I run the following code,
# This is only part of the code.
import os
from glob import glob

import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row

medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.flatMap(
    lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
               for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get this error:
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similar questions on StackOverflow, but none of them solve my problem. Can anyone help me? Thanks a lot.
By the way, I am new to Spark. If I just want to use Spark in Python, is it enough to install PySpark with pip install pyspark? Is there anything else I should do? Do I need to install Hadoop or something else?
Recommended Answer
Just create the Spark session at the start:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()